Major parsing time regression in 0.34.x --terragrunt-iam-role update #1971
Comments
Hello,
Thanks @denis256!
Hi @denis256 & @yorinasub17. Also, I would be glad to support you in any way I can, of course!
Hi, I will try to submit a fix this week |
Hi @maunzCache,
did the test include v0.35.18? In that version, an issue which caused multiple auto-inits was fixed.
It would be helpful to have example code to get better visibility into the issue; so far I have tested with some abstract tests, which didn't show a high time difference.
Thanks for the response @denis256. Here is the same suite with the newer binaries. We only updated some .hcl files to have fewer imports, which is why the time for the 0.29.2 version decreased in comparison with the first post. And well, I don't see the problem as fixed.
Single, small terraform module; one aws account
Single, but bigger terraform module; one aws account
13 terraform modules; one aws account
2/17/13 (32) terraform modules; three aws accounts
Edit: The problem now is that my colleagues and I updated some of the code, which makes the runs appear faster. Please use the new 0.35.16 time above as the reference instead; in relation to it there is still some time difference. For the example, I'll see what I can come up with. I may need to simplify some customer code or create some bogus setup from the original source, so as not to reveal any critical components while still keeping the bug intact. By now I am not even sure whether our code structure is just somehow messed up in a way that makes the huge time difference appear between the versions.
I'll need some time to finish setting up sample code. While doing so I saw a lot of dependencies between some files which make the parsing time skyrocket in comparison. So maybe the issue is not really related only to the parsing itself, but to the dependency management that has changed over time. At least that is something I noticed while filling up this project with files.
Hi @denis256. Also, creating the dependency graph (see my example scripts in the initial post) will show you different execution times. I ran this repository with the following results:
Running on my-env
The execution times are lower than in my initial post, but this is mainly due to how I minified the setup. Of course, the original files have a lot more variable magic such as for loops and includes. Still, I think the results are interesting even for such a small scope. I am not sure whether a beefier machine will still show similar results, so you may need to stuff some bogus code into those .hcl files to make it slower in general.
Hi, thanks for sharing - will look for the root cause |
Hi @denis256,
Hi @maunzCache, the provided code was helpful to identify that parsing takes more time; however, caching of the IAM role helps.
Before:
After:
Hi again, |
* Added caching of IAM role
* Updated description
* Replaced MD5 checksum calculation with SHA-256
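For illustration only, here is a minimal sketch of what caching a value derived from a config string behind a SHA-256 content checksum could look like. The names (cachedIamRole, extractIamRole) and the parsing logic are placeholders, not the actual terragrunt change:

```go
package main

// Hypothetical sketch of caching a derived value (e.g. an IAM role) keyed by a
// SHA-256 checksum of the config content, in the spirit of "Added caching of
// IAM role" and "Replaced MD5 checksum calculation with SHA-256".
// This is NOT the real terragrunt code.

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
	"sync"
)

var (
	iamRoleCache   = map[string]string{}
	iamRoleCacheMu sync.Mutex
)

// extractIamRole stands in for the expensive partial parse of a terragrunt config.
func extractIamRole(config string) string {
	for _, line := range strings.Split(config, "\n") {
		if strings.HasPrefix(strings.TrimSpace(line), "iam_role") {
			return strings.TrimSpace(strings.SplitN(line, "=", 2)[1])
		}
	}
	return ""
}

// cachedIamRole returns the previously computed result for identical config
// content, keyed by a SHA-256 checksum instead of MD5.
func cachedIamRole(config string) string {
	sum := sha256.Sum256([]byte(config))
	key := hex.EncodeToString(sum[:])

	iamRoleCacheMu.Lock()
	defer iamRoleCacheMu.Unlock()

	if role, ok := iamRoleCache[key]; ok {
		return role
	}
	role := extractIamRole(config)
	iamRoleCache[key] = role
	return role
}

func main() {
	config := "iam_role = \"arn:aws:iam::123456789012:role/example\"\n"
	fmt.Println(cachedIamRole(config)) // computed
	fmt.Println(cachedIamRole(config)) // served from cache
}
```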
Hi everyone,
A colleague of mine and I were trying to update to a newer terragrunt version because we are heavily outdated (currently 0.29.2). Of course, for that you run some tests to confirm that everything still works in general, and while doing so we came across a major impact on our parsing times.
Just to give you a brief idea, here are some measurements using terraform 0.14.11 (also outdated :/ ).
Note: Runtime is the (assumed) parsing time only, not the whole init!
Single, small terraform module; one aws account
Single, but bigger terraform module; one aws account
13 terraform modules; one aws account
2/17/13 (32) terraform modules; three aws accounts
(Edit: I thought it would be nice to gather a few more timings, so I'll just edit them in as they happen.)
For reference, this is my machine:
I am calling terragrunt via a wrapper script, but here is the unbiased call:
terragrunt_${terragrunt_version} run-all init --terragrunt-ignore-external-dependencies --terragrunt-parallelism 24
Edit: This is an updated version of the script, measuring the same thing but running a bit faster.
That is, in general, the script I use for measurement.
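For context, here is a rough, hypothetical sketch of that kind of measurement wrapper (the real one is a shell script and is not reproduced in this issue): a small Go harness that shells out to a versioned terragrunt binary and reports wall-clock time, so process startup is included in the numbers.

```go
package main

// Hypothetical timing harness standing in for the wrapper script used for the
// measurements above. It runs a versioned terragrunt binary and prints the
// elapsed wall-clock time.

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

func main() {
	start := time.Now()

	// Placeholder binary name; in practice there is one binary per version under test.
	cmd := exec.Command(
		"terragrunt_0.34.3",
		"run-all", "init",
		"--terragrunt-ignore-external-dependencies",
		"--terragrunt-parallelism", "24",
	)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr

	if err := cmd.Run(); err != nil {
		fmt.Fprintln(os.Stderr, "terragrunt failed:", err)
	}

	fmt.Printf("elapsed: %s\n", time.Since(start))
}
```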
It is okay that some minor time increases are introduced, but for version 0.34.x we see a huge difference. Even though the table says 0.34.3, we tested it with 0.34.0 as well. It is worth mentioning that 0.32.x also introduced a regression, but in comparison it is a minor one.
Because we can clearly see which version introduced the regression, it was rather easy to find the change which led to this. I am pretty confident that this was introduced in #667 (see the merge commit of the release). The additional call to PartialParseConfigString() in config/config.go really increases the time here.
As I am not that familiar with the code, I cannot say what the actual cause is or what leads to the problem, but I am certain that it is more than just parsing, as it does not just double the time; the increase is much bigger. This is a huge problem for my customer. Right now we are also updating our *.hcl structure to gain some performance in our project, because we are at around 35 minutes of parsing for a production deployment. That means updating to a newer terragrunt version right now would worsen it to ~90 minutes of waiting. Even though we could go into hardware mode and use a beefy machine to plan and apply... well, let's fix it like developers :)
Our general idea to increase performance in terragrunt was to introduce caching, or to be more precise, memoization for the PartialParseConfigString() function. For this I introduced usage of the library https://github.com/kofalt/go-memoize. However, even though parsing showed an improvement, I am now at the point where I do not know how to test for other regressions my change may introduce, because I think re-parsing is crucial for the original --terragrunt-iam-role feature, so caching may have to be done at another point, if it is possible at all.
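As an illustration of that memoization idea, here is a minimal, hypothetical sketch using go-memoize (assuming its NewMemoizer/Memoize API); parsePartialConfig is a stand-in for terragrunt's PartialParseConfigString(), not the real function, and this is not the actual change:

```go
package main

// Minimal sketch of the memoization idea using go-memoize, assuming its
// NewMemoizer/Memoize API. parsePartialConfig is a stand-in for terragrunt's
// PartialParseConfigString(); this is NOT the actual terragrunt code.

import (
	"fmt"
	"time"

	memoize "github.com/kofalt/go-memoize"
)

// parsePartialConfig simulates a slow partial HCL parse.
func parsePartialConfig(config string) (string, error) {
	time.Sleep(200 * time.Millisecond)
	return "parsed: " + config, nil
}

func main() {
	// Entries expire after 5 minutes and are cleaned up every 10 minutes.
	cache := memoize.NewMemoizer(5*time.Minute, 10*time.Minute)

	config := `iam_role = "arn:aws:iam::123456789012:role/example"`

	for i := 0; i < 3; i++ {
		start := time.Now()
		// The config content itself is the cache key; only the first call pays
		// the full parsing cost, later identical calls are served from cache.
		result, err, cached := cache.Memoize(config, func() (interface{}, error) {
			return parsePartialConfig(config)
		})
		fmt.Printf("%v err=%v cached=%v took=%s\n", result, err, cached, time.Since(start))
	}
}
```

Whether keying the cache on the config content like this is actually safe depends on exactly what the --terragrunt-iam-role re-parse relies on, which is the open question here.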
I'd be happy for any suggestions on how to deal with this topic further. I can try to provide some code snippets of our infrastructure as code, or maybe come up with a small reproduction setup for the regression. Still, I wonder how this regression is triggered without using the flag at all...
Edit:
While adding more timings to the tables above, I just noticed a serious impact on execution in the bigger stacks as well. For the 0.35.x versions, the same init that runs on 0.29.x in 6 minutes takes ~1 hour instead. However, I am not sure whether my init is currently throttled by any means, so I'll retry that later or tomorrow for confirmation.