You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
As outlined in #1971 my customer still has issues with the parsing time of the later Terragrunt versions since 0.32. Still there is a requirement for my team to get a new version of Terraform and Terragrunt working ASAP. So @nilreml and I took some time to understand the Terragrunt code better and find a potential solution to our issue.
To get this straight: We are working on a big project containing ~150 AWS accounts, 6 environments and ~75 Terraform modules. The Terragrunt code is sometimes very complex containing multiple dependency layers. We already took the time to optimize and review all of the *.hcl files to reduce includes/read_terragrunt_config calls, calls to other custom functions and simplifying logic of loops. Still we only get minor improvements in our parsing times (time until all dependencies are resolved and Terraform is triggered the first time).
After the merge of #2010 and the 0.36.6 release another colleague of mine did some re-measuring of execution times in our environment:
tg version
init 1
init 2
init 3
plan 1
avg
factor to 0.29.2
0.29.2 parsing
0:01:11
0:00:50
0:01:14
0:01:02
64s
1.0
0.29.2 complete
0:04:11
0:04:32
0:04:13
0:04:14
257s
1.0
0.33.2 parsing
0:01:15
0:01:44
-
-
127s
2.0
0.33.2 complete
0:05:48
0:07:34
-
-
401s
1.6
0.34.3 parsing
0:05:54
0:05:21
-
-
337s
5.3
0.34.3 complete
0:24:59
0:21:02
-
-
1381s
5.4
0.35.20 parsing
0:04:50
0:04:34
-
-
282s
4.4
0.35.20 complete
0:23:39
0:23:14
-
-
1406s
5.4
0.36.3 parsing
0:04:23
0:04:44
-
-
274s
4.3
0.36.3 complete
0:24:26
0:21:57
-
-
1391s
5.4
0.36.6 parsing
0:01:16
0:02:09
0:01:52
0:02:11
112s
1.8
0.36.6 complete
0:07:26
0:11:29
0:07:31
0:08:54
530s
2.1
Unfortunately i don't know his testing method, but i can confirm that times are still high.
As you can see, there is a gap between 0.29 and 0.33 (0.32 to be precise) which the 0.36 version was unable to fix. So we did some code crawling on that version to see where the "evil code" was introduces and found changes between 0.31 and 0.32 to be the culprit.
To be super sure on this we created flamegraphs for the the versions in question:
Those flamegraphs have a hacked in pprof and were recompiled by me from their latest tag on git. Those profiles have been gathered during a terragrunt run-all init run.
Rationale
(Note: The sections below may mix up the functionality of include and dependency, however, every dependency will cause an implicit include due to required parsing of the dependencies.)
Due to introduction of multiple includes in 0.32 (presumably #1804) we found that the stack sizes between 0.31 and 0.32 drastically changed. This makes sense in general so Terragrunt would now parse includes multiple times and do another step of marshaling and unmarshaling HCL with JSON due to typing conflicts with cty in decode sections - at least from what we understood. Even though we cannot pin those changes to the PartialParseConfig calls, we saw at least increased calls to DecodeBaseBlocks which is called from PartialParseConfig. Also it could be related to changes in configFileHasDependencyBlock.
Our project makes use of a deep variable hierarchy using .hcl files, causing repeated parsing of .hcl files over and over for each module, in addition to the usual root terragrunth.hcl.
Also, our project has hundreds of module instances using terragrunt.hcl files via symbolic links.
Revisiting the changes of the setIAMRole function in config.go (#2010), we considered adding a cache/memoization to either DecodeBaseBlocks or PartialParseConfigString. We already knew that we can cache PartialParseConfigString just as setIamRole does, so this is our preferred option.
Comparing performance:
-
Best case
Average case
Worst case
Regular implementation of partial parsing
O(1)
O(n)
O(n^2)
Cached implementation
O(1)
O(n)
O(n)
With n being number of iterations for parsing dependencies/includes.
As of now we are not aware of any Terragrunt features the cause side-effects on PartialParseConfigString except for the already discussed setIAMRole.
Doing
To prove our points we did an initial implementation reusing the code of cache.go which was implemented for the setIAMRole feature. A switch was added to run the compiled version with and without caching behavior.
Additionally, two fixtures and two benchmarks were written to have a reproducible use cases.
The PR will follow up on this issue.
Please let's discuss your thoughts on our observations and suggested mitigation.
Just as i announced it on the PR, @nilreml will take over the topic as i leave the customer project. However, i believe the implementation is finished to if there is nothing to add from anyone you can give the PR a try and review it.
Motivation
As outlined in #1971 my customer still has issues with the parsing time of the later Terragrunt versions since 0.32. Still there is a requirement for my team to get a new version of Terraform and Terragrunt working ASAP. So @nilreml and I took some time to understand the Terragrunt code better and find a potential solution to our issue.
To get this straight: We are working on a big project containing ~150 AWS accounts, 6 environments and ~75 Terraform modules. The Terragrunt code is sometimes very complex containing multiple dependency layers. We already took the time to optimize and review all of the
*.hcl
files to reduceincludes/read_terragrunt_config
calls, calls to other custom functions and simplifying logic of loops. Still we only get minor improvements in our parsing times (time until all dependencies are resolved and Terraform is triggered the first time).After the merge of #2010 and the 0.36.6 release another colleague of mine did some re-measuring of execution times in our environment:
Unfortunately i don't know his testing method, but i can confirm that times are still high.
As you can see, there is a gap between 0.29 and 0.33 (0.32 to be precise) which the 0.36 version was unable to fix. So we did some code crawling on that version to see where the "evil code" was introduces and found changes between 0.31 and 0.32 to be the culprit.
To be super sure on this we created flamegraphs for the the versions in question:
Those flamegraphs have a hacked in pprof and were recompiled by me from their latest tag on git. Those profiles have been gathered during a
terragrunt run-all init
run.Rationale
(Note: The sections below may mix up the functionality of
include
anddependency
, however, every dependency will cause an implicit include due to required parsing of the dependencies.)Due to introduction of multiple includes in 0.32 (presumably #1804) we found that the stack sizes between 0.31 and 0.32 drastically changed. This makes sense in general so Terragrunt would now parse includes multiple times and do another step of marshaling and unmarshaling HCL with JSON due to typing conflicts with
cty
in decode sections - at least from what we understood. Even though we cannot pin those changes to thePartialParseConfig
calls, we saw at least increased calls toDecodeBaseBlocks
which is called fromPartialParseConfig
. Also it could be related to changes inconfigFileHasDependencyBlock
.Our project makes use of a deep variable hierarchy using
.hcl
files, causing repeated parsing of.hcl
files over and over for each module, in addition to the usual rootterragrunth.hcl
.Also, our project has hundreds of module instances using
terragrunt.hcl
files via symbolic links.Revisiting the changes of the
setIAMRole
function inconfig.go
(#2010), we considered adding a cache/memoization to eitherDecodeBaseBlocks
orPartialParseConfigString
. We already knew that we can cachePartialParseConfigString
just assetIamRole
does, so this is our preferred option.Comparing performance:
With
n
being number of iterations for parsing dependencies/includes.As of now we are not aware of any Terragrunt features the cause side-effects on
PartialParseConfigString
except for the already discussedsetIAMRole
.Doing
To prove our points we did an initial implementation reusing the code of
cache.go
which was implemented for thesetIAMRole
feature. A switch was added to run the compiled version with and without caching behavior.Additionally, two fixtures and two benchmarks were written to have a reproducible use cases.
The PR will follow up on this issue.
Please let's discuss your thoughts on our observations and suggested mitigation.
This story was co-created by @nilreml.
The text was updated successfully, but these errors were encountered: