Major parsing time regression in 0.34.x --terragrunt-iam-role update #1971

Closed
maunzCache opened this issue Jan 10, 2022 · 12 comments · Fixed by #2010
Labels: bug (Something isn't working), p:high priority, p:in progress (Actively being worked on)

maunzCache (Contributor) commented Jan 10, 2022

Hi everyone,
a colleague of mine and I were trying to update to a newer Terragrunt version because we are heavily outdated (currently on 0.29.2). As part of that we ran some tests to verify that everything still works in general, and we came across a major regression in our parsing times.

To give you a brief idea, here are some measurements using Terraform 0.14.11 (also outdated :/ ).

Note: Runtime is the (assumed) parsing time only, not the whole init!

Single, small terraform module; one aws account

| Version | 0.29.2* | 0.29.10 | 0.30.6 | 0.31.11 | 0.32.6 | 0.33.2 | 0.34.3 | 0.35.16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Runtime | 0:01 | 0:01 | 0:01 | 0:01 | 0:02 | 0:02 | 0:10 | 0:10 |
| Runtime compared to current* | | = | = | = | +0:01 | +0:01 | +0:09 | +0:09 |

Single, but bigger terraform module; one aws account

| Version | 0.29.2* | 0.29.10 | 0.30.6 | 0.31.11 | 0.32.6 | 0.33.2 | 0.34.3 | 0.35.16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Runtime | 0:15 | 0:13 | 0:13 | 0:12 | 0:21 | 0:21 | 1:41 | 1:40 |
| Runtime compared to current* | | -0:02 | -0:02 | -0:03 | +0:06 | +0:06 | +1:26 | +1:25 |

13 terraform modules; one aws account

| Version | 0.29.2* | 0.29.10 | 0.30.6 | 0.31.11 | 0.32.6 | 0.33.2 | 0.34.3 | 0.35.16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Runtime | 0:45 | 0:51 | 0:49 | 0:45 | 1:22 | 1:20 | 6:20 | 5:57 |
| Runtime compared to current* | | +0:06 | +0:04 | = | +0:37 | +0:35 | +5:27 | +5:12 |

2/17/13 (32) terraform modules; three aws accounts

| Version | 0.29.2* | 0.29.10 | 0.30.6 | 0.31.11 | 0.32.6 | 0.33.2 | 0.34.3 | 0.35.16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Runtime | 1:01 | 1:02 | 0:59 | 0:57 | 1:48 | 1:49 | 7:06 | 7:02 |
| Runtime compared to current* | | +0:01 | -0:02 | -0:04 | +0:47 | +0:48 | +6:05 | +6:01 |

(Edit: I thought it would be nice to gather a few more times, so I'll just edit them in as they happen.)

For reference, this is my machine:

LENOVO 20WM00AAGE
Microsoft Windows 10 Pro 10.0.19043 Build 19043
WSL 2 Kernel-Version: 5.10.60.1 (running Ubuntu 20.04)
11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz, 2803 MHz, 4 Cores
16GB RAM

I am calling terragrunt via a wrapper script, but here is the unbiased call:
terragrunt_${terragrunt_version} run-all init --terragrunt-ignore-external-dependencies --terragrunt-parallelism 24

Edit: This is an updated version of the script; it measures the same thing but runs a bit faster.

#!/usr/bin/env bash

set -o pipefail

versions=("0.29.2" "0.29.10" "0.30.6" "0.31.11" "0.32.6" "0.33.2" "0.34.3" "0.35.16" "memo")

for terragrunt_version in "${versions[@]}"; do
    # command="terragrunt_${terragrunt_version} terragrunt-info"
    command="terragrunt_${terragrunt_version} graph-dependencies --terragrunt-ignore-external-dependencies --terragrunt-parallelism 24"

    . awsume -a my-iam-role

    start=$(date +"%H:%M:%S")
    start_epoch=$(date --date "$start" +%s)
    echo "Running ${terragrunt_version}"
    echo "Start: ${start}"

    eval ${command}

    end=$(date +"%H:%M:%S")
    end_epoch=$(date --date "$end" +%s)
    echo "End: ${end}"

    time_diff_epoch=$((end_epoch - start_epoch))
    nice_diff=$(date -d "@${time_diff_epoch}" -u +%k:%M:%S | sed "s/ //")  # %k pads with a space; strip it
    echo "${terragrunt_version} runtime: ${nice_diff}"

    echo "-----"
done

This is, in general, the script that I use for the measurements:

#!/usr/bin/env bash

set -o pipefail

versions=("0.29.2" "0.29.10" "0.30.6" "0.31.11" "0.32.6" "0.33.2" "0.34.3" "0.35.16")

terragrunt_parsing_time() {
    tg_parsing_start_date=$(grep -oP "(?<=^time\=)[0-9T:+-]+(?= Executing)" "$1") || {
        echo "No \"Executing\" message:"
        exit 1
    }
    tg_parsing_end_date=$(grep -oP "(?<=^time\=)[0-9T:+-]+(?= .*msg=(The stack|Stack))" "$1") || {
        echo "No \"Stack\" message:"
        exit 1
    }
    tg_parsing_start_epoch=$(date --date "$tg_parsing_start_date" +%s)
    tg_parsing_end_epoch=$(date --date "$tg_parsing_end_date" +%s)
    tg_parsing_seconds=$((tg_parsing_end_epoch - tg_parsing_start_epoch))
    # Write the length of the parsing stage to stdout, formatted like the other stats
    date -d "@${tg_parsing_seconds}" -u +%k:%M:%S | sed "s/ //"
}

for terragrunt_version in "${versions[@]}"; do
    . awsume -a my-iam-role
    command="terragrunt_${terragrunt_version} run-all init --terragrunt-ignore-external-dependencies --terragrunt-parallelism 24"
    output_file="output_terragrunt_${terragrunt_version}"

    echo -e "\nterragrunt_${terragrunt_version}"
    echo -e "\ntime=$(date --iso-8601=seconds) Executing: ${command}\n" | tee ${output_file}
    eval "nice stdbuf -oL ${command}  2>&1 " | stdbuf -oL tee -a ${output_file}

    echo -e "\tFinished terragrunt_${terragrunt_version}\n"
    time_taken="$(terragrunt_parsing_time ${output_file})"
    echo "terragrunt parsing time: ${time_taken}" | tee -a ${output_file}

    # Create/append statistic files
    echo "terragrunt_${terragrunt_version}: ${time_taken}" | tee -a output_memo_stats
done

It is okay that some minor time increases are introduced, but for version 0.34.x we see a huge difference. Even though the table says 0.34.3, we tested with 0.34.0 as well. It is worth mentioning that 0.32.x also introduced a regression, but in comparison it is a minor one.

Because we can clearly see which version introduced the regression, it was rather easy to find the change that led to it. I am pretty confident that it was introduced in #667 (see the merge commit of the release). The additional call to PartialParseConfigString() in config/config.go really increases the time here.
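
Just to illustrate where I think the time goes (this is not Terragrunt's actual code, only a minimal Go sketch using the HCL parser directly): every additional, independent parse of the same terragrunt.hcl text pays the full lexing and decoding cost again, and with run-all that extra cost is paid for every module in the stack.

```go
package main

import (
	"fmt"
	"os"
	"time"

	"github.com/hashicorp/hcl/v2/hclparse"
)

func main() {
	src, err := os.ReadFile("terragrunt.hcl")
	if err != nil {
		panic(err)
	}

	// First parse: what the normal config evaluation needs anyway.
	start := time.Now()
	if _, diags := hclparse.NewParser().ParseHCL(src, "terragrunt.hcl"); diags.HasErrors() {
		panic(diags)
	}
	fmt.Println("first parse: ", time.Since(start))

	// Second, independent parse of the same bytes, standing in for the extra
	// partial parse that only extracts iam_role: the text is lexed and parsed
	// again from scratch, so this cost is paid once more per module.
	start = time.Now()
	if _, diags := hclparse.NewParser().ParseHCL(src, "terragrunt.hcl"); diags.HasErrors() {
		panic(diags)
	}
	fmt.Println("second parse:", time.Since(start))
}
```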

As I am not that familiar with the code, I cannot say what the actual cause is or what leads to the problem, but I am certain that it is more than just parsing, because the time does not merely double; the increase is much bigger than a factor of 2. This is a huge problem for my customer. Right now we are also updating our *.hcl structure to gain some performance in our project, because we are at around 35 minutes of parsing for a production deployment. That means updating to a newer Terragrunt version right now would worsen it to ~90 minutes of waiting. Even though we could go into hardware mode and use a beefy machine to plan and apply... well, let's fix it like developers :)

Our general idea to increase Terragrunt's performance was to introduce caching, or more precisely memoization, for the PartialParseConfigString() function. For this I introduced the library https://github.com/kofalt/go-memoize . However, even though parsing did improve, I am now at the point where I don't know how to test for other regressions my change may introduce, because I think reparsing is crucial for the original --terragrunt-iam-role feature, so caching may have to be done at another point, if it is possible at all.
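
To make the idea concrete, here is a minimal, self-contained sketch (not my actual change; the wrapper name cachedPartialParse and the cache settings are invented) of memoizing an expensive parse with go-memoize, including a note on why the cache key is the tricky part:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"time"

	"github.com/kofalt/go-memoize"
)

// parseCache memoizes results for 90 seconds and purges old entries every
// 10 minutes; both values are arbitrary for this sketch.
var parseCache = memoize.NewMemoizer(90*time.Second, 10*time.Minute)

// cachedPartialParse is an invented wrapper name, not Terragrunt code. It keys
// the cache by a hash of the raw HCL text. In Terragrunt the parse result also
// depends on the evaluation context (includes, inputs, the parent's iam_role),
// so a real key would have to capture that too, which is exactly why naive
// memoization of PartialParseConfigString() is risky.
func cachedPartialParse(configString string, parse func() (interface{}, error)) (interface{}, error) {
	key := fmt.Sprintf("%x", sha256.Sum256([]byte(configString)))
	result, err, _ := parseCache.Memoize(key, parse)
	return result, err
}

func main() {
	parse := func() (interface{}, error) {
		time.Sleep(200 * time.Millisecond) // stand-in for the expensive parse
		return "parsed config", nil
	}

	for i := 0; i < 3; i++ {
		start := time.Now()
		cfg, _ := cachedPartialParse(`iam_role = "arn:aws:iam::123456789012:role/example"`, parse)
		fmt.Println(cfg, time.Since(start)) // only the first iteration pays the 200ms
	}
}
```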

I'd be happy for any suggestions on how to deal with this topic further. I can try to provide some snippets of our infrastructure as code, or maybe come up with a small reproduction setup for the regression. Still, I wonder how this regression is triggered without using the flag at all...

Edit:
While adding more times to the tables above, I just noticed a serious impact on execution as well in the bigger stacks. With 0.35.x, the same init that runs on 0.29.x in 6 minutes takes ~1 hour instead. However, I am not sure whether my init is currently being throttled by anything, so I'll retry later or tomorrow for confirmation.

maunzCache changed the title from "Major parsing time regression with introduction of --terragrunt-iam-role update" to "Major parsing time regression in 0.34.x --terragrunt-iam-role update" Jan 10, 2022
denis256 (Member) commented Jan 10, 2022

Hello,
an interesting finding, I will take a look into it.

denis256 self-assigned this Jan 10, 2022
brikis98 (Member) commented

Thanks @denis256!

yorinasub17 added the bug (Something isn't working) and p:high priority labels Jan 13, 2022
maunzCache (Contributor, Author) commented

Hi @denis256 & @yorinasub17
Thank you for taking a look into it. I've seen that this has been labeled as a bug with high priority. Have you had a chance to look into the matter, and can you give some information about it? This would help me a) understand what went wrong and b) plan the update process at my customer, so that I can tell my Scrum Master what can be done next :)

I would also be glad if there is a way for me to support you, of course!

denis256 (Member) commented

Hi, I will try to submit a fix this week

denis256 added the p:in progress (Actively being worked on) label Jan 18, 2022
denis256 (Member) commented

Hi @maunzCache

> With 0.35.x, the same init that runs on 0.29.x in 6 minutes takes ~1 hour instead

Did the test include v0.35.18? In that version an issue was fixed which caused multiple auto-inits.

> I can try to provide some snippets of our infrastructure as code or maybe come up with a small reproduction setup for the regression

It will be helpful to have example code to get better visibility on the issue; so far I have tested with some abstract tests which didn't show a large time difference.

maunzCache (Contributor, Author) commented Jan 24, 2022

Thanks for the response @denis256
No, as written in the table above, I only tested with <= 0.35.16. So I may be able to run another suite today with the updated binary :)

Here is the same suite with the newer binaries. We only updated some .hcl files to have fewer imports, which is why the time for the 0.29.2 version decreased compared to the first post.

And well, I don't see the problem as fixed.

Single, small terraform module; one aws account

| Version | 0.29.2* | 0.35.16 | 0.35.18 | 0.35.20 | 0.36.0 |
| --- | --- | --- | --- | --- | --- |
| Runtime | 0:01 | 0:03 | 0:03 | 0:02 | 0:03 |
| Runtime compared to current* | | +0:02 | +0:02 | +0:01 | +0:02 |

Single, but bigger terraform module; one aws account

| Version | 0.29.2* | 0.35.16 | 0.35.18 | 0.35.20 | 0.36.0 |
| --- | --- | --- | --- | --- | --- |
| Runtime | 0:09 | 0:59 | 0:50 | 0:50 | 0:54 |
| Runtime compared to current* | | +0:50 | +0:41 | +0:41 | +0:45 |

13 terraform modules; one aws account

| Version | 0.29.2* | 0.35.16 | 0.35.18 | 0.35.20 | 0.36.0 |
| --- | --- | --- | --- | --- | --- |
| Runtime | 0:18 | 1:42 | 1:45 | 1:50 | 1:56 |
| Runtime compared to current* | | +1:24 | +1:27 | +1:32 | +1:38 |

2/17/13 (32) terraform modules; three aws accounts

| Version | 0.29.2* | 0.35.16 | 0.35.18 | 0.35.20 | 0.36.0 |
| --- | --- | --- | --- | --- | --- |
| Runtime | 0:29 | 3:00 | 2:45 | 3:00 | 2:55 |
| Runtime compared to current* | | +2:31 | +2:16 | +2:31 | +2:26 |

Edit: The problem now is that my colleagues and I updated some of the code, which makes the runs appear faster. Please use the new 0.35.16 time above as the reference instead; in relation to it there is still a significant time difference.

For the example, I'll see what I can come up with. I may need to simplify some customer code or create a bogus setup from the original source, so as not to reveal any critical components while still keeping the bug intact.

By now I am not even sure whether our code structure is just somehow messed up in a way that makes the huge time difference appear between the versions.

maunzCache (Contributor, Author) commented

I'll need some time to finish setting up the sample code. While doing that, I saw a lot of dependencies between some files which make the parsing time skyrocket in comparison. Sooo maybe the issue is not only related to the parsing itself but also to the dependency management, which has changed over time. At least that is something I noticed while filling up the project with files.

maunzCache (Contributor, Author) commented Jan 25, 2022

Hi @denis256
You may pull an example from https://github.com/maunzCache/terragrunt-issue-1971 .
This is the closest I could come up with to my issue. The setup will not allow you to execute an init fully, because you need AWS credentials, yet it should be enough to see the lag during parsing.

Creating the dependency graph (see my example scripts in the initial post) will also show you the different execution times. I ran this repository with the following results:

Running on my-env

| Version | 0.29.2* | 0.32.6 | 0.34.3 | 0.35.16 | 0.35.20 | 0.36.0 |
| --- | --- | --- | --- | --- | --- | --- |
| Runtime | 0:03 | 0:06 | 0:12 | 0:14 | 0:13 | 0:14 |
| Runtime compared to current* | | +0:03 | +0:09 | +0:11 | +0:10 | +0:11 |

The execution times are lower than in my initial post, but this is mainly due to how I minified the setup. Of course, the original files have a lot more variable magic such as for loops and includes. Still, I think the results are interesting even for such a small scope.

I'm not sure whether a beefier machine will still show similar results, so you may need to stuff some bogus code into those .hcl files to make it slower in general.

denis256 (Member) commented

Hi, thanks for sharing - will look for the root cause

maunzCache (Contributor, Author) commented

Hi @denis256
have you made any progress on the issue? My customer is asking me, so I thought I'd also ask whether the code example was any good.

denis256 (Member) commented Feb 8, 2022

Hi @maunzCache, the provided code was helpful to identify that parsing takes more time. However, caching of PartialParseConfigString led to issues with variable substitution. After checking flame graphs, I implemented caching of the IAMRole value obtained from parsing the HCL files, which reduced execution time from ~16.2s to ~8.3s.

Before:

real    0m6.941s
user    0m15.873s
sys     0m0.562s

After:

real    0m4.103s
user    0m8.019s
sys     0m0.352s

#2010
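
A minimal sketch of the idea, not the actual change from #2010 (the type and method names here are invented): the resolved iam_role value is cached keyed by a SHA-256 of the configuration text, so the expensive partial parse runs only once per distinct config, while differing content is still parsed normally.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

// iamRoleCache is illustrative only; it caches the resolved iam_role per
// distinct configuration text, keyed by a SHA-256 of that text.
type iamRoleCache struct {
	mu    sync.Mutex
	roles map[string]string // SHA-256 of config text -> resolved IAM role
}

func hashKey(configText string) string {
	sum := sha256.Sum256([]byte(configText))
	return hex.EncodeToString(sum[:])
}

func (c *iamRoleCache) Get(configText string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	role, ok := c.roles[hashKey(configText)]
	return role, ok
}

func (c *iamRoleCache) Put(configText, role string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.roles == nil {
		c.roles = map[string]string{}
	}
	c.roles[hashKey(configText)] = role
}

func main() {
	cache := &iamRoleCache{}
	cfg := `iam_role = "arn:aws:iam::123456789012:role/example"`

	if _, ok := cache.Get(cfg); !ok {
		// The expensive partial parse would happen here, exactly once per distinct config.
		cache.Put(cfg, "arn:aws:iam::123456789012:role/example")
	}
	role, _ := cache.Get(cfg)
	fmt.Println("cached iam_role:", role)
}
```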

denis256 linked a pull request Feb 8, 2022 that will close this issue
maunzCache (Contributor, Author) commented Mar 1, 2022

Hi again,
my colleague and I just noticed that we haven't given you any feedback on whether the fix solves the issue for us. From a source code point of view I think it will help; however, we will provide you with some time measurements in the upcoming days. As there is always something else to work on, I simply forgot to follow up on this issue.

denis256 added a commit that referenced this issue Mar 16, 2022
* Added caching of IAM role

* Updated description

* Replaced md5 checksum calculation with SHA-256