Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

envconsul [options] <command>, command runs in a subshell #317

Open
simonellefsen opened this issue Nov 11, 2022 · 5 comments
Open

envconsul [options] <command>, command runs in a subshell #317

simonellefsen opened this issue Nov 11, 2022 · 5 comments

Comments

@simonellefsen
Copy link

Envconsul version

0.13.0 and 0.13.1

Command

./envconsul_0.12.1 -log-level=trace -kill-signal=SIGINT -exec-kill-signal=SIGINT -exec-kill-timeout=10s -upcase -secret _vault/aws/sts/internal-envconsul-example/ python3 -u ./program.py

./envconsul_0.13.0 -log-level=trace -kill-signal=SIGINT -exec-kill-signal=SIGINT -exec-kill-timeout=10s -upcase -secret _vault/aws/sts/internal-envconsul-example/ python3 -u ./program.py

./envconsul_0.13.1 -log-level=trace -kill-signal=SIGINT -exec-kill-signal=SIGINT -exec-kill-timeout=10s -upcase -secret _vault/aws/sts/internal-envconsul-example/ python3 -u ./program.py

Debug output

2934949 pts/12   Ss     0:00 -bash
2946182 pts/12   Sl+    0:00  \_ ./envconsul_0.12.1 -log-level=trace -kill-signal=SIGINT -exec-kill-signal=SIGINT -exec-kill-timeout=10s -upcase -secret _vault/aws/sts/internal-envconsul-example/ python3 -u ./program.py
2946191 pts/12   S      0:00      \_ python3 -u ./program.py

2948619 pts/17   Ss     0:00 -bash
2949574 pts/17   Sl+    0:00  \_ ./envconsul_0.13.0 -log-level=trace -kill-signal=SIGINT -exec-kill-signal=SIGINT -exec-kill-timeout=10s -upcase -secret _vault/aws/sts/internal-envconsul-example/ python3 -u ./program.py
2949588 pts/17   S      0:00      \_ sh -c python3 -u ./program.py
2949589 pts/17   S      0:00          \_ python3 -u ./program.py

2946451 pts/16   Ss     0:00 -bash
2948386 pts/16   Sl+    0:00  \_ ./envconsul_0.13.1 -log-level=trace -kill-signal=SIGINT -exec-kill-signal=SIGINT -exec-kill-timeout=10s -upcase -secret _vault/aws/sts/internal-envconsul-example/ python3 -u ./program.py
2948402 pts/16   S      0:00      \_ sh -c python3 -u ./program.py
2948403 pts/16   S      0:00          \_ python3 -u ./program.py

Expected behavior

Expected 0.13.0 and 0.13.1 to behave like version 0.12.1

Actual behavior

Versions 0.13.0 and 0.13.1 launch the python application in a subshell which can cause problem for signal handling and stopping the application to refresh credentials. If for some reason the application ignores the stop signal then envconsul will see the subshell stop but the application will still be running and envconsul will start a new instance of your application and you end up with an old application with old credentials and a new application with new fresh credentials.

@eikenb
Copy link
Contributor

eikenb commented Nov 11, 2022

Hey @simonellefsen, thanks for taking the time to file this!

The subshell signaling issue is handled by using Setpgid on the process to keep it and the shell in the same process group so all Signals will be sent to both. This was done specifically with the subshell to be sure to keep backwards compatibility with the old exec setup. Note that it only uses the subshell when given a "command with spaces" so the subshell can parse the command properly. If the command is a single "command" or specified in the config file as a ["command", "as", "a", "list"] then it will be run directly without the subshell.

You can read more about this in the consul-template exec-mode documentation. Note that envconsul uses consul-template as an underlying library.

I'm going to mark this as a documentation issue as the docs on this behavior were not updated when the version of consul-template was bumped to a version with this new behavior.

Please let me know if I misunderstood your concern or you have any other thoughts. Thanks.

@simonellefsen
Copy link
Author

ok - the Setpgid info is interesting.

After some more testing I then think there is something wrong when envconsul sends SIGKILL signal after kill timeout if the application ignores the default SIGTERM signal.

It looks like the default SIGTERM is sent to the process group but the SIGKILL seems to only be sent to the child, i.e. the subshell leaving the application running in the subshell still running.

./envconsul_0.13.1 -log-level=trace -kill-signal=SIGINT -exec-kill-signal=SIGINT -exec-kill-timeout=10s -upcase -secret _vault/aws/sts/internal-envconsul-example/ 'python3 -u ./program.py'
...
2022-11-11T20:24:24.165Z [INFO]  envconsul: (child) spawning: sh -c python3 -u ./program.py
2022-11-11 20:24:24,192 [INFO] Program started
2022-11-11 20:24:24,193 [INFO] Looping...
2022-11-11 20:24:24,193 [INFO] Dumping _VAULT and _CONSUL environment variables
2022-11-11 20:24:24,193 [INFO] _VAULT_AWS_STS_INTERNAL_ENVCONSUL_EXAMPLE__ACCESS_KEY=ASIATLE54AY3HYROC6W6

The process tree look like

$ ps fj
   PPID     PID    PGID     SID TTY        TPGID STAT   UID   TIME COMMAND
3807404 3807409 3807409 3807409 pts/0    3844555 Ss    5004   0:00 -bash
3807409 3844555 3844555 3807409 pts/0    3844555 Sl+   5004   0:00  \_ ./envconsul_0.13.1 -log-level=trace -kill-signal=SIGINT -exec-kill-signal=SIGINT -exec-kill-timeout=10s -upcase -secret _vault/aws/sts/internal-envcons
3844555 3844569 3844569 3807409 pts/0    3844555 S     5004   0:00      \_ sh -c python3 -u ./program.py
3844569 3844570 3844569 3807409 pts/0    3844555 S     5004   0:00          \_ python3 -u ./program.py

So yes, you are right the subshell and application is the same process group 3844569

But when I sent a SIGINT signal to envconsul I then get the following output

^C2022-11-11T20:28:15.551Z [DEBUG] envconsul.cli: receiving signal: signal=interrupt
Cleaning up...
2022-11-11T20:28:15.551Z [INFO]  envconsul.runner: stopping
2022-11-11T20:28:15.551Z [DEBUG] envconsul.runner: stopping watcher
2022-11-11T20:28:15.551Z [DEBUG] envconsul: (watcher) stopping all views
2022-11-11T20:28:15.551Z [TRACE] envconsul: (watcher) stopping vault.read(_vault/aws/sts/internal-envconsul-example)
2022-11-11T20:28:15.551Z [DEBUG] envconsul.runner: stopping vault token watcher
2022-11-11T20:28:15.551Z [DEBUG] envconsul: (watcher) stopping all views
2022-11-11T20:28:15.551Z [TRACE] envconsul: (watcher) stopping vault.token
2022-11-11T20:28:15.551Z [DEBUG] envconsul.runner: stopping child process
2022-11-11T20:28:15.551Z [INFO]  envconsul: (child) stopping process
2022-11-11T20:28:15.551Z [TRACE] envconsul: (view) vault.read(_vault/aws/sts/internal-envconsul-example) reported stop
2022-11-11T20:28:15.551Z [TRACE] envconsul: (view) vault.token stopping poll (received on view stopCh)
2022-11-11T20:28:15.551Z [TRACE] envconsul: (view) vault.read(_vault/aws/sts/internal-envconsul-example) stopping poll (received on view stopCh)
2022-11-11 20:28:15,551 [INFO] Signal handler called with signal 2
2022-11-11 20:28:15,552 [INFO] Ignoring signal 2

envconsul waits for "-exec-kill-timeout=10s" because the application ignored the signal and then I assume envonsul sent a SIGKILL signal, but this signal seems to have only reached the child (subshell), so when I list my processes my application is still running

$ ps fj
   PPID     PID    PGID     SID TTY        TPGID STAT   UID   TIME COMMAND
3807404 3807409 3807409 3807409 pts/0    3807409 Ss+   5004   0:00 -bash
      1 3844570 3844569 3807409 pts/0    3807409 S     5004   0:00 python3 -u ./program.py

Now as an orphaned child

@simonellefsen
Copy link
Author

So if I look at consul-template code

https://github.com/hashicorp/consul-template/blob/b0fc7625040630036f3fab8b4df418d191096a08/child/child.go#L354-L356

It very much looks like the force kill signal is not sent to the process group but for the default kill signal https://github.com/hashicorp/consul-template/blob/b0fc7625040630036f3fab8b4df418d191096a08/child/child.go#L397-L400 this is sent to the process group

@eikenb
Copy link
Contributor

eikenb commented Nov 12, 2022

That's for pointing that out. Definitely appears to be a bug at first look. It looks like that use should probably be calling the child's internal kill() method so it'd pick up on the setpgid setting as the internal Process.Kill command doesn't take that into account.

You want to file that bug or would you like me to?

Thanks for finding that issue @simonellefsen!!

@eikenb
Copy link
Contributor

eikenb commented Nov 14, 2022

@simonellefsen, thanks for filing the issue with consul-template.. hashicorp/consul-template#1666

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants