Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve nsexec logging #2487

Closed

Conversation

kolyshkin
Copy link
Contributor

@kolyshkin kolyshkin commented Jun 25, 2020

This is a carry of #2460 with a few fixes on top:

Keeping this one a draft until #2835 is merged.

Closes: #2460

@kolyshkin
Copy link
Contributor Author

Current theory: added debug logging revealed some kind of a race in the code, which is now failing.

@kolyshkin kolyshkin force-pushed the improve-nsexec-logging branch 5 times, most recently from 2cf1faa to 8f872aa Compare July 14, 2020 02:58
@kolyshkin

This comment has been minimized.

@kolyshkin kolyshkin force-pushed the improve-nsexec-logging branch from 8f872aa to 707d9db Compare July 14, 2020 03:53
@kolyshkin
Copy link
Contributor Author

OK, apparently the golang runtime just ends earlier than the nsexec part, so some part of logs are not printed. Perhaps we need to wait for ForwardLogs to finish.

@kolyshkin
Copy link
Contributor Author

Current theory: added debug logging revealed some kind of a race in the code, which is now failing.

yes, basically it's a race between the logger goroutine and the main thread.

@kolyshkin kolyshkin force-pushed the improve-nsexec-logging branch from 707d9db to 9a1afdf Compare July 14, 2020 20:43
@kolyshkin
Copy link
Contributor Author

OK, it looks like I have fixed the tests. Let me split the last commit into smaller pieces and I guess this can go in.

@kolyshkin kolyshkin force-pushed the improve-nsexec-logging branch 2 times, most recently from 634317d to b03b723 Compare July 15, 2020 01:08
@kolyshkin kolyshkin changed the title [do not merge] Improve nsexec logging Improve nsexec logging Jul 15, 2020
@kolyshkin kolyshkin force-pushed the improve-nsexec-logging branch 4 times, most recently from 05a3cf5 to 9bae011 Compare July 16, 2020 02:28
@kolyshkin kolyshkin marked this pull request as ready for review July 16, 2020 03:19
@kolyshkin
Copy link
Contributor Author

This one is ready, @cyphar @mrunalp PTAL

factory, err := libcontainer.New("")
if err != nil {
logrus.Fatalf("unable to initialize for container: %s", err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logrus is replaced by panic, because we just closed the logrus file descriptor. This is a test case so doesn't matter much.

@@ -19,12 +23,40 @@ func init() {
}
runtime.GOMAXPROCS(1)
runtime.LockOSThread()

logLevel := logrus.InfoLevel
if level := os.Getenv("_LIBCONTAINER_LOGLEVEL"); level != "" {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was made optional because most (or all) of the test cases do not set Process.LogLevel and it is easier to have the default here rather than to amend all the tests.

Alternatively, the default can be moved somewhere to Process.Start

@kolyshkin
Copy link
Contributor Author

TODO: close the init's logPipeFd as late as possible, so runc init will be able to use logrus.Debugf on a later stage. It works, just have to figure out where to close it.

@kolyshkin kolyshkin closed this Jul 21, 2020
@kolyshkin kolyshkin reopened this Jul 21, 2020
@kolyshkin kolyshkin force-pushed the improve-nsexec-logging branch from 9bae011 to f3c34ce Compare July 31, 2020 03:38
@kolyshkin kolyshkin marked this pull request as draft July 31, 2020 03:53
@kolyshkin kolyshkin force-pushed the improve-nsexec-logging branch 2 times, most recently from cd4dca6 to f9b4cf0 Compare August 9, 2020 01:56
@kolyshkin kolyshkin mentioned this pull request Aug 9, 2020
@kolyshkin kolyshkin force-pushed the improve-nsexec-logging branch from f9b4cf0 to b79c8f5 Compare November 28, 2020 07:35
@kolyshkin
Copy link
Contributor Author

Rebased on top of the current master. Don't remember what was I stuck as last time I was working on it.

@thaJeztah
Copy link
Member

At least one failure;

TestLogForwardingDoesNotStopOnLogLevelParsingErr
panic: close of closed channel

goroutine 11 [running]:
github.com/opencontainers/runc/libcontainer/logs.ForwardLogs(0x5af480, 0xc0001df100)
	/go/src/github.com/opencontainers/runc/libcontainer/logs/logs.go:46 +0x21a
github.com/opencontainers/runc/libcontainer/logs.startLogForwarding.func1(0xc0001df100, 0xc000018540)
	/go/src/github.com/opencontainers/runc/libcontainer/logs/logs_linux_test.go:120 +0x37
created by github.com/opencontainers/runc/libcontainer/logs.startLogForwarding
	/go/src/github.com/opencontainers/runc/libcontainer/logs/logs_linux_test.go:119 +0xf2
FAIL	github.com/opencontainers/runc/libcontainer/logs	0.040s

@kolyshkin kolyshkin force-pushed the improve-nsexec-logging branch 7 times, most recently from 703bef9 to 4eaf907 Compare March 6, 2021 04:19
@kolyshkin
Copy link
Contributor Author

I was running in circles around it but it looks like I finally made it working.

The fix itself (for the flakes in debug.bats, i.e. #2460) is actually pretty simple:

  1. Do close log pipe writing side file descriptor (i.e. _LIBCONTAINER_LOGPIPE) in runc init;
  2. Do wait for logs.ForwardLogs() to finish.

Before this PR, the logpipe fd was only closed upon successful execve() in runc init (thanks to CLOEXEC), but it was not closing in case of runc start (as runc init sits there waiting for exec.fifo) and in cases of various errors before or in execve(). Consequently, there was no mechanism to wait for ForwardLogs to finish, and if there would be, it won't finish anyway because the pipe was kept open.

So, it all works now but I'm keeping it as a draft because I have yet to convert this to a set of nice and clean commits.

It does not make sense to have a string for a numeric type.

Signed-off-by: Kir Kolyshkin <[email protected]>
1. Remove redundant "echo $output" from the first test case, as "runc"
   helper function already logs the output.

2. Show the contents of log.out to stderr, so it case of error we can
   see what is going on.

3. Remove the check that `log.out` file exists. This check is redundant,
   since right after it we do `cat log.out` and check its exit code.

4. Factor out common checks into check_debug.

Signed-off-by: Kir Kolyshkin <[email protected]>
Error handling is slightly cleaner this way.

While at it, do minor refactoring and fix error logging
in processEntry.

Signed-off-by: Kir Kolyshkin <[email protected]>
This fixes a big red warning in my vim.

Signed-off-by: Kir Kolyshkin <[email protected]>
I have noticed that ConfigureLogs do not return an error in case logging
was already configured -- instead it just warns about it. So I went
ahead and changed the warning to the actual error...

... only to discover I broke things badly, because in case of runc init
logging is configured twice. The fix is to not configure logging in case
we are init.

Signed-off-by: Kir Kolyshkin <[email protected]>
Sometimes debug.bats test cases are failing like this:

> not ok 27 global --debug to --log --log-format 'json'
> # (in test file tests/integration/debug.bats, line 77)
> #   `[[ "${output}" == *"child process in init()"* ]]' failed

It happens more when writing to disk.

This issue is caused by the fact that runc spawns log forwarding goroutine
(ForwardLogs) but does not wait for it to finish, resulting in missing
debug lines from nsexec.

ForwardLogs itself, though, never finishes, because it reads from a
reading side of a pipe which writing side is not closed. This is
especially true in case of runc create, which spawns runc init and
exits; meanwhile runc init waits on exec fifo for some undetermined long
time before doing execve.

So, to fix the failure described above, we need to:

 1. Make runc create/run/exec wait for ForwardLogs to finish;

 2. Make runc init close its log pipe file descriptor (i.e.
    the one which value is passed in _LIBCONTAINER_LOGPIPE
    environment variable).

This is exactly what this commit does. In addition, it

 - adds logrus debug to late stages of runc init, and amends the
   integration tests to check for those messages;

 - adds runc --debug exec test cases, similar to those in debug.bats
   but for runc exec eather than runc run;

 - refactors libcotainer/logs unit tests to simplify and update for
   the new ForwardLogs.

PS I have to admit I still do not understand why closing log pipe fd
is required in e.g. (*linuxSetnsInit).Init, right before the execve
which (thanks to CLOEXEC) closes the fd anyway.

Signed-off-by: Kir Kolyshkin <[email protected]>
cyphar and others added 2 commits March 5, 2021 21:50
In order to make 'runc --debug' actually useful for debugging nsexec
bugs, provide information about all the internal operations when in
debug mode.

[@kolyshkin: auto-fix formatting using indent for make validate to pass]

Signed-off-by: Aleksa Sarai <[email protected]>
Signed-off-by: Kir Kolyshkin <[email protected]>
Some strings logged by write_log() contain \n, which leads to errors
like this one:

> # time="2020-06-07T15:41:37Z" level=error msg="failed to decode \"{\\\"level\\\":\\\"debug\\\", \\\"msg\\\": \\\"nsexec-0[2265]: update /proc/2266/uid_map to '0 1000 1\\n\" to json: invalid character '\\n' in string literal"

The fix is to escape such characters.

Signed-off-by: Kir Kolyshkin <[email protected]>
@kolyshkin kolyshkin force-pushed the improve-nsexec-logging branch from 4eaf907 to fce08a9 Compare March 6, 2021 05:52
@kolyshkin
Copy link
Contributor Author

will open another one

@kolyshkin kolyshkin closed this Mar 6, 2021
@kolyshkin kolyshkin mentioned this pull request Mar 6, 2021
@kolyshkin
Copy link
Contributor Author

Split into #2835 and #2836

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants