This repository has been archived by the owner on Aug 30, 2019. It is now read-only.
Problem
After a trace-agent crash, /var/log/datadog/trace-agent.log did not contain the error. The error could be found in /var/log/datadog/supervisord.log instead, and by running the trace agent manually you could see the goroutines spans.
More details on Trello.
Fix and test
The root of the problem is the watchdog.Go() function: if you add a panic(...) in it (https://github.com/DataDog/datadog-trace-agent/blob/master/watchdog/logonpanic.go#L49), you will reproduce the error. If you go back to the old way of creating goroutines (before the factoring) and add panics in them, the crash is logged properly.
Tested by running different versions of the trace-agent on one staging node, adding panics and changing code.
Adding a panic at this exact line: https://github.com/DataDog/datadog-trace-agent/blob/master/model/statsraw.go#L211 reproduces the error described in the Trello card. With the reverts below applied, the issue is solved: the panic is logged (tested).