Skip to content
This repository has been archived by the owner on Aug 30, 2019. It is now read-only.

Raphael/log errors #287

Merged
merged 3 commits into from
May 30, 2017
Merged

Raphael/log errors #287

merged 3 commits into from
May 30, 2017

Conversation

raphaelgavache
Copy link
Member

Problem

After a trace-agent crash /var/log/datadog/trace-agent.log did not contain the error.
In /var/log/datadog/supervisord.log you could find this error:

2017-05-22 20:15:01,832 INFO exited: trace-agent (exit status 2; not expected)
2017-05-22 20:15:02,482 INFO gave up: trace-agent entered FATAL state, too many start retries too quickly

And by running the trace agent manually you could see the goroutines spans.
More details on trello

Fix and test

The root of the problem is watchdog.Go() function : if you add a panic(...) in it https://github.com/DataDog/datadog-trace-agent/blob/master/watchdog/logonpanic.go#L49, you will reproduce the error.
Now if you go back to the old way (before factoring) of creating go routines

	go func() {
		defer watchdog.LogOnPanic()
 		server.Serve(stoppableListener)
	})
	}()

and add panics in it, the crash will properly log.

Did tests by running different versions of the trace-agent on one node of staging, adding panics and changing code.

Adding a panic at this exact line: https://github.com/DataDog/datadog-trace-agent/blob/master/model/statsraw.go#L211 reproduces the error explained in the trello card. Doing it with the reverts below will solve the issue (tested it and it does log)

…ting"

This reverts commit 641297c.

Conflicts:
	agent/endpoint.go
	agent/receiver.go
	watchdog/logonpanic.go
@raphaelgavache raphaelgavache requested a review from ufoot May 30, 2017 14:57
Copy link

@ufoot ufoot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very fine with reverting this watchdog.Go thingy, to be honest, I prefer the "manual" defer syntax, IMHO it's easier to understand what it does. Still, there's a conflict with master now (I'm the culprit, having changed some things in concentrator). Could you please resolve that conflict and re-submit the PR? This way I could really validate it. Thanks in advance.

agent/agent.go Outdated
// Need to do this computation before entering the concentrator
// as they access the Metrics map, which is not thread safe.
t.ComputeWeight(*root)
t.ComputeTopLevel()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 thanks

@raphaelgavache raphaelgavache merged commit 0fdebf6 into master May 30, 2017
@LotharSee LotharSee added this to the 5.14 milestone Jun 6, 2017
@raphaelgavache raphaelgavache deleted the raphael/log-errors branch July 25, 2017 17:29
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants