Always drain requests before responding #341
Conversation
Let's do more testing as we've discussed, though the approach should fix the problem with the pre-sampler.
@@ -195,6 +196,7 @@ func (r *HTTPReceiver) replyTraces(v APIVersion, w http.ResponseWriter) {
 // handleTraces knows how to handle a bunch of traces
 func (r *HTTPReceiver) handleTraces(v APIVersion, w http.ResponseWriter, req *http.Request) {
 	if !r.preSampler.Sample(req) {
+		io.Copy(ioutil.Discard, req.Body)
👍
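For context, here is a minimal, self-contained sketch of the drain-before-reply pattern the diff introduces; the route, the reply, and the `sampleRequest` stub are placeholders, not the agent's actual code:

```go
package main

import (
	"io"
	"io/ioutil"
	"net/http"
)

// sampleRequest is a stand-in for the real pre-sampler decision.
func sampleRequest(req *http.Request) bool { return false }

// handleTraces sketches the drain-before-reply pattern: even when a payload
// is rejected, the body is read to completion so the keep-alive connection
// stays in a reusable state.
func handleTraces(w http.ResponseWriter, req *http.Request) {
	if !sampleRequest(req) {
		// Drain the unread body before replying; answering while the
		// client is still writing can reset the connection.
		io.Copy(ioutil.Discard, req.Body)
		w.WriteHeader(http.StatusOK)
		return
	}
	// ... normal decoding path would go here ...
}

func main() {
	http.HandleFunc("/v0.3/traces", handleTraces)
	http.ListenAndServe("localhost:8126", nil)
}
```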
 	trace := model.Trace{}
 	for j := 0; j < size; j++ {
-		trace = append(trace, GetTestSpan())
+		span := GetTestSpan()
+		if realisticIDs {
I introduced this because I noticed that in some tests, when running in debug mode, the log showed many messages saying traces had been dropped because they contained spans with the same ID, and that behaviour is legitimate. It's a nitpick, but unrealistic traces make the tests less realistic too, so they might miss interesting real-world cases.
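For illustration, generating "realistic" IDs in a test helper could look like the sketch below; the helper name and seed are made up, the point is a seeded source that yields unique yet deterministic IDs:

```go
package main

import (
	"fmt"
	"math/rand"
)

// testRand is seeded so generated IDs are unique across spans yet
// deterministic from one test run to the next (no flakiness).
var testRand = rand.New(rand.NewSource(42))

// randomID is a hypothetical helper returning a realistic 64-bit
// span/trace ID instead of a hardcoded constant.
func randomID() uint64 {
	return testRand.Uint64()
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(randomID())
	}
}
```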
@@ -519,17 +523,80 @@ func TestHandleTraces(t *testing.T) {
 	ts, ok := rs.Stats[info.Tags{Lang: lang}]
 	assert.True(ok)
 	assert.Equal(int64(20), ts.TracesReceived)
-	assert.Equal(int64(57622), ts.TracesBytes)
+	assert.Equal(int64(59222), ts.TracesBytes)
The slight increase is because trace IDs and span IDs are now longer. The values are still predictable (seeded random), so there's no flakiness.
+// chunkedReader is a reader which forces partial reads, this is required
+// to trigger some network related bugs, such as body not being read fully by server.
+// Without this, all the data could be read/written at once, not triggering the issue.
+type chunkedReader struct {
That was the tricky part: without it, it's impossible to reproduce the bug.
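A possible shape for such a reader, assuming the real helper works along these lines: every Read returns at most a few bytes, so the server cannot consume the whole body in a single call:

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// chunkedReader wraps another reader and forces partial reads by
// returning at most chunkSize bytes per call to Read.
type chunkedReader struct {
	r         io.Reader
	chunkSize int
}

func (c *chunkedReader) Read(p []byte) (int, error) {
	if len(p) > c.chunkSize {
		p = p[:c.chunkSize]
	}
	return c.r.Read(p)
}

func main() {
	cr := &chunkedReader{r: strings.NewReader("hello world"), chunkSize: 3}
	buf := make([]byte, 64)
	for {
		n, err := cr.Read(buf)
		fmt.Printf("read %d bytes: %q\n", n, buf[:n])
		if err == io.EOF {
			break
		}
	}
}
```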
+	}}
+	for i := 0; i < 3; i++ {
+		wg.Add(1)
+		go func() {
I'm not totally sure the goroutines are strictly necessary, but they make the test more realistic anyway and don't cost much.
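The concurrent part of the test presumably looks something like this sketch; the URL, payload, and client setup here are placeholders:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
)

func main() {
	client := &http.Client{}
	payload := []byte(`[]`) // placeholder body

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Several goroutines posting at once over keep-alive
			// connections is what makes the race reproducible.
			resp, err := client.Post("http://localhost:8126/v0.3/traces",
				"application/json", bytes.NewReader(payload))
			if err != nil {
				fmt.Println("request error:", err)
				return
			}
			resp.Body.Close()
		}()
	}
	wg.Wait()
}
```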
+	// Make sure we use share clients, and they are reused.
+	client := &http.Client{Transport: &http.Transport{
+		MaxIdleConnsPerHost: 100,
Increasing MaxIdleConnsPerHost is common practice to ensure the connection is reused.
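For illustration, a shared client along these lines forces reuse; note that the client side must also drain and close response bodies for a connection to go back to the idle pool (the URL below is just an example):

```go
package main

import (
	"io"
	"io/ioutil"
	"net/http"
)

// A single shared client with a large idle pool, so requests reuse
// connections instead of opening a fresh one every time.
var sharedClient = &http.Client{Transport: &http.Transport{
	MaxIdleConnsPerHost: 100,
}}

func get(url string) error {
	resp, err := sharedClient.Get(url)
	if err != nil {
		return err
	}
	// Drain and close the body so the connection returns to the idle
	// pool and can be reused by the next request.
	io.Copy(ioutil.Discard, resp.Body)
	return resp.Body.Close()
}

func main() {
	_ = get("http://localhost:8126/") // example URL, not from the PR
}
```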
Great job finding a way to trigger this bug 👍 Though I'm not sure it is exactly the same bug we've been observing in real use cases: the bug triggered in this test disappears if we don't send requests concurrently from several goroutines, while the behavior we've observed in real use cases doesn't seem to depend on that.

The two bugs look extremely similar and certainly share a common cause, since this patch fixes both. I would like to understand better what the exact cause of this bug is, but in the meantime I think this test is good enough to validate the fix and ship it.
@@ -83,13 +83,28 @@ func RandomTrace(maxLevels, maxSpans int) model.Trace {
 
 // GetTestTrace returns a []Trace that is composed by ``traceN`` number
 // of traces, each one composed by ``size`` number of spans.
-func GetTestTrace(traceN, size int) model.Traces {
+func GetTestTrace(traceN, size int, realisticIDs bool) model.Traces {
What are the cases where we would not want these "realisticIDs"?
Tests which expect the trace ID to be 42, e.g.:

datadog-trace-agent/agent/receiver_test.go, line 198 in fbc904c:
assert.Equal(uint64(42), span.TraceID)
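A hypothetical usage contrast, assuming the fixtures package path and that the flag switches between the old hardcoded fixture IDs and freshly generated ones:

```go
package main

import (
	"fmt"

	"github.com/DataDog/datadog-trace-agent/fixtures"
)

func main() {
	// Keep deterministic IDs for tests that assert on hardcoded values
	// (e.g. TraceID == 42).
	fixed := fixtures.GetTestTrace(1, 1, false)

	// Use unique, realistic IDs for end-to-end receiver tests so spans
	// are not dropped as duplicates.
	realistic := fixtures.GetTestTrace(10, 20, true)

	fmt.Println(len(fixed), len(realistic))
}
```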
Looks good to me. The problem has been reproduced, and with the patch it no longer happens.
The test doesn't cover exactly the real issue, but the two issues are closely related. Indeed, without the patch the regression test fails, so for now it's good enough to include this change in the next release.
Thanks everyone!
This PR targets a bug we've observed where the agent sometimes sends an HTTP 200 response before the request body has been fully read, which leads to the connection being reset and traces being dropped.
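In generic terms the rule in the title is: every early-return path must consume what is left of the body before replying. A standalone sketch of that rule (not the agent's actual handler; names and routes are made up):

```go
package main

import (
	"encoding/json"
	"io"
	"io/ioutil"
	"net/http"
)

// handlePayload illustrates the "always drain" rule: both the error path
// and the success path read the body to completion before the response
// is sent, so the server never answers while the client is still writing.
func handlePayload(w http.ResponseWriter, req *http.Request) {
	var payload []map[string]interface{} // stand-in for the real trace payload type
	if err := json.NewDecoder(req.Body).Decode(&payload); err != nil {
		io.Copy(ioutil.Discard, req.Body)
		http.Error(w, "cannot decode payload", http.StatusBadRequest)
		return
	}
	io.Copy(ioutil.Discard, req.Body) // drain any trailing bytes
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.ListenAndServe("localhost:8126", http.HandlerFunc(handlePayload))
}
```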