This repository has been archived by the owner on Aug 30, 2019. It is now read-only.

Always drain requests before responding #341

Merged
palazzem merged 3 commits into master from bmermet/samplingdrainsrequess on Jan 8, 2018

Conversation

@bmermet commented Dec 12, 2017

This PR targets a bug we've observed where the agent sometimes sends an HTTP 200 response before the request body has been fully read, which leads to the connection being reset and traces being dropped.
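For readers skimming the thread, here is a minimal, self-contained sketch of the pattern the patch applies (the handler name, route, and port are illustrative, not the agent's actual code): the server drains the request body to completion before writing any response, even when it intends to ignore the payload.

```go
package main

import (
	"io"
	"io/ioutil"
	"net/http"
)

// handleTraces is a stand-in for the agent's real handler: even when the
// payload is going to be ignored (e.g. the pre-sampler decided to drop it),
// the body is drained before any response is written, so the client never
// has unread data left on a connection the server is already done with.
func handleTraces(w http.ResponseWriter, req *http.Request) {
	io.Copy(ioutil.Discard, req.Body) // read whatever is left of the body
	req.Body.Close()
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/v0.3/traces", handleTraces)
	http.ListenAndServe("localhost:8126", nil)
}
```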

@palazzem palazzem added this to the 5.20.1 milestone Dec 12, 2017
@palazzem palazzem added the core label Dec 12, 2017
@palazzem palazzem self-requested a review December 12, 2017 15:32
@palazzem left a comment

Let's do more testing as we've discussed, though the approach should fix the problem with the pre-sampler.

@@ -195,6 +196,7 @@ func (r *HTTPReceiver) replyTraces(v APIVersion, w http.ResponseWriter) {
 // handleTraces knows how to handle a bunch of traces
 func (r *HTTPReceiver) handleTraces(v APIVersion, w http.ResponseWriter, req *http.Request) {
 	if !r.preSampler.Sample(req) {
+		io.Copy(ioutil.Discard, req.Body)


👍

 	trace := model.Trace{}
 	for j := 0; j < size; j++ {
-		trace = append(trace, GetTestSpan())
+		span := GetTestSpan()
+		if realisticIDs {

I introduced this because I noticed that in some tests, when in debug mode, you'd see many log messages saying traces had been dropped because they contained spans with the same ID. And this is legit. It's a nitpick, but unrealistic traces make the tests less realistic, so they might miss interesting real-world cases.
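To make the "realistic IDs" idea concrete, here is a small, self-contained sketch (the Span struct and getTestTrace helper are stand-ins, not the agent's model types): the IDs come from a seeded RNG, so every span gets a distinct trace/span ID while test runs stay deterministic.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Span is a stand-in for the agent's span model, reduced to the ID fields.
type Span struct {
	TraceID uint64
	SpanID  uint64
}

// getTestTrace builds `size` test spans; with realisticIDs the IDs are drawn
// from a seeded RNG, so they are all distinct yet reproducible across runs.
func getTestTrace(size int, realisticIDs bool, rng *rand.Rand) []Span {
	trace := make([]Span, 0, size)
	for j := 0; j < size; j++ {
		// Fixed placeholder IDs, like the tests that assert TraceID == 42.
		span := Span{TraceID: 42, SpanID: 42}
		if realisticIDs {
			span.TraceID = rng.Uint64()
			span.SpanID = rng.Uint64()
		}
		trace = append(trace, span)
	}
	return trace
}

func main() {
	rng := rand.New(rand.NewSource(42)) // fixed seed: same IDs (and byte counts) every run
	fmt.Println(getTestTrace(3, true, rng))
}
```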

@@ -519,17 +523,80 @@ func TestHandleTraces(t *testing.T) {
 	ts, ok := rs.Stats[info.Tags{Lang: lang}]
 	assert.True(ok)
 	assert.Equal(int64(20), ts.TracesReceived)
-	assert.Equal(int64(57622), ts.TracesBytes)
+	assert.Equal(int64(59222), ts.TracesBytes)

Slight increase because trace IDs and span IDs are longer. It's still predictable (seeded random) so no flakiness.

// chunkedReader is a reader which forces partial reads; this is required
// to trigger some network-related bugs, such as the body not being read fully by the server.
// Without this, all the data could be read/written at once, never triggering the issue.
type chunkedReader struct {

That was the tricky part. Without this it's impossible to reproduce the bug.
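For readers without the full diff, a self-contained sketch of the trick (the chunk size and the demo in main are illustrative, not the test's exact code): capping every Read forces the HTTP client to send the body in many small writes, which is what makes the "server replies before the body is fully read" race reproducible.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"io/ioutil"
)

// chunkedReader wraps another reader and caps every Read at a small size,
// forcing whoever consumes it to perform many partial reads instead of
// getting the whole payload in one call.
type chunkedReader struct {
	reader io.Reader
}

func (cr chunkedReader) Read(p []byte) (int, error) {
	const maxChunk = 1024 // cap per Read; the exact value is illustrative
	if len(p) > maxChunk {
		p = p[:maxChunk]
	}
	return cr.reader.Read(p)
}

func main() {
	payload := bytes.Repeat([]byte("x"), 5000)
	n, err := io.Copy(ioutil.Discard, chunkedReader{reader: bytes.NewReader(payload)})
	fmt.Println(n, err) // 5000 <nil>, delivered in several partial reads
}
```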

	}}
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {

I'm not totally sure this goroutine part is strictly mandatory, but it's more realistic anyway and doesn't cost that much.


	// Make sure we use shared clients, and that they are reused.
	client := &http.Client{Transport: &http.Transport{
		MaxIdleConnsPerHost: 100,

Increasing MaxIdleConnsPerHost is common practice to ensure the connection is reused.
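Putting the two test ingredients above together, here is a rough, self-contained sketch (URL, payload, and iteration counts are placeholders, not the test's actual values): a single shared http.Client with a raised MaxIdleConnsPerHost, driven from several goroutines so connections are reused across requests.

```go
package main

import (
	"bytes"
	"log"
	"net/http"
	"sync"
)

func main() {
	// One shared client for every goroutine; raising MaxIdleConnsPerHost
	// makes it very likely the same underlying connections are reused.
	client := &http.Client{Transport: &http.Transport{
		MaxIdleConnsPerHost: 100,
	}}

	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < 10; j++ {
				body := bytes.Repeat([]byte("x"), 64*1024) // placeholder payload
				resp, err := client.Post("http://localhost:8126/v0.3/traces", "application/json", bytes.NewReader(body))
				if err != nil {
					log.Println("request failed:", err)
					continue
				}
				resp.Body.Close() // keep the connection reusable
			}
		}()
	}
	wg.Wait()
}
```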

@bmermet (Author) commented Jan 8, 2018

Great job finding a way to trigger this bug 👍

Though I'm not sure it is exactly the same bug we've been observing in real use cases. The bug triggered by this test disappears if we don't send requests concurrently from several goroutines, whereas dd-trace-go flushes traces from a single goroutine only. On top of that, the error message is different: here it is "HTTP/1.x transport connection broken: write tcp 127.0.0.1:60168->127.0.0.1:60161: write: broken pipe", while what we observed in the real-world bug was "HTTP/1.x transport connection broken: write tcp 127.0.0.1:49133->127.0.0.1:8126: write: connection reset by peer".

The two bugs look extremely similar and certainly share a common cause, since this patch fixes both. I would like to better understand the exact cause, but in the meantime I think this test is good enough to validate the fix and ship it.

@@ -83,13 +83,28 @@ func RandomTrace(maxLevels, maxSpans int) model.Trace {
 
 // GetTestTrace returns a []Trace that is composed by ``traceN`` number
 // of traces, each one composed by ``size`` number of spans.
-func GetTestTrace(traceN, size int) model.Traces {
+func GetTestTrace(traceN, size int, realisticIDs bool) model.Traces {
@bmermet (Author):

What are the cases where we would not want these "realisticIDs"?


Tests which expect the traceID to be 42, e.g.:

assert.Equal(uint64(42), span.TraceID)

@palazzem left a comment

Looks good to me. The problem has been reproduced, and with the patch it's not happening again.

The test doesn't exactly cover the real issue, but the two issues are closely related. Indeed, without the patch the regression test fails, so for now it's good enough to include this change in the next release.

Thanks everyone!

@palazzem palazzem merged commit d473c05 into master Jan 8, 2018
@palazzem palazzem deleted the bmermet/samplingdrainsrequess branch January 8, 2018 17:40
@palazzem palazzem modified the milestone: 5.21.1 Jan 27, 2018