fix: Fix batching logic with write records, introduce concurrent requests #8947
Conversation
Thanks so much for the pull request!
🤝 ✒️ Just a reminder that the CLA has not yet been signed, and we'll need it before merging. Please sign the CLA when you get a chance, then post a comment here saying !signed-cla
Looks like the test file needs to be formatted
Fixed the gofmt issue with timestream_test.go.
The change looks thorough, but it's a significant one. I'm not sure about the parallelism.
if err := t.writeToTimestream(writeRecordsInput, true); err != nil {
	return err
}
go func(inp *timestreamwrite.WriteRecordsInput) {
What's the upper bound on the number of goroutines we're going to launch here? If I set my batch size to something ridiculously large, could we end up with hundreds or thousands of concurrent requests? Generally I'm also not a fan of parallel writes in the output here; you're typically not going to see much improvement in throughput.
You are right, currently it is unbounded.
I can add a semaphore that puts an upper bound on the number of concurrent goroutines:
https://github.com/golang/sync/blob/master/semaphore/semaphore.go
We are making this change because we observed that metrics were being dropped when the serial requests took too long. After making this change, throughput improved and the metric drops stopped.
Thoughts?
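A minimal sketch of what a semaphore-bounded write loop could look like using golang.org/x/sync/semaphore; the names here (writeAllBounded, batch, maxWriteRoutines) are illustrative stand-ins, not the plugin's actual identifiers:

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/semaphore"
)

// batch is a simplified stand-in for *timestreamwrite.WriteRecordsInput.
type batch struct{ id int }

// writeToTimestream is a placeholder for the plugin's actual write call.
func writeToTimestream(b batch) error {
	fmt.Println("writing batch", b.id)
	return nil
}

// writeAllBounded never has more than maxWriteRoutines writes in flight.
func writeAllBounded(batches []batch, maxWriteRoutines int64) error {
	ctx := context.Background()
	sem := semaphore.NewWeighted(maxWriteRoutines)

	for _, b := range batches {
		// Blocks until one of the maxWriteRoutines slots is free.
		if err := sem.Acquire(ctx, 1); err != nil {
			return err
		}
		go func(b batch) {
			defer sem.Release(1)
			_ = writeToTimestream(b) // error handling elided in this sketch
		}(b)
	}

	// Acquiring the full weight waits for all in-flight writes to finish.
	return sem.Acquire(ctx, maxWriteRoutines)
}

func main() {
	_ = writeAllBounded([]batch{{1}, {2}, {3}}, 2)
}
```

Acquiring the full weight at the end is a common way to wait for all workers without a separate WaitGroup.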
Apologies for the delay in replying to this; we're catching up a little on outstanding PRs.
Yup, putting an upper bound on the concurrent requests sounds like a good idea; please proceed with that.
Added the upper bound on the concurrency and introduced a configuration parameter for it, so users can decide on the level of concurrency themselves.
// On partial failures, Telegraf will reject the entire batch of metrics and
// retry. writeToTimestream will return retryable exceptions only.
err, _ := <-errs
You need to read from this channel len(writeRecordsInputs) times; then you can drop the WaitGroup, because the reads act as a natural block.
On line 355, we are only adding to the errs channel if err != nil.
Reading it n times currently just blocks forever when no error occurs.
I can try removing the != nil check and see if it works uninterrupted.
Yes, that sounds reasonable, to remove the nil check and always return the result from writeToTimestream.
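Sketching that out, a minimal version of the pattern being agreed on here, with simplified stand-in types and names (writeAll, the int batches, and the error text are illustrative, not the plugin's actual code); concurrency bounding is omitted to keep the focus on counting the channel reads:

```go
package main

import "fmt"

// writeToTimestream stands in for the plugin's write call; with the nil
// check removed, every goroutine sends its result, nil or not.
func writeToTimestream(batch int) error {
	if batch%3 == 0 {
		return fmt.Errorf("retryable error on batch %d", batch)
	}
	return nil
}

// writeAll reads exactly len(batches) results from the channel, which acts
// as the natural block, so no WaitGroup is needed.
func writeAll(batches []int) error {
	errs := make(chan error, len(batches)) // buffered so senders never block

	for _, b := range batches {
		go func(b int) {
			errs <- writeToTimestream(b) // send nil results too
		}(b)
	}

	var firstErr error
	for range batches {
		if err := <-errs; err != nil && firstErr == nil {
			firstErr = err
		}
	}
	return firstErr
}

func main() {
	fmt.Println(writeAll([]int{1, 2, 3, 4, 5}))
}
```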
I have added a range over the channel. It exits on the first encountered error. But since the channel is already closed before that, it should still be able to be garbage collected, right?
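For reference, a hedged sketch of the close-then-range variant being described, again with illustrative names rather than the plugin's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// writeToTimestream stands in for the plugin's write call.
func writeToTimestream(batch int) error {
	if batch == 3 {
		return fmt.Errorf("retryable error on batch %d", batch)
	}
	return nil
}

func writeAllRange(batches []int) error {
	errs := make(chan error, len(batches))
	var wg sync.WaitGroup

	for _, b := range batches {
		wg.Add(1)
		go func(b int) {
			defer wg.Done()
			errs <- writeToTimestream(b)
		}(b)
	}

	// Close the channel only after every goroutine has sent its result;
	// buffered values remain readable after close.
	wg.Wait()
	close(errs)

	// Range over the results and return on the first error. Exiting early
	// leaves unread values behind, but since the channel is closed and no
	// longer referenced afterwards, it is eligible for garbage collection.
	for err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}

func main() {
	fmt.Println(writeAllRange([]int{1, 2, 3, 4}))
}
```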
!signed-cla
Thanks so much for the pull request!
Nirmesh, CCLA-wise I believe you're waiting on the Influx folks to update a database on their side.
!signed-cla
Any update on this PR?
Hi, sorry for the delay in addressing this. I'm picking it up again now and will update the pull request with everything addressed.
Force-pushed from 3f01a9d to 46bb44b
I've spent some time today testing the artifacts, and they clearly perform better than the existing Telegraf binaries. I haven't thrown full production load at it, but it is now surviving our pre-prod throughput tests, which version 1.21.1 manifestly doesn't. In a side-by-side test with another metrics platform we are getting very similar results, so we don't seem to be dropping metrics. Looking forward to this getting merged.
Force-pushed from b66c1e1 to 834a9fe
📦 Looks like new artifacts were built from this PR.
@powersj Fixed the lint error. Are all checks passing now? I have rebased it on top of the latest changes as well.
Required for all PRs: