-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Elasticsearch exporter: event publishing #2395
Elasticsearch exporter: event publishing #2395
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2395 +/- ##
===========================================
+ Coverage 69.33% 90.89% +21.56%
===========================================
Files 34 408 +374
Lines 1601 20422 +18821
===========================================
+ Hits 1110 18562 +17452
- Misses 416 1398 +982
- Partials 75 462 +387
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Implement per document retry behavior and a set of tests verifying that the retry behavior is correctly configured and executed in conjunction with the BulkIndexer. The change introduces helpers for creating a fake Elasticsearch endpoint that accepts bulk requests. This allows us to test the retry behavior in case of multiple scenarios.
aa85fc8
to
1caeaf2
Compare
hm, CI failed in unrelated unit test:
CircleCI does not allow me to rerun the test (without pushing a change)... |
@urso - Yeah, not being able to re-run the test from CircleCI is a pain point; however, you can add a commit with |
PR adding Codeowners for the elasticsearchexporter directory: #2462 |
I will leave it to the exporter/elasticsearchexporter codeowners to review. @urso please let me know if you'd like additional review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, some small comments about naming and the details of the retry logic
EnableRetryOnTimeout: true, | ||
MaxRetries: config.Retry.MaxRequests, | ||
DisableRetry: !config.Retry.Enabled, | ||
EnableRetryOnTimeout: !config.Retry.Enabled, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems counterintuitive that both DisableRetry
and EnableRetryOnTimeout
are set to !config.RetryEnabled
, should the second line have the !
removed? If not, a clarifying comment would be nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch. Yes, the second should not be not.
MaxRetries: config.Retry.MaxRequests, | ||
DisableRetry: !config.Retry.Enabled, | ||
EnableRetryOnTimeout: !config.Retry.Enabled, | ||
MaxRetries: maxAttempts, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming clarity: I generally think of "attempts" and "retries" as being off by 1 -- i.e. if the max "attempts" is 3, that means you only "retry" (try again) twice. This convention isn't universal, but it seems confusing to use both terms in close proximity. It would be nice to stick with the same term in both structs and local variables if possible, or to note the discrepancy at the definition sites if not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The structs is not defined in the exporter but the go-elasticsearch dependency. Debugging the tests I learned that the library starts the count with 1 (if 0 is passed the library will default to 3 retries instead of disabling retries). Will add a clarifying comment to maxAttempts
. After testing I figured that the behavior is more similar to an attempt, which is what I named the variable with.
I generally think of "attempts" and "retries" as being off by 1 -- i.e. if the max "attempts" is 3, that means you only "retry" (try again) twice.
Yes, I did expect the same. I did open a bug report: elastic/go-elasticsearch#233
The tests did fail with the original value of MaxRetries. In case semantics will be changed we will have the tests fail when updating.
// selective ACKing in the bulk response. | ||
item.OnFailure = func(ctx context.Context, item esBulkIndexerItem, resp esBulkIndexerResponseItem, err error) { | ||
switch { | ||
case e.maxRetries < 0 || attempts < e.maxRetries && shouldRetryEvent(resp.Status): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not obvious to me how maxRetries
and Retry.Enabled
interact. Is it possible to have Retry.Enabled
set to false while still setting maxRetries > 1
? If so, there should probably be a check for that here, or a check in newElasticSearchClient
that overwrites the configured maxRetries
if Retry.Enabled
is false. If not, a comment explaining why this can't arise would be nice :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if Retry.Enabled
is false, then events should be dropped immediately on error. I used to have a check somewhere. Let me check and maybe add more tests if necessary to verify Retry.Enabled=false
really disables all retry attempts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renaming maxRetries
to maxAttempts
. The maxAttempts
is actually configured in newExporter
. In case retry is disabled we set maxAttempts = 1
. On Failure the predicate always evaluates to false thanks to the proper initialization of maxAttempts
.
Note: the newElasticsearchClient configures the HTTP client. If the http request did succeed, we will have the bulk indexer report the per item status to OnSuccess
and OnFailure
. The client handles the HTTP retries for us, but the per item retries we have to handle ourselves.
retrigger CI after unrelated test failure on windows: ``` --- FAIL: Test_statsdreceiver_EndToEnd (0.00s) --- FAIL: Test_statsdreceiver_EndToEnd/default_config_with_9s_interval (0.00s) receiver_test.go:136: Error Trace: receiver_test.go:136 Error: Received unexpected error: listen udp 127.0.0.1:50633: bind: An attempt was made to access a socket in a way forbidden by its access permissions. Test: Test_statsdreceiver_EndToEnd/default_config_with_9s_interval FAIL FAIL github.com/open-telemetry/opentelemetry-collector-contrib/receiver/statsdreceiver ```
* Elasticsearch exporter: event publishing Implement per document retry behavior and a set of tests verifying that the retry behavior is correctly configured and executed in conjunction with the BulkIndexer. The change introduces helpers for creating a fake Elasticsearch endpoint that accepts bulk requests. This allows us to test the retry behavior in case of multiple scenarios. * add missing header * rename documentAction to itemRequest * typo * rerun CI * review + more retry disabled tests * test retrigger CI after unrelated test failure on windows: ``` --- FAIL: Test_statsdreceiver_EndToEnd (0.00s) --- FAIL: Test_statsdreceiver_EndToEnd/default_config_with_9s_interval (0.00s) receiver_test.go:136: Error Trace: receiver_test.go:136 Error: Received unexpected error: listen udp 127.0.0.1:50633: bind: An attempt was made to access a socket in a way forbidden by its access permissions. Test: Test_statsdreceiver_EndToEnd/default_config_with_9s_interval FAIL FAIL github.com/open-telemetry/opentelemetry-collector-contrib/receiver/statsdreceiver ```
Signed-off-by: Bogdan Drutu <[email protected]>
Description: Implement per document retry behavior and a set of tests verifying that the retry behavior is correctly configured and executed in conjunction with the BulkIndexer.
The change introduces helpers for creating a fake Elasticsearch endpoint that accepts bulk requests. This allows us to test the retry behavior for different failure scenarios.
Link to tracking Issue: #1800
Testing:
TestExporter_PushEvent
tests the retry behavior for multiple error scenarios like http requests failing, or single documents being rejected due to overload.Documentation: None. No user visible changes