
Elasticsearch exporter: event publishing #2395

Merged: 7 commits merged into open-telemetry:main on Mar 2, 2021

Conversation


@urso urso commented Feb 22, 2021

Description: Implement per document retry behavior and a set of tests verifying that the retry behavior is correctly configured and executed in conjunction with the BulkIndexer.

The change introduces helpers for creating a fake Elasticsearch endpoint that accepts bulk requests. This allows us to test the retry behavior for different failure scenarios.

Link to tracking Issue: #1800

Testing: TestExporter_PushEvent tests the retry behavior for multiple error scenarios, like HTTP requests failing or single documents being rejected due to overload.

Documentation: None. No user-visible changes.

@urso urso requested a review from a team February 22, 2021 00:31

codecov bot commented Feb 22, 2021

Codecov Report

Merging #2395 (f989e3c) into main (3b70661) will increase coverage by 21.56%.
The diff coverage is n/a.


```
@@             Coverage Diff             @@
##             main    #2395       +/-   ##
===========================================
+ Coverage   69.33%   90.89%   +21.56%
===========================================
  Files          34      408      +374
  Lines        1601    20422    +18821
===========================================
+ Hits         1110    18562    +17452
- Misses        416     1398      +982
- Partials       75      462      +387
```
Flag Coverage Δ
integration 69.33% <ø> (ø)
unit 89.73% <ø> (?)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
processor/k8sprocessor/processor.go 100.00% <0.00%> (ø)
...eceiver/k8sclusterreceiver/collection/collector.go 41.33% <0.00%> (ø)
exporter/datadogexporter/metrics_translator.go 79.25% <0.00%> (ø)
receiver/receivercreator/observerhandler.go 63.63% <0.00%> (ø)
...eiver/awsxrayreceiver/internal/translator/cause.go 100.00% <0.00%> (ø)
...transformprocessor/operation_delete_label_value.go 100.00% <0.00%> (ø)
exporter/loadbalancingexporter/factory.go 100.00% <0.00%> (ø)
receiver/receivercreator/rules.go 64.28% <0.00%> (ø)
exporter/sentryexporter/utils.go 100.00% <0.00%> (ø)
exporter/f5cloudexporter/factory.go 91.66% <0.00%> (ø)
... and 386 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@gramidt
Member

gramidt commented Feb 22, 2021

Codecov should pass once #2391 and #2389 are merged.

urso added 4 commits February 23, 2021 17:00
Implement per document retry behavior and a set of tests verifying that
the retry behavior is correctly configured and executed in conjunction
with the BulkIndexer.

The change introduces helpers for creating a fake Elasticsearch endpoint
that accepts bulk requests. This allows us to test the retry behavior
in case of multiple scenarios.
@urso urso force-pushed the elasticsearch-exporter-retry branch from aa85fc8 to 1caeaf2 Compare February 23, 2021 16:00
@urso
Author

urso commented Feb 23, 2021

Hm, CI failed in an unrelated unit test:

```
ok  	github.com/open-telemetry/opentelemetry-collector-contrib/receiver/awsxrayreceiver/internal/translator	0.079s	coverage: 97.7% of statements
--- FAIL: TestSocketReadIrrecoverableNetError (0.01s)
    obsreporttest.go:160:
        	Error Trace:	obsreporttest.go:160
        	            				obsreporttest.go:113
        	            				poller_test.go:301
        	Error:      	could not find tags
        	Test:       	TestSocketReadIrrecoverableNetError
        	Messages:   	wantTags: [{{receiver} TestSocketReadIrrecoverableNetError} {{transport} udp}] in rows []
FAIL
```

CircleCI does not allow me to rerun the test (without pushing a change)...

@gramidt
Member

gramidt commented Feb 23, 2021

@urso - Yeah, not being able to re-run the test from CircleCI is a pain point; however, you can add a commit with --allow-empty to re-kick it off.

@tigrannajaryan
Member

@axw @simitt @jalvz you are codeowners of elasticexporter. Please review this PR.

@urso please add @axw @simitt @jalvz (or other Elastic contributors) as codeowners for the elasticsearchexporter directory.

@urso
Author

urso commented Feb 24, 2021

(or other Elastic contributors) as codeowners for elasticsearchexporter directory.

PR adding Codeowners for the elasticsearchexporter directory: #2462

@axw
Contributor

axw commented Feb 25, 2021

I will leave it to the exporter/elasticsearchexporter codeowners to review. @urso please let me know if you'd like additional review.

@faec faec left a comment
Looks good to me, some small comments about naming and the details of the retry logic

```go
EnableRetryOnTimeout: true,
MaxRetries:           config.Retry.MaxRequests,
DisableRetry:         !config.Retry.Enabled,
EnableRetryOnTimeout: !config.Retry.Enabled,
```

It seems counterintuitive that both DisableRetry and EnableRetryOnTimeout are set to !config.RetryEnabled, should the second line have the ! removed? If not, a clarifying comment would be nice

@urso (Author)

Nice catch. Yes, the second line should not have the `!`.

```go
MaxRetries:           config.Retry.MaxRequests,
DisableRetry:         !config.Retry.Enabled,
EnableRetryOnTimeout: !config.Retry.Enabled,
MaxRetries:           maxAttempts,
```

Naming clarity: I generally think of "attempts" and "retries" as being off by 1 -- i.e. if the max "attempts" is 3, that means you only "retry" (try again) twice. This convention isn't universal, but it seems confusing to use both terms in close proximity. It would be nice to stick with the same term in both structs and local variables if possible, or to note the discrepancy at the definition sites if not.

@urso (Author) commented Feb 25, 2021

The struct is not defined in the exporter but in the go-elasticsearch dependency. While debugging the tests I learned that the library starts the count at 1 (if 0 is passed, the library defaults to 3 retries instead of disabling retries). I will add a clarifying comment to maxAttempts. After testing I found the behavior is closer to an attempt count, which is why I named the variable maxAttempts.

I generally think of "attempts" and "retries" as being off by 1 -- i.e. if the max "attempts" is 3, that means you only "retry" (try again) twice.

Yes, I did expect the same. I did open a bug report: elastic/go-elasticsearch#233

The tests did fail with the original value of MaxRetries. If the semantics change upstream, the tests will fail when we update the dependency.

```go
// selective ACKing in the bulk response.
item.OnFailure = func(ctx context.Context, item esBulkIndexerItem, resp esBulkIndexerResponseItem, err error) {
	switch {
	case e.maxRetries < 0 || attempts < e.maxRetries && shouldRetryEvent(resp.Status):
```

It's not obvious to me how maxRetries and Retry.Enabled interact. Is it possible to have Retry.Enabled set to false while still setting maxRetries > 1? If so, there should probably be a check for that here, or a check in newElasticSearchClient that overwrites the configured maxRetries if Retry.Enabled is false. If not, a comment explaining why this can't arise would be nice :-)

@urso (Author)

If Retry.Enabled is false, then events should be dropped immediately on error. I used to have a check somewhere. Let me check, and maybe add more tests if necessary to verify that Retry.Enabled=false really disables all retry attempts.

@urso (Author)

Renaming maxRetries to maxAttempts. maxAttempts is actually configured in newExporter. If retry is disabled we set maxAttempts = 1, so in OnFailure the retry predicate always evaluates to false thanks to the proper initialization of maxAttempts.

Note: newElasticsearchClient configures the HTTP client. If the HTTP request succeeds, the bulk indexer reports the per-item status to OnSuccess and OnFailure. The client handles HTTP-level retries for us, but per-item retries we have to handle ourselves.

urso added 2 commits February 26, 2021 00:27
retrigger CI after unrelated test failure on windows:

```
--- FAIL: Test_statsdreceiver_EndToEnd (0.00s)
    --- FAIL: Test_statsdreceiver_EndToEnd/default_config_with_9s_interval (0.00s)
        receiver_test.go:136:
            	Error Trace:	receiver_test.go:136
            	Error:      	Received unexpected error:
            	            	listen udp 127.0.0.1:50633: bind: An attempt was made to access a socket in a way forbidden by its access permissions.
            	Test:       	Test_statsdreceiver_EndToEnd/default_config_with_9s_interval
FAIL
FAIL	github.com/open-telemetry/opentelemetry-collector-contrib/receiver/statsdreceiver
```
@bogdandrutu bogdandrutu merged commit 6d75ea4 into open-telemetry:main Mar 2, 2021
pmatyjasek-sumo referenced this pull request in pmatyjasek-sumo/opentelemetry-collector-contrib Apr 28, 2021
* Elasticsearch exporter: event publishing
* add missing header
* rename documentAction to itemRequest
* typo
* rerun CI
* review + more retry disabled tests
* test
@urso urso deleted the elasticsearch-exporter-retry branch May 25, 2021 22:40
ljmsc referenced this pull request in ljmsc/opentelemetry-collector-contrib Feb 21, 2022

7 participants