Opentelemetry: Race condition between connection shutdown and export can result in duplicate spans #44894
Comments
/cc @brunobat (opentelemetry), @radcortez (opentelemetry)
Here is the original Zulip-chat for this issue: https://quarkusio.zulipchat.com/#narrow/channel/187030-users/topic/Quarkus-Opentelemetry.3A.20Sporadic.20occurence.20of.20duplicate.20spans
From the chat:
I might, but don't count on it :)
Thanks a lot for providing such great details on the issue! One question: have you actually seen
Hi @geoand sorry for not replying earlier. Somehow missed that comment.
Gotcha, thanks
Describe the bug
When roughly 10 seconds pass between two exports of traces/spans, the exporter can close the connection after the next spans have been sent but before the otel-collector responds with OK.
This is what it looks like on TCP-level:
Here is the actual tcp dump containing the example: /example/tcp_traffic_http.pcap
The server's OK in step 6 is probably dropped (in line with the HTTP standard). This results in a timeout, followed by a retry of the export.
As the otel-collector apparently doesn't deduplicate spans, this results in duplicate spans in the trace.
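To make the effect of the retry concrete, here is a minimal sketch of how a lost OK plus a retry can end up as duplicate spans when the receiving side does not deduplicate. All names (`RetryDuplicateDemo`, `FakeCollector`, `export`) are hypothetical and only model the behaviour described above; they are not taken from Quarkus, the OpenTelemetry SDK, or the otel-collector.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model: none of these names come from Quarkus or the OpenTelemetry SDK.
class RetryDuplicateDemo {

    /** Stand-in for a collector that stores every span it receives, without deduplication. */
    static class FakeCollector {
        final List<String> storedSpanIds = new ArrayList<>();

        void receive(List<String> spanIds) {
            storedSpanIds.addAll(spanIds);
        }
    }

    /**
     * Sends one batch. The collector always stores it; if the OK is lost on the way back,
     * the client sees a timeout and sends the same batch once more.
     */
    static void export(FakeCollector collector, List<String> batch, boolean responseLost) {
        collector.receive(batch);      // first attempt reaches the collector
        if (responseLost) {
            collector.receive(batch);  // retry after the timeout
        }
    }

    public static void main(String[] args) {
        FakeCollector collector = new FakeCollector();
        export(collector, List.of("span-a", "span-b"), true);
        System.out.println(collector.storedSpanIds);
    }
}
```

With `responseLost` set to `true`, the collector stores each span twice, even though the exporter only intended to send the batch once.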
This is what the trace should look like: trace_normal.json
This is what the trace looks like when the issue occurs: trace_with_duplicates.json
From what I can see in the code, I assume this is a race condition in which the connection shutdown is triggered but the isShutdown flag is not set before the next spans are exported. I imagine something like this is happening:
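As an illustration only, the suspected ordering can be replayed deterministically in a single thread. The `Connection` class below, with its `isShutdown` and `socketOpen` fields, is a hypothetical model of a check-then-act race and is not the actual exporter or Vert.x code; it merely assumes that the flag is set only once the shutdown completes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the suspected race; not the actual exporter code.
class ShutdownRaceSketch {

    static class Connection {
        boolean isShutdown = false; // flag checked by the exporter before sending
        boolean socketOpen = true;  // state of the underlying TCP connection
        final List<String> events = new ArrayList<>();

        /** Exporter side: check the flag, then send (check-then-act, no locking). */
        boolean trySend() {
            if (isShutdown) {
                events.add("export: connection shut down, a new one would be opened");
                return false;
            }
            events.add("export: spans sent on existing connection");
            return true;
        }

        /** Shutdown side: the flag is only set once the shutdown completes. */
        void completeShutdown() {
            isShutdown = true;
            socketOpen = false;
            events.add("shutdown: flag set, connection closed");
        }

        /** The collector's OK only reaches the exporter if the connection is still open. */
        boolean deliverOk() {
            events.add(socketOpen ? "response: OK received" : "response: OK dropped");
            return socketOpen;
        }
    }

    /** Replays the problematic order; returns true if the export would time out and be retried. */
    static boolean replayRace(Connection c) {
        c.events.add("shutdown: triggered after idle period, flag not yet set");
        boolean sent = c.trySend();           // exporter still sees isShutdown == false
        c.completeShutdown();                 // shutdown finishes after the request went out
        boolean acknowledged = c.deliverOk(); // OK arrives on a closed connection
        return sent && !acknowledged;
    }

    public static void main(String[] args) {
        Connection c = new Connection();
        boolean retryNeeded = replayRace(c);
        c.events.forEach(System.out::println);
        System.out.println("retry needed: " + retryNeeded);
    }
}
```

In this model the race disappears if the flag is set at the moment the shutdown is triggered, because `trySend` would then refuse to reuse the connection. Whether that matches the real code path is an assumption.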
Expected behavior
The exporter should not export the same traces twice to the otel-collector.
Actual behavior
The exporter sometimes (probably due to a race condition) closes the connection to the otel-collector right after spans were exported, but before the otel-collector could respond with OK. It then reopens the connection and sends the spans a second time.
How to Reproduce?
Reproducer: https://github.com/arn-ivu/reproducer-duplicate-spans
Requirements:
Steps to reproduce:
./mvnw compile quarkus:dev
Find duplicate spans in Grafana:
{resource.service.name="reproducer-duplicate-spans"} | count() > 2
Monitor the TCP traffic with Wireshark
- check the OpenTelemetry port OTEL_PORT
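Besides the Grafana query above, duplicates can also be checked offline once the span IDs of a trace have been extracted (for example from a file such as trace_with_duplicates.json). The sketch below assumes that a re-exported span keeps its original span ID, which the retry scenario suggests but which is not confirmed here. `DuplicateSpanFinder` is a hypothetical helper, and JSON parsing is left out on purpose.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper; assumes duplicated spans share the same span ID.
class DuplicateSpanFinder {

    /** Returns each span ID that occurs more than once, with its number of occurrences. */
    static Map<String, Long> duplicates(List<String> spanIds) {
        Map<String, Long> counts = spanIds.stream()
                .collect(Collectors.groupingBy(id -> id, LinkedHashMap::new, Collectors.counting()));
        counts.values().removeIf(n -> n < 2); // keep only IDs seen at least twice
        return counts;
    }

    public static void main(String[] args) {
        List<String> spanIds = List.of("span-a", "span-b", "span-a", "span-b");
        System.out.println(duplicates(spanIds));
    }
}
```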
Output of uname -a or ver
Linux e5808115ef08 5.15.167.4-microsoft-standard-WSL2 #1 SMP Tue Nov 5 00:21:55 UTC 2024 x86_64 Linux
Output of java -version
openjdk version "22.0.1" 2024-04-16
OpenJDK Runtime Environment Temurin-22.0.1+8 (build 22.0.1+8)
OpenJDK 64-Bit Server VM Temurin-22.0.1+8 (build 22.0.1+8, mixed mode, sharing)
Quarkus version or git rev
3.17.2
Build tool (i.e. output of mvnw --version or gradlew --version)
Apache Maven 3.9.9
Additional information
While I have analyzed the problem mostly for quarkus.otel.exporter.otlp.protocol=http/protobuf, I did observe the same problem with quarkus.otel.exporter.otlp.protocol=grpc.
See https://github.com/arn-ivu/reproducer-duplicate-spans/tree/main/examples for tcp-traffic of both cases