Introduce retries while creating stream message decoder for more robustness #13036
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #13036 +/- ##
============================================
+ Coverage 61.75% 62.17% +0.42%
+ Complexity 207 198 -9
============================================
Files 2436 2515 +79
Lines 133233 137867 +4634
Branches 20636 21335 +699
============================================
+ Hits 82274 85723 +3449
- Misses 44911 45755 +844
- Partials 6048 6389 +341
Flags with carried forward coverage won't be shown.
What kind of failures are these? Are they all transient failures that are likely to go away on retries?
Yes, so far we have mostly seen transient failures. We use internal kafka-clients, so we transiently see client timeouts or similar errors in this code path, which result in segment errors. One example is a JDK bug we found in our internal client, very similar to the user-agent issue called out in #10894. Often this particular bug results in an NPE during
...re/src/main/java/org/apache/pinot/core/data/manager/realtime/RealtimeSegmentDataManager.java
Hey @swaminathanmanish, can you help with review?
Hey @swaminathanmanish, bump on reviewing this.
5753461 to cf054fa
0bf5548 to 5e14a73
LGTM otherwise
…stness (apache#13036)
* Introduce retries while creating stream message decoder to make system more robust
* address comments
* fix build
We have intermittently seen issues in our clusters while creating streamMessageDecoder. Stack trace:
This stops consumption in one of the replicas, and once the other replica starts committing, the stopped replica always ends up in ERROR state. The only way to fix this is to reset that replica's segment.
Having one replica not consuming is also dangerous: if the other replica's host restarts or goes down for any reason, it can lead to data loss.
Having a retry policy during StreamMessageDecoder.create() may help reduce the chances of such scenarios.
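
To illustrate the idea, here is a minimal, self-contained sketch of retrying a decoder factory call with exponential backoff. It is not the code from this PR: the class `DecoderRetrySketch`, the `Decoder` interface, the helper `createWithRetries`, and the attempt and delay values are all hypothetical names and numbers chosen for the example, and it uses plain Java rather than any Pinot utility.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Callable;
import java.util.concurrent.atomic.AtomicInteger;

public class DecoderRetrySketch {

  /** Hypothetical stand-in for a stream message decoder; not Pinot's interface. */
  interface Decoder {
    String decode(byte[] payload);
  }

  /**
   * Calls the factory up to maxAttempts times, sleeping between failed attempts.
   * The delay starts at initialDelayMs and is multiplied by scaleFactor each time.
   */
  static <T> T createWithRetries(Callable<T> factory, int maxAttempts, long initialDelayMs,
      double scaleFactor) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be at least 1");
    }
    long delayMs = initialDelayMs;
    Exception lastFailure = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return factory.call();
      } catch (Exception e) {
        // Assumption: every exception (timeouts, NPEs, ...) is treated as transient.
        lastFailure = e;
        if (attempt == maxAttempts) {
          break;
        }
        Thread.sleep(delayMs);
        delayMs = (long) (delayMs * scaleFactor);
      }
    }
    throw new Exception("Failed to create decoder after " + maxAttempts + " attempts", lastFailure);
  }

  public static void main(String[] args) throws Exception {
    AtomicInteger calls = new AtomicInteger();
    // Simulated factory: fails twice with a transient error, then succeeds.
    Decoder decoder = DecoderRetrySketch.<Decoder>createWithRetries(() -> {
      if (calls.incrementAndGet() < 3) {
        throw new IllegalStateException("simulated transient client timeout");
      }
      return payload -> new String(payload, StandardCharsets.UTF_8);
    }, 5, 10L, 2.0);
    System.out.println("created after " + calls.get() + " attempts");
    System.out.println(decoder.decode("hello".getBytes(StandardCharsets.UTF_8)));
  }
}
```

With a bounded number of attempts, a persistent failure still surfaces as an error with the last cause attached, so the segment is not left retrying forever. A real implementation might also retry only on exception types known to be transient.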