
Add option to propagate OffsetOutOfRange error #1183

Conversation

@muirdm commented Oct 3, 2018

When consuming a partition using a consumer group, the code
handles ErrOffsetOutOfRange errors by resetting to the "initial"
position, as specified by the user (i.e. either the oldest or newest available
offset). This, however, can be very dangerous. Say a consumer has
consumed up to offset 100 on replica A but replica B has only
replicated up to offset 99 due to temporary under-replication. During
a rebalance, sarama can end up with an offset out-of-range error if it
fetches partition metadata from replica B since the desired offset of
100 is greater than the newest offset of 99. The sarama consumer would
reset the offset in this case, which can cause reprocessing of old
data, especially if the initial offset is configured as "oldest".

This commit adds a config flag to disable this automatic reset. In the
above case, the consumer will be able to proceed normally after the
data replicates.

Resolves #1181
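The scenario above can be sketched in plain Go, without depending on sarama itself. The function and flag names below are illustrative stand-ins for the behaviour the patch changes, not sarama's actual internals:

```go
package main

import (
	"errors"
	"fmt"
)

// errOffsetOutOfRange stands in for sarama.ErrOffsetOutOfRange.
var errOffsetOutOfRange = errors.New("offset out of range")

// resolveOffset mimics the decision discussed in this PR. desired is the
// offset the consumer wants to resume from; oldest/newest describe what the
// replica currently has; initial is the configured fallback position.
func resolveOffset(desired, oldest, newest int64, resetInvalid bool, initial int64) (int64, error) {
	if desired >= oldest && desired <= newest {
		return desired, nil // offset is valid on this replica
	}
	if resetInvalid {
		return initial, nil // current behaviour: silently reset to initial
	}
	return 0, errOffsetOutOfRange // proposed behaviour: propagate the error
}

func main() {
	// Replica B is one message behind: the group has committed up to 100,
	// but this replica's newest offset is 99.
	const desired, oldest, newest = 100, 0, 99

	// With automatic reset enabled and initial == oldest, the consumer
	// silently rewinds to 0 and reprocesses the whole partition.
	off, _ := resolveOffset(desired, oldest, newest, true, oldest)
	fmt.Println("reset enabled, resume at:", off)

	// With reset disabled, the error surfaces and the caller can retry
	// once replication catches up.
	_, err := resolveOffset(desired, oldest, newest, false, oldest)
	fmt.Println("reset disabled, error:", err)
}
```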

@varun06 (Contributor) commented Feb 25, 2019

@dim can you please have a look? Also @muirdm, can you please add a test for this?

@dim (Contributor) commented Feb 28, 2019

@varun06 I don't think this is easily testable. It's not something that can be easily re-created in a test scenario.

@muirdm thanks for the detailed explanation and the patch. One question: how would you handle these cases? Assuming the ErrOffsetOutOfRange error is returned to the user, how would you decide what to do next? I wonder if there is a more generic solution to the problem.

@muirdm (Author) commented Feb 28, 2019

The error gets propagated and we retry consuming until it works.

Maybe a general solution is to only fetch offsets from the group coordinator, but I'm not really sure.
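The retry-until-it-works approach described above can be sketched as a small caller-side loop. This is a minimal sketch of the pattern, not code from this PR; the consumer function and backoff value are illustrative:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errOffsetOutOfRange stands in for sarama.ErrOffsetOutOfRange.
var errOffsetOutOfRange = errors.New("offset out of range")

// makeConsumer returns a stand-in consume attempt that fails twice (the
// replica is still catching up) and then succeeds.
func makeConsumer() func() error {
	attempts := 0
	return func() error {
		attempts++
		if attempts < 3 {
			return errOffsetOutOfRange
		}
		return nil
	}
}

// retryConsume retries while the error is out-of-range, on the assumption
// that the condition is transient and clears once replication catches up.
// Any other error is returned to the caller immediately.
func retryConsume(consume func() error, backoff time.Duration) (int, error) {
	for attempt := 1; ; attempt++ {
		err := consume()
		if err == nil {
			return attempt, nil
		}
		if !errors.Is(err, errOffsetOutOfRange) {
			return attempt, err // unexpected error: give up
		}
		time.Sleep(backoff)
	}
}

func main() {
	attempts, err := retryConsume(makeConsumer(), time.Millisecond)
	fmt.Println("succeeded after attempts:", attempts, "err:", err)
}
```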

@dim (Contributor) commented Feb 28, 2019

@muirdm the reason we added this in the first place was an endless loop of ErrOffsetOutOfRange errors when a high-volume partition (with a very low TTL) was pruned and the stored offset was < the first available offset. I think your patch is safe to apply, but I would even go a bit further and make ResetInvalidOffsets false by default.
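The failure mode described here is the opposite of the under-replication case: after pruning, the stored offset falls below the oldest retained offset, so the error is permanent and retrying alone would loop forever. A minimal sketch (names are illustrative) of why the two cases call for different handling:

```go
package main

import "fmt"

// offsetStatus classifies a stored offset against what the broker retains.
// A pruned partition leaves the stored offset permanently too old, while
// under-replication leaves it transiently too new.
func offsetStatus(stored, oldest, newest int64) string {
	switch {
	case stored < oldest:
		return "permanently invalid: data pruned, retrying will never help"
	case stored > newest:
		return "transiently invalid: replica behind, retrying can succeed"
	default:
		return "valid"
	}
}

func main() {
	// Low-TTL partition: offsets below 100 were pruned, broker retains 100-500.
	fmt.Println(offsetStatus(50, 100, 500))
	// Under-replicated follower: group committed 100, replica has up to 99.
	fmt.Println(offsetStatus(100, 0, 99))
}
```

This is why an automatic reset exists at all: in the pruned case, resetting to the oldest available offset is the only way to make progress.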

@ghost commented Feb 21, 2020

Thank you for your contribution! However, this pull request has not had any activity in the past 90 days and will be closed in 30 days if no updates occur.
If you believe the changes are still valid then please verify your branch has no conflicts with master and rebase if needed. If you are awaiting a (re-)review then please let us know.

@ghost added the stale label Feb 21, 2020
@muirdm (Author) commented Feb 24, 2020

@dim are you still in favor of changing the default to false and proceeding with this change? I don't have a good understanding of how this might affect other users, so I was trying to be conservative.

@ghost removed the stale label Feb 24, 2020
@dim (Contributor) commented Mar 31, 2020

@muirdm yes, I still think that false may be a better default. I doubt it will affect anyone TBH. Apologies for the delayed response.

@ghost commented Mar 16, 2021

Thank you for your contribution! However, this pull request has not had any activity in the past 90 days and will be closed in 30 days if no updates occur.
If you believe the changes are still valid then please verify your branch has no conflicts with master and rebase if needed. If you are awaiting a (re-)review then please let us know.

@ghost added the stale label Mar 16, 2021
@bai closed this Sep 13, 2021
Successfully merging this pull request may close these issues.

Consumer group offset can reset during rebalance if underreplicated