
Prevent multiple attempts to publish segments for the same sequence #14995

Merged

Conversation

AmatyaAvadhanula
Contributor

Description

Fixes an issue where a race can cause a second attempt to publish segments for the same sequence.
During the second publish, the segments have already been published and are no longer associated with the sequence.
This causes the tasks to fail with an error like:
org.apache.druid.java.util.common.ISE: Attempting to publish with empty segment set, but total row count was not 0: [165263].

Observations

The following sequence of events can cause the error:

  1. Submit a publish request for all segments associated with sequence id x.
  2. Iterate over the same sequence x in maybePersistAndPublishSequences.
  3. The publish request from step 1 succeeds, and its callback clears the segments and other state associated with sequence x.
  4. Since sequence x is no longer in the set of publishing sequences, it is considered for publish again.
  5. No segments are associated with sequence x, and new data has been appended for sequence x + 1.
  6. The publish attempted in step 4 fails because its segment set is empty while the appenderator has a non-zero row count, causing the above error.
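The interleaving above boils down to a check-then-act on the set of publishing sequences. A minimal sketch of the unsafe check (hypothetical names, not the actual Druid task code):

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: the eligibility check below is check-then-act, so a
// concurrent removal from publishingSequences between the first publish
// request and this check re-admits the same sequence for publishing.
public class RacyCheck {
    // Returns true if the sequence would be picked up for (re)publishing.
    public static boolean wouldPublishAgain(Set<String> publishingSequences, String sequence) {
        return !publishingSequences.contains(sequence);
    }

    public static void main(String[] args) {
        Set<String> publishing = new HashSet<>();
        publishing.add("seq_x");      // step 1: first publish in flight
        publishing.remove("seq_x");   // step 3: success callback clears it
        // steps 4-6: the iterating thread now sees seq_x as eligible again,
        // even though its segments are already published and gone.
        System.out.println(wouldPublishAgain(publishing, "seq_x")); // prints "true"
    }
}
```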

Fix

When iterating over the sequences to check whether they need to be published, maintain a snapshot of the publishing sequences taken before the iteration, so that a removal by a successful publish's callback cannot trigger the race.
Tested on a local cluster by setting thread-level breakpoints.
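The snapshot approach can be sketched as follows (illustrative names, assuming the add-only copy described later in the review thread, not the actual Druid task code):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the snapshot guard: the iterating thread consults a copy of
// publishingSequences taken before the loop, so a concurrent removal by the
// success callback cannot make an already-published sequence look eligible.
public class SnapshotGuard {
    public static Set<String> snapshot(Set<String> publishingSequences) {
        return new HashSet<>(publishingSequences);
    }

    public static void main(String[] args) {
        Set<String> publishing = new HashSet<>();
        publishing.add("seq_x");

        // Taken before iterating over the sequences.
        Set<String> publishingSnapshot = snapshot(publishing);

        // The success callback removes seq_x from the live set...
        publishing.remove("seq_x");

        // ...but the snapshot still records it, so the second publish is skipped.
        System.out.println(publishingSnapshot.contains("seq_x")); // prints "true"
    }
}
```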

Release note



This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@AmatyaAvadhanula AmatyaAvadhanula marked this pull request as ready for review September 15, 2023 15:31
@kfaraz
Contributor

kfaraz commented Sep 21, 2023

@AmatyaAvadhanula , could you please add some tests to verify the fixed behaviour?

// The callback for a successful segment publish may remove a sequence from the publishingSequences,
// which is racy and can allow the same sequence to be added to the set again.
// Create a copy of publishing sequences to which we can only add elements, and not remove them.
final Set<String> publishingSequencesSnapshot = new HashSet<>(publishingSequences);
Contributor


I think the new code will be simpler if we snapshot both sequences and publishingSequences and then create a third set which contains (sequencesSnapshot - publishingSequencesSnapshot) and iterate over that instead of sequencesSnapshot.

Also, I wonder if there can still be duplicates as it really depends on the timing of snapshotting publishingSequences. Would a better solution be to track publishedSequences?

Contributor Author


I think the new code will be simpler if we snapshot both sequences and publishingSequences and then create a third set which contains (sequencesSnapshot - publishingSequencesSnapshot) and iterate over that instead of sequencesSnapshot

As discussed offline, this is not being done

Contributor Author


Also, I wonder if there can still be duplicates as it really depends on the timing of snapshotting publishingSequences. Would a better solution be to track publishedSequences?

I think that this may not happen since the removal happens from sequences and then from publishingSequences, so this order may help. Although it doesn't seem ideal.

Thanks, I'll try to track publishedSequences instead.
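Tracking published sequences, as suggested, can be sketched like this (hypothetical names; a grow-only set whose membership test and insertion happen in one step, so timing of a snapshot no longer matters):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the publishedSequences alternative: every sequence whose publish
// has been requested is recorded in a set that only grows. Set.add returns
// false for an element already present, so the first-time check and the
// insertion are a single operation rather than check-then-act.
public class PublishedSequenceTracker {
    private final Set<String> publishedSequences = new HashSet<>();

    // Returns true the first time a sequence is seen, false on any repeat.
    public boolean tryMarkForPublish(String sequence) {
        return publishedSequences.add(sequence);
    }

    public static void main(String[] args) {
        PublishedSequenceTracker tracker = new PublishedSequenceTracker();
        System.out.println(tracker.tryMarkForPublish("seq_x")); // prints "true"
        System.out.println(tracker.tryMarkForPublish("seq_x")); // prints "false"
    }
}
```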

Contributor Author


Done

Contributor

@kfaraz kfaraz left a comment


LGTM.

@AmatyaAvadhanula , I assume the set of published sequences will never be too large to cause any memory concerns for the task, right?

Comment on lines 1163 to 1164
if (!sequenceMetadata.isOpen() &&
!publishingSequences.contains(sequenceMetadata.getSequenceName())
Contributor


Style: symmetry with the next condition.

Suggested change
if (!sequenceMetadata.isOpen() &&
!publishingSequences.contains(sequenceMetadata.getSequenceName())
if (!sequenceMetadata.isOpen()
&& !publishingSequences.contains(sequenceMetadata.getSequenceName())

Contributor Author


Done

@AmatyaAvadhanula
Contributor Author

I assume the set of published sequences will never be too large to cause any memory concerns for the task, right?

Yes, @kfaraz

@AmatyaAvadhanula AmatyaAvadhanula merged commit cdc192d into apache:master Nov 16, 2023
CaseyPan pushed a commit to CaseyPan/druid that referenced this pull request Nov 17, 2023

Prevent a race that may cause multiple attempts to publish segments for the same sequence (apache#14995)

writer-jill pushed a commit to writer-jill/druid that referenced this pull request Nov 20, 2023

Prevent a race that may cause multiple attempts to publish segments for the same sequence (apache#14995)

yashdeep97 pushed a commit to yashdeep97/druid that referenced this pull request Dec 1, 2023

Prevent a race that may cause multiple attempts to publish segments for the same sequence (apache#14995)
@LakshSingla LakshSingla added this to the 29.0.0 milestone Jan 29, 2024
Pankaj260100 pushed a commit to confluentinc/druid that referenced this pull request Feb 23, 2024

Prevent a race that may cause multiple attempts to publish segments for the same sequence (apache#14995)