Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check pointing Leader Scheduler State #5352

Merged
merged 15 commits into from
Jan 28, 2025

Conversation

san81
Copy link
Collaborator

@san81 san81 commented Jan 23, 2025

Description

When crawling through large Jira site, initial crawl might take more time than the Leader's lease time and it could expire before leader completes the crawling action. This fix is to check point frequently while crawling is in progress, which renews the lease as well as saves intermediate leader crawling state.

Issues Resolved

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

sb2k16
sb2k16 previously approved these changes Jan 27, 2025
long updatedAtMillis = Long.parseLong((String) this.metadata.getOrDefault(Constants.UPDATED, "0"));
long createdAtMillis = Long.parseLong((String) this.metadata.getOrDefault(Constants.CREATED, "0"));
long updatedAtMillis = getMetadataField(Constants.UPDATED);
long createdAtMillis = getMetadataField(CREATED);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Constants.CREATED

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed it

graytaylor0
graytaylor0 previously approved these changes Jan 27, 2025
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.when;
import static org.mockito.internal.verification.VerificationModeFactory.times;

@ExtendWith(MockitoExtension.class)
public class CrawlerTest {
private static final int DEFAULT_BATCH_SIZE = 50;
Instant lastPollTime = Instant.ofEpochMilli(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to make this private

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this private now

Instant currentTimeInstance = Instant.now();
if (Duration.between(lastLeaderSavedInstant, currentTimeInstance).toMinutes() >= 1) {
// intermediate updates to master partition state
updateLeaderProgressState(leaderPartition, latestModifiedTime, coordinator);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just make sure that this won't block creation of any partitions by filtering them out based on the timestamp. You can be a little lenient if needed by making the "lastPollTime" from before the time that you actually start the crawling, since source coordination will dedupe. If there are no issues with this, then it is ideal since that means less calls to source coordination. You can start crawling from lastPollTime - someShortAmountOfTime if there's a chance of any race conditions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, at the end of the loop, I am checkpointing with the "lastPollTime" before the crawl start. lastPollTime - someShortAmountOfTime looks like a good idea for the next crawling. I will see the impact of this change in the JQL and play with it to see if we can go with that choice.

Copy link
Member

@Galactus22625 Galactus22625 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good I think, but I have one question about the lastModifiedTime logic

@@ -51,20 +60,40 @@ public Instant crawl(Instant lastPollTime,
continue;
}
itemInfoList.add(nextItem);
if (nextItem.getLastModifiedAt().isAfter(latestModifiedTime)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if an item that has already been added to the list gets modified and then we add a different item to the list that got modified later? Won't we end up skipping the update from the first item?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Max is not impacting the jql query. The max is only for intermediate check pointing.

chenqi0805
chenqi0805 previously approved these changes Jan 27, 2025
Copy link
Member

@Galactus22625 Galactus22625 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hooray

Signed-off-by: Santhosh Gandhe <[email protected]>
@san81 san81 dismissed stale reviews from chenqi0805, graytaylor0, and sb2k16 via e2fd7be January 27, 2025 23:50
@san81 san81 merged commit 440fb2d into opensearch-project:main Jan 28, 2025
43 of 47 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants