Support buckets with different priorities #55

Open
simolus3 wants to merge 11 commits into base: main
Conversation

simolus3

No description provided.

@simolus3 (Author) left a comment

Adding some of my own comments for a discussion on this.

Review threads on:
crates/core/src/operations.rs
crates/core/src/sync_local.rs (multiple threads)
crates/core/src/view_admin.rs
crates/core/src/bucket_priority.rs
crates/core/src/checkpoint.rs
@rkistner (Contributor) commented on Feb 4, 2025

After checking the implementation, I realized that a row being present in multiple buckets with different priorities has a lot of potential for edge cases - both in the spec and in this specific implementation. This is particularly relevant for the r.bucket IN (SELECT id FROM involved_buckets) clause, for example.
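
For context, that clause scopes a partial sync roughly like this (a simplified sketch; the ps_buckets/ps_oplog names and the priority column are assumptions for illustration, not the exact schema):

```sql
-- Simplified sketch: limit a partial sync to buckets at or above the
-- completed priority level (schema names assumed for illustration).
WITH involved_buckets (id) AS (
  SELECT id FROM ps_buckets
  WHERE priority <= 1 -- priority level of the partial checkpoint
)
SELECT r.row_type, r.row_id, r.data
FROM ps_oplog r
WHERE r.bucket IN (SELECT id FROM involved_buckets);
```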

A potential use case could be one where a user has a lot of data and "stars" specific items to prioritize syncing them.

For these examples, suppose we have two buckets: bucket1 and bucket2, with priorities 1 and 2 respectively. The same row could be in either or both of the buckets.
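
As a concrete setup for the cases below (hypothetical local state; table and column names assumed):

```sql
-- Hypothetical bucket metadata for the examples below (schema assumed):
INSERT INTO ps_buckets (id, name, priority) VALUES
  (1, 'bucket1', 1), -- higher priority, completes first
  (2, 'bucket2', 2); -- lower priority, completes with the full checkpoint
```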

Current state

Case 1: A row is present in bucket2, then added to bucket1. We get a partial_checkpoint_complete for bucket1. The row will be included here, getting the latest version. ✅

Case 2: A row is removed from bucket2, and added to bucket1 at the same time ("starring" a row to move it to a higher priority). We get a partial_checkpoint_complete for bucket1. The row will be included here, and whether or not we got the REMOVE on bucket2 already doesn't make a difference. When we sync the rest of the checkpoint, the updated row will stay present. ✅

Case 3: A row is removed from bucket1, and added to bucket2 at the same time (removing a star to move to lower priority). We get a partial_checkpoint_complete for bucket1. We don't track removes per bucket, so this does nothing. ✅ When we sync the rest of the checkpoint, the row will be added again. ✅

Case 4: A row is present in bucket1 and bucket2, then removed from bucket1. We get a partial_checkpoint_complete for bucket1. We don't track removes per bucket, so this does nothing. ✅ When we sync the rest of the checkpoint, the row is updated with the state from bucket2. ✅
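
So under the current behavior, deletes effectively only happen at the full checkpoint, and only for rows that no bucket still has a PUT for - roughly like this (a sketch, with assumed names):

```sql
-- Sketch of the current full-checkpoint delete semantics (names assumed):
-- a row is only removed once no bucket in the checkpoint still has data
-- for it, which is why Cases 3 and 4 end in a consistent state.
DELETE FROM items
WHERE NOT EXISTS (
  SELECT 1 FROM ps_oplog o
  WHERE o.row_type = 'items' AND o.row_id = items.id
);
```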

Hypothetical: if we tracked REMOVE operations per bucket

Case 1: A row is present in bucket2, then added to bucket1. We get a partial_checkpoint_complete for bucket1. The row will be included here, getting the latest version. ✅

Case 2: A row is removed from bucket2, and added to bucket1 at the same time ("starring" a row to move it to a higher priority). We get a partial_checkpoint_complete for bucket1. The row will be included here, and whether or not we got the REMOVE on bucket2 already doesn't make a difference. When we sync the rest of the checkpoint, the updated row will stay present. ✅

Case 3: A row is removed from bucket1, and added to bucket2 at the same time (removing a star to move it to a lower priority). We get a partial_checkpoint_complete for bucket1. The row will be removed here, whether or not we got the PUT for bucket2. This works "according to the spec", but I'm not sure whether this is the desired behavior. ❓ When we sync the rest of the checkpoint, the row will be added again. ✅

Case 4: A row is present in bucket1 and bucket2, then removed from bucket1. We get a partial_checkpoint_complete for bucket1. The row will be removed here, despite it still being present in bucket2. ❓ When we sync the rest of the checkpoint, the row will not be added again, resulting in an inconsistency. ❌
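
To make Case 4 concrete: tracking REMOVEs per bucket would amount to something like the following at the bucket1 partial checkpoint (a hypothetical sketch; names assumed):

```sql
-- Hypothetical per-bucket REMOVE handling (names assumed). In Case 4 this
-- deletes the row at the bucket1 partial checkpoint even though bucket2
-- still has a PUT for it; since bucket2's data was already synced, the
-- rest of the checkpoint has no new operation to re-add the row.
DELETE FROM items
WHERE id IN (
  SELECT o.row_id
  FROM ps_oplog o
  JOIN ps_buckets b ON b.id = o.bucket
  WHERE b.name = 'bucket1' AND o.op = 'REMOVE'
);
```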

My summary from this is that adding support for REMOVE operations in partial checkpoints does not actually give the improved consistency I hoped for - it just creates more weird edge cases. The current behavior of only applying REMOVE operations in the final checkpoint gives better results.

@simolus3 marked this pull request as ready for review on February 10, 2025 at 15:24