Potential race condition between compactor applying retention or compaction and store gateway syncing metas. #564

Closed
bwplotka opened this issue Oct 10, 2018 · 9 comments

Comments

@bwplotka
Member

bwplotka commented Oct 10, 2018

With compaction or retention logic we have one "writer" that creates new blocks (compaction) and deletes the blocks that were its sources.

The problem with our readers (store) is that syncing happens periodically, every X seconds. So it might happen that we query the store right after the compactor removed a block, but the store has not synced yet. There is no watch logic for the Bucket API.

The simplest solution is to defer deletion to some time in the future, to accommodate the eventual consistency of the Store Gateway's internal state (and potentially of the bucket itself).
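
A minimal sketch of the deferred-deletion idea, using hypothetical Block and Bucket types and a made-up deleteDelay constant (not the actual Thanos types or implementation):

```go
package compactsketch

import "time"

// Bucket is a minimal stand-in for an object storage client.
type Bucket interface {
	Delete(id string) error
}

// Block is a hypothetical, simplified block descriptor.
type Block struct {
	ID       string
	MarkedAt time.Time // Zero value: not marked for deletion yet.
}

// deleteDelay is how long a block stays marked before it is actually removed,
// giving readers at least one sync period to load the new compacted block.
const deleteDelay = 48 * time.Hour

// MarkForDeletion records the intent to delete instead of deleting right away.
func MarkForDeletion(b *Block, now time.Time) {
	if b.MarkedAt.IsZero() {
		b.MarkedAt = now
	}
}

// GarbageCollect deletes only blocks whose mark is older than deleteDelay.
func GarbageCollect(bkt Bucket, blocks []*Block, now time.Time) error {
	for _, b := range blocks {
		if b.MarkedAt.IsZero() || now.Sub(b.MarkedAt) < deleteDelay {
			continue
		}
		if err := bkt.Delete(b.ID); err != nil {
			return err
		}
	}
	return nil
}
```

The compaction step would call MarkForDeletion on its source blocks, and a separate garbage-collection pass would delete them only after the delay has passed.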

Acceptance criteria:

  • Store Gateway will have loaded the new block before the old compaction source blocks are deleted.
  • Store Gateway handles a partially deleted, empty or non-existent TSDB block, despite having a cached meta-index file (see the sketch after this list).
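
To illustrate the second criterion, a rough sketch of a tolerant meta-sync loop, with a hypothetical errNotFound value and metaFetcher function standing in for the real bucket client (not the actual Store Gateway code):

```go
package storesyncsketch

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the "object does not exist" error returned by the
// bucket client; hypothetical, not a real objstore error value.
var errNotFound = errors.New("object not found")

// metaFetcher stands in for whatever downloads meta.json for a block ID.
type metaFetcher func(id string) (meta string, err error)

// SyncMetas refreshes the reader's cached metas. A block that disappeared
// between listing and fetching (for example, deleted by the compactor) is
// dropped from the cache instead of failing the whole sync.
func SyncMetas(ids []string, fetch metaFetcher, cache map[string]string) error {
	seen := map[string]bool{}
	for _, id := range ids {
		m, err := fetch(id)
		if errors.Is(err, errNotFound) {
			continue // Partially deleted or already gone: tolerate it.
		}
		if err != nil {
			return fmt.Errorf("fetch meta for %s: %w", id, err)
		}
		cache[id] = m
		seen[id] = true
	}
	// Drop cached metas for blocks no longer present in the bucket.
	for id := range cache {
		if !seen[id] {
			delete(cache, id)
		}
	}
	return nil
}
```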

This unfortunately requires heavy modification of the compactor Plan logic to understand edge cases like the following:

  • The compactor creates a new block, defers the source block deletions, and suddenly stops working.
    • After restart, the compactor needs to detect that the block overlap is caused by a deferred deletion (see the sketch below).
    • What if the bucket shows the newly created block as partial (eventual consistency)? What if the compactor crashed before the upload completed? How do we differentiate those two?
    • What if the deferred deletion fails in between?
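
For the overlap edge case, the planner could be taught to ignore blocks already marked for deferred deletion, so the overlap they cause is expected rather than treated as corruption. A sketch with hypothetical types (same shape as the Block above, not the real Plan code):

```go
package plansketch

import "time"

// blockMeta is a hypothetical, simplified block descriptor used for planning.
type blockMeta struct {
	ID               string
	MinTime, MaxTime int64     // Time range covered by the block.
	MarkedAt         time.Time // Zero if not marked for deferred deletion.
}

// filterForPlanning drops blocks that are already marked for deletion so the
// planner neither treats the resulting time-range overlap as corruption nor
// picks those blocks as compaction sources again after a restart.
func filterForPlanning(blocks []blockMeta) []blockMeta {
	var out []blockMeta
	for _, b := range blocks {
		if !b.MarkedAt.IsZero() {
			continue
		}
		out = append(out, b)
	}
	return out
}
```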

This has to be done and is planned to be done by EOY.

@bwplotka bwplotka added the bug label Oct 10, 2018
@bwplotka bwplotka changed the title from "Potential race condidtion between compactor applying retention and store gateway syncing metas." to "Potential race condition between compactor applying retention and store gateway syncing metas." Oct 10, 2018
@bwplotka bwplotka changed the title from "Potential race condition between compactor applying retention and store gateway syncing metas." to "Potential race condition between compactor applying retention or compaction and store gateway syncing metas." Dec 11, 2018
@bwplotka
Member Author

This will be fixed by #1528. Help wanted (:

@Reamer

Reamer commented Nov 25, 2019

Why don't we trigger a resync with the remote in the reader component (Thanos Store) if a block (deleted by the compactor) isn't found?

@bwplotka
Member Author

bwplotka commented Dec 13, 2019

@Reamer

We can, but the sync still takes time, so ideally we do it in a smarter way, as defined in #1528.

Also, a trigger would involve coordination, which we don't want.
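
For context, roughly what a query-triggered resync could look like: the query path flags a missing block and the periodic sync loop picks it up early. This is a hypothetical sketch of the idea above, not what Thanos does; it only narrows the race window and adds exactly the kind of coordination mentioned.

```go
package resyncsketch

import "sync/atomic"

// resyncTrigger lets the query path request an early sync when it hits a
// block that is missing from the bucket.
type resyncTrigger struct {
	requested int32
}

// Request is called by the query path on a "block not found" error.
func (t *resyncTrigger) Request() { atomic.StoreInt32(&t.requested, 1) }

// TakeRequest is polled by the periodic sync loop; it returns true at most
// once per request and resets the flag.
func (t *resyncTrigger) TakeRequest() bool {
	return atomic.SwapInt32(&t.requested, 0) == 1
}
```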

@stale

stale bot commented Feb 6, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Feb 6, 2020
@bwplotka
Member Author

bwplotka commented Feb 6, 2020

@khyatisoneji is on it (:

@stale

stale bot commented Mar 7, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Mar 7, 2020
@stale stale bot closed this as completed Mar 14, 2020
@bwplotka
Member Author

We are super close to merging the fix! But it's not fixed yet.

@bwplotka bwplotka reopened this Mar 14, 2020
@stale stale bot removed the stale label Mar 14, 2020
@bwplotka
Member Author

Fixed by #2136

@stale

stale bot commented Apr 16, 2020

This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.

@stale stale bot added the stale label Apr 16, 2020
@bwplotka bwplotka removed the stale label Apr 16, 2020