Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tablet server fault between file rename and GC candidate addition waste disk space #3811

Open
keith-turner opened this issue Oct 4, 2023 · 2 comments
Assignees
Labels
bug This issue has been verified to be a bug.
Milestone

Comments

@keith-turner
Copy link
Contributor

keith-turner commented Oct 4, 2023

Describe the bug

If a tablet server process dies or throws an exception between the following lines

  1. rename file
  2. add GC candidate and update tablet metadata

then a non temp file with no GC candidate entry exists and it will not be cleaned up by the GC process.

Expected behavior

unreferenced files are always cleaned up.

@keith-turner keith-turner added the bug This issue has been verified to be a bug. label Oct 4, 2023
@keith-turner
Copy link
Contributor Author

This issue was discovered through discussion in a meeting about #3802 and #3804. As a solution to this problem it was discussed that adding the GC candidate could be done prior to the rename, however that could conflict with fixing #3802. So need to consider #3802 when addressing this problem.

@ddanielr ddanielr self-assigned this Oct 11, 2023
@keith-turner
Copy link
Contributor Author

In the elasticity branch it may be possible to eventually address this problem by moving compaction commit into FATE. See #3559. If its a fate operation, then do not need to worry about failure between rename and addition of the candidate because it will rerun in that case.

keith-turner added a commit to keith-turner/accumulo that referenced this issue Dec 21, 2023
The custom refresh tracking code was removed and compaction commit was
moved to being a FATE operation with the following 4 steps.

 1. Rename file done in RenameCompactionFile class
 2. Update the metadata table via a conditional mutation done in
   CommitCompaction class
 3. Write the gc candidates done in PutGcCandidates class
 4. Optionally send a RPC refresh request if the tablet was hosted
    done in RefreshTablet class

There is some follow on work that still needs to be done to improve
how this change works with detecting dead compactions.  After that is
done these changes should address problems outlined apache#3811 and apache#3802
that were related to process death before adding GC candidates.  Now
that GC candidates are written in FATE, if it dies it will run again
later.

This is currently storing the compaction commit FATE operations in
zookeeper.  This would not be suitable for a cluster because per tablet
information should never be stored in zookeeper.  However its fine as a
temporary situation in the elasticity branch until FATE storage is
availabe in an accumulo table, see apache#4049 and apache#3559

WIP

WIP

WIP
keith-turner added a commit that referenced this issue Jan 13, 2024
The custom refresh tracking code was removed and compaction commit was
moved to being a FATE operation with the following 4 steps.

 1. Rename file done in RenameCompactionFile class
 2. Update the metadata table via a conditional mutation done in
   CommitCompaction class
 3. Write the gc candidates done in PutGcCandidates class
 4. Optionally send a RPC refresh request if the tablet was hosted
    done in RefreshTablet class

There is some follow on work that still needs to be done to improve
how this change works with detecting dead compactions.  After that is
done these changes should address problems outlined #3811 and #3802
that were related to process death before adding GC candidates.  Now
that GC candidates are written in FATE, if it dies it will run again
later.

This is currently storing the compaction commit FATE operations in
zookeeper.  This would not be suitable for a cluster because per tablet
information should never be stored in zookeeper.  However its fine as a
temporary situation in the elasticity branch until FATE storage is
availabe in an accumulo table, see #4049 and #3559
@ctubbsii ctubbsii added this to the 3.1.0 milestone Jul 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This issue has been verified to be a bug.
Projects
None yet
Development

No branches or pull requests

3 participants