Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: allow kernel to read tables with invalid _last_checkpoint #311

Merged

Conversation

zachschuermann
Copy link
Collaborator

Previously, read_last_checkpoint would return Err(<json error>) if json parsing of the _last_checkpoint file failed. This caused the upstream call to fail due to propagation of the error. Instead we now return None from read_last_checkpoint in this case - similar to what is returned if the _last_checkpoint is not found. This allows log replay to continue without the hint and read the table.

Additionally, this unblocks the corrupted-last-checkpoint-kernel golden table test.

Copy link

codecov bot commented Aug 8, 2024

Codecov Report

Attention: Patch coverage is 92.68293% with 3 lines in your changes missing coverage. Please review.

Project coverage is 72.63%. Comparing base (3143264) to head (de18057).

Files Patch % Lines
kernel/src/snapshot.rs 92.68% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #311      +/-   ##
==========================================
+ Coverage   72.51%   72.63%   +0.11%     
==========================================
  Files          43       43              
  Lines        7783     7823      +40     
  Branches     7783     7823      +40     
==========================================
+ Hits         5644     5682      +38     
- Misses       1768     1770       +2     
  Partials      371      371              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

/// is invalid JSON. Unexpected/unrecoverable errors are returned as `Err` case and are assumed to
/// cause failure.
///
/// TODO: java kernel retries three times before failing, should we do the same?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my initial instinct would be, that if retry made sense, then on io errors when trying to read the file. I would expect this to be the responsibility of of the engine inside read_files. This I think is is good as is. OR do we get errors back, that we think are retryable?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it just assumed the read is happening over the network and isn't local? If it's local it wouldn't make sense to retry, but for a network read obviously it makes more sense

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my initial reaction is also to leave this to the engine. I don't think there is anything specific about reading the _last_checkpoint file and if failures occur across the network it should be handled the same for all our reads and taken care of within read_files

kernel/src/snapshot.rs Outdated Show resolved Hide resolved
let path = Path::from("valid/_last_checkpoint");
let invalid_path = Path::from("invalid/_last_checkpoint");

tokio::runtime::Runtime::new()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can all of this be replace with a #[tokio::test] instead?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea considered doing that but figured this called out async a little better so kept it as-is. I would prefer to have these be sync tests but I think we are forced to async with object_store..

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does calling the async out here provide some value to someone trying to understand the test? Assuming they know rust async, doing the runtime plumbing just seems like boiler plate right?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably not. I would like to think about how to isolate the async code (really just needed for setup) so this is more of a TODO

@zachschuermann zachschuermann requested a review from hntd187 August 12, 2024 16:58
@zachschuermann zachschuermann merged commit b2bb39a into delta-io:main Aug 13, 2024
12 checks passed
@zachschuermann zachschuermann deleted the invalid-last-checkpoint-fix branch August 13, 2024 18:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants