fix: allow kernel to read tables with invalid _last_checkpoint #311

zachschuermann · 2024-08-08T19:53:28Z

Previously, read_last_checkpoint returned Err(<json error>) if JSON parsing of the _last_checkpoint file failed. The error propagated and caused the upstream call to fail. Now read_last_checkpoint returns None in this case, the same result as when the _last_checkpoint file is not found. This allows log replay to continue without the hint and read the table.

Additionally, this unblocks the corrupted-last-checkpoint-kernel golden table test.
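A minimal sketch of the behavior described above, not the code from this PR. It uses only the standard library, so the JSON deserialization is replaced by a toy parser. The names CheckpointHint, parse_hint, and read_last_checkpoint_sketch are hypothetical, and the use of std::io::Result for the read is an assumption made for illustration.

```rust
use std::io;

/// Hypothetical stand-in for the hint stored in `_last_checkpoint`.
#[derive(Debug, PartialEq)]
struct CheckpointHint {
    version: u64,
}

/// Toy parser standing in for real JSON deserialization (assumption):
/// it accepts only the exact shape `{"version":<n>}`.
fn parse_hint(raw: &str) -> Result<CheckpointHint, String> {
    let inner = raw
        .trim()
        .strip_prefix("{\"version\":")
        .and_then(|rest| rest.strip_suffix('}'))
        .ok_or_else(|| format!("malformed hint: {raw}"))?;
    inner
        .trim()
        .parse::<u64>()
        .map(|version| CheckpointHint { version })
        .map_err(|e| format!("bad version: {e}"))
}

/// A missing file and an unparseable file both yield `Ok(None)`, so the
/// caller can fall back to log replay without the hint. Any other read
/// error is still propagated as `Err`.
fn read_last_checkpoint_sketch(read: io::Result<String>) -> io::Result<Option<CheckpointHint>> {
    match read {
        Ok(raw) => match parse_hint(&raw) {
            Ok(hint) => Ok(Some(hint)),
            Err(e) => {
                // Invalid contents are logged and treated like a missing hint.
                eprintln!("warning: ignoring invalid _last_checkpoint: {e}");
                Ok(None)
            }
        },
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(None),
        Err(e) => Err(e),
    }
}
```

With this shape the caller only has to distinguish "hint available" from "no hint". A corrupted file no longer aborts the read, while unexpected errors still do.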

codecov · 2024-08-08T19:58:07Z

Codecov Report

Attention: Patch coverage is 92.68293% with 3 lines in your changes missing coverage. Please review.

Project coverage is 72.63%. Comparing base (3143264) to head (de18057).

Files	Patch %	Lines
kernel/src/snapshot.rs	92.68%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #311      +/-   ##
==========================================
+ Coverage   72.51%   72.63%   +0.11%     
==========================================
  Files          43       43              
  Lines        7783     7823      +40     
  Branches     7783     7823      +40     
==========================================
+ Hits         5644     5682      +38     
- Misses       1768     1770       +2     
  Partials      371      371

roeap · 2024-08-08T19:58:58Z

kernel/src/snapshot.rs

+/// is invalid JSON. Unexpected/unrecoverable errors are returned as `Err` case and are assumed to
+/// cause failure.
+///
+/// TODO: java kernel retries three times before failing, should we do the same?


My initial instinct is that if a retry makes sense, it would be for IO errors when trying to read the file, and I would expect that to be the responsibility of the engine inside read_files. I think this is good as is. Or do we get errors back that we think are retryable?

Is it just assumed the read is happening over the network and isn't local? If it's local it wouldn't make sense to retry, but for a network read it obviously makes more sense.

My initial reaction is also to leave this to the engine. I don't think there is anything specific about reading the _last_checkpoint file, and if failures occur across the network, they should be handled the same way for all our reads and taken care of within read_files.
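To make the "leave it to the engine" option concrete, here is a rough sketch of a bounded retry wrapper that an engine could apply to every read. It is not part of the kernel or of this PR. The names read_with_retry and is_retryable are hypothetical, and the set of error kinds treated as retryable is an assumption.

```rust
use std::io;

/// Assumption: only transient-looking error kinds are worth retrying.
fn is_retryable(e: &io::Error) -> bool {
    matches!(
        e.kind(),
        io::ErrorKind::TimedOut | io::ErrorKind::Interrupted | io::ErrorKind::ConnectionReset
    )
}

/// Calls `read` until it succeeds, fails with a non-retryable error, or
/// `max_attempts` attempts have been made. At least one attempt is always made.
fn read_with_retry<T>(max_attempts: u32, mut read: impl FnMut() -> io::Result<T>) -> io::Result<T> {
    let mut attempt = 1;
    loop {
        match read() {
            Ok(value) => return Ok(value),
            // Transient failure with attempts remaining: try again.
            Err(e) if attempt < max_attempts && is_retryable(&e) => attempt += 1,
            // Non-retryable failure or attempts exhausted: give up.
            Err(e) => return Err(e),
        }
    }
}
```

Because the wrapper sits around the read itself, the _last_checkpoint file gets the same treatment as any other file, and a local "not found" error is returned immediately without retrying.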

hntd187 · 2024-08-09T03:17:53Z

kernel/src/snapshot.rs

+        let path = Path::from("valid/_last_checkpoint");
+        let invalid_path = Path::from("invalid/_last_checkpoint");
+
+        tokio::runtime::Runtime::new()


Can all of this be replaced with a #[tokio::test] instead?

Yeah, I considered doing that but figured this called out the async a little better, so I kept it as-is. I would prefer to have these be sync tests, but I think we are forced into async with object_store.

Does calling out the async here provide some value to someone trying to understand the test? Assuming they know Rust async, doing the runtime plumbing just seems like boilerplate, right?

Probably not. I would like to think about how to isolate the async code (really just needed for setup), so this is more of a TODO.

return None for invalid checkpoint, allow log replay to continue

12921d7

zachschuermann requested review from nicklan, azdavis, hntd187 and roeap August 8, 2024 19:54

roeap approved these changes Aug 8, 2024

zachschuermann added 2 commits August 8, 2024 13:59

add test

8ad5576

fmt

c94584a

hntd187 reviewed Aug 9, 2024

kernel/src/snapshot.rs (outdated, resolved)

hntd187 reviewed Aug 9, 2024

kernel/src/snapshot.rs (resolved)

hntd187 reviewed Aug 9, 2024

inspect_err

de18057

zachschuermann requested a review from hntd187 August 12, 2024 16:58

hntd187 approved these changes Aug 12, 2024

zachschuermann merged commit b2bb39a into delta-io:main Aug 13, 2024
12 checks passed

zachschuermann deleted the invalid-last-checkpoint-fix branch August 13, 2024 18:08
