Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: list log files with checkpoint when version is None #312

Merged
merged 5 commits into from
Aug 14, 2024

Conversation

mervynzhang
Copy link
Contributor

@mervynzhang mervynzhang commented Aug 9, 2024

Previously, if read_last_checkpoint returned a valid _last_checkpoint hint but the version requested was None (i.e. requesting the lastest version) we would still list_log_files without listing from the checkpoint hint. This change will detect when we have a valid _last_checkpoint and request the latest version and instead list_log_files_with_checkpoint

A side effect, we will get following error when load a table via s3proxy with a lot log files (more than 1000), but the same issue doesn't occur on delta-rs, after some troubleshooting, the error can be fixed by add this line from delta-rs

https://github.com/delta-io/delta-rs/blob/main/crates/core/src/kernel/snapshot/log_segment.rs#L98

ObjectStore(
    Generic {
        store: "S3",
        source: ListRequest {
            source: Client {
                status: 400,
                body: Some(
                    "<?xml version='1.0' encoding='UTF-8'?><Error><Code>InvalidArgument</Code><Message>Bad Request</Message><RequestId>4442587FB7D0A2F9</RequestId></Error>",
                ),
            },
        },
    },
)

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mervynhang thanks for the PR! It does look like we aren't reading from the last checkpoint when asking for the latest version - thanks for fixing. Curious though if there is some constraint on the LIST API that's causing the 400? we should still be able to read the table without the checkpoint

@mervynzhang
Copy link
Contributor Author

mervynzhang commented Aug 9, 2024

Hi @zachschuermann ,

We have to use s3proxy in our environment, s3proxy require that continuation-token and start-after cannot be both set

https://github.com/gaul/s3proxy/blob/4976e170c3c068275201a79d329f953f685fb808/src/main/java/org/gaul/s3proxy/S3ProxyHandler.java#L1389C17-L1389C35

delta-kernel-rs will set start-after
https://github.com/delta-incubator/delta-kernel-rs/blob/main/kernel/src/snapshot.rs#L367

but when there are more than 1000 items, continuation-token will be set automatically and s3proxy will return error.
https://github.com/apache/arrow-rs/blob/3e02689e3464bc8cf929a0d116888fb6f59999fa/object_store/src/aws/client.rs#L726

@zachschuermann
Copy link
Collaborator

Ah great, thanks for the context. So, put another way, attempting to list > 1000 items isn't supported for ObjectStore on S3 with s3proxy? Is that a known issue?

@zachschuermann
Copy link
Collaborator

(the fix here is a useful optimization but more of a workaround for the underlying issue in ObjectStore/S3proxy it seems)

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and we will also need to disable some tests in golden_tests.rs (now that checkpoint reading is kicking in like it should): L411-412

- golden_test!("v2-checkpoint-json", latest_snapshot_test); // passing without v2 checkpoint support
- golden_test!("v2-checkpoint-parquet", latest_snapshot_test); // passing without v2 checkpoint support
+ skip_test!("v2-checkpoint-json": "v2 checkpoint not supported");
+ skip_test!("v2-checkpoint-parquet": "v2 checkpoint not supported");

kernel/src/snapshot.rs Outdated Show resolved Hide resolved
Co-authored-by: Zach Schuermann <[email protected]>
Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @mervynhang if we modify the golden_tables tests (in my comment above) we should be good to go!

@mervynzhang
Copy link
Contributor Author

Hi @mervynhang if we modify the golden_tables tests (in my comment above) we should be good to go!

Hi @zachschuermann golden_tables tests has been updated, please let me know if anything else is needed. Thanks

@zachschuermann zachschuermann changed the title match delta-rs to avoid an error fix: list log files with checkpoint when version is None Aug 13, 2024
Copy link

codecov bot commented Aug 13, 2024

Codecov Report

Attention: Patch coverage is 50.00000% with 1 line in your changes missing coverage. Please review.

Project coverage is 73.02%. Comparing base (b2bb39a) to head (f417837).
Report is 1 commits behind head on main.

Files Patch % Lines
kernel/src/snapshot.rs 50.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #312      +/-   ##
==========================================
+ Coverage   72.63%   73.02%   +0.39%     
==========================================
  Files          43       43              
  Lines        7823     7825       +2     
  Branches     7823     7825       +2     
==========================================
+ Hits         5682     5714      +32     
+ Misses       1770     1736      -34     
- Partials      371      375       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@zachschuermann
Copy link
Collaborator

could we update the PR description to reflect the changes being made? feel free to keep the original context but add something like:

previously, if read_last_checkpoint returned a valid _last_checkpoint hint but the version requested was None (i.e. requesting the lastest version) we would still list_log_files without listing from the checkpoint hint. This change will detect when we have a valid _last_checkpoint and request the latest version and instead list_log_files_with_checkpoint

@mervynzhang
Copy link
Contributor Author

Hi @zachschuermann , PR description updated. Thanks

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM I'll help take some of the testing in a follow-up PR!

Copy link
Collaborator

@roeap roeap left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@zachschuermann zachschuermann merged commit e4c07dd into delta-io:main Aug 14, 2024
11 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants