Add snapshot testing to CLI & set up AWS mock #13672
base: main
# Conflicts: # datafusion-cli/Cargo.lock
This PR is ready-ish for review, but I want to merge #13576 first so that this one has a smaller diff.
# Conflicts: # datafusion-cli/Cargo.lock
.github/workflows/rust.yml
Outdated
- name: Setup Minio - S3-compatible storage
  working-directory: datafusion-cli
  run:
    echo "MINIO_CONTAINER=$(docker run -d -p 9000:9000 -e MINIO_ROOT_USER=TEST-DataFusionLogin -e MINIO_ROOT_PASSWORD=TEST-DataFusionPassword quay.io/minio/minio server /data)" >> $GITHUB_ENV
- name: Run tests (excluding doctests, but with integration tests)
  working-directory: datafusion-cli
  run: cargo test --lib --tests --bins --all-features
@alamb, you suggested copying the object_store approach for testing.
In object_store, they use Localstack for S3 simulation. It works fine for testing, but the problem is that it doesn't actually validate the credentials.
In another part of object_store, Minio is used, and it does validate credentials. So, I think we should switch to using Minio for testing here.
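To illustrate why credential validation matters for these tests, here is a minimal, dependency-free sketch. The `Backend` type, its two variants, and the `cars.csv` object name are hypothetical stand-ins invented for illustration; they are not the Localstack or Minio APIs. One variant accepts any credentials, the other checks them against the configured ones.

```rust
/// Credentials a client sends with a request.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Credentials {
    access_key: String,
    secret_key: String,
}

/// Hypothetical backend behaviours (assumption: not real Localstack/Minio APIs).
enum Backend {
    /// Accepts every request regardless of the credentials supplied.
    Permissive,
    /// Accepts only requests whose credentials match the configured ones.
    Validating(Credentials),
}

impl Backend {
    /// Lists the objects in a made-up bucket, or fails if access is denied.
    fn list_bucket(&self, creds: &Credentials) -> Result<Vec<&'static str>, String> {
        match self {
            Backend::Permissive => Ok(vec!["cars.csv"]),
            Backend::Validating(expected) if expected == creds => Ok(vec!["cars.csv"]),
            Backend::Validating(_) => Err("access denied".to_string()),
        }
    }
}
```

With the permissive variant, a test that passes a wrong secret still succeeds, so a bug in how the CLI forwards credentials would go unnoticed. With the validating variant, the same test fails.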
# Conflicts: # datafusion/common/src/config.rs # datafusion/core/tests/config_from_env.rs # datafusion/sqllogictest/test_files/information_schema.slt # docs/source/user-guide/configs.md
Thank you @blaginin -- I think this is a great idea ❤️
I left some comments -- let me know what you think.
I have two concerns with this PR:
- The number of dependencies added for tests that have to be run manually (aka everyone who checks out datafusion, including CI, will have to compile a bunch of new crates but likely won't run these tests)
- The test is not run automatically in CI, so someone has to remember to run it. I predict this means it will slowly bit rot (aka not get run, break, and no one will notice)
glob!("sql/*.sql", |path| {
    let input = fs::read_to_string(path).unwrap();
    assert_cmd_snapshot!(cli().pass_stdin(input))
FWIW we have had really nice luck in influxdb_iox using insta, and instead of using external snapshot testing we used inline snapshots (so the expected results are inline with the test code).
It looks like this:
fn test_union_not_nested() {
    let plan = Arc::new(UnionExec::new(vec![other_node()]));
    let opt = NestedUnion;
    insta::assert_yaml_snapshot!(
        OptimizationTest::new(plan, opt),
        @r#"
    input:
      - " UnionExec"
      - " EmptyExec"
    output:
      Ok:
        - " UnionExec"
        - " EmptyExec"
    "#
    );
}
Is that possible with assert_cmd_snapshot?
It is absolutely possible! My idea was to make it closer to slt and separate the code from the data. Setting up the suite is quite hard (we need to create a new user, upload a test file, etc.), but with the current approach, we only need to do it once rather than for every single test.
However, I'm happy to make changes if you'd prefer it to be inline?
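The external-snapshot layout described here (one `.sql` input per case, expected output stored next to it) can be sketched without any dependencies. This is not how insta is implemented; it only shows the idea. `run_cli` is a hypothetical stand-in for invoking the CLI with the query on stdin, and the `.snap` sibling-file naming is an assumption made for this sketch.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for running the CLI with `input` on stdin.
fn run_cli(input: &str) -> String {
    format!("echo: {}", input.trim())
}

/// Collects every `*.sql` file in `dir`, sorted for a deterministic order.
fn sql_files(dir: &Path) -> std::io::Result<Vec<PathBuf>> {
    let mut files: Vec<PathBuf> = fs::read_dir(dir)?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .filter(|p| p.extension().map_or(false, |ext| ext == "sql"))
        .collect();
    files.sort();
    Ok(files)
}

/// Compares each query's output with its sibling `.snap` file and
/// returns the paths of the queries whose output does not match.
fn check_snapshots(dir: &Path) -> std::io::Result<Vec<String>> {
    let mut failures = Vec::new();
    for path in sql_files(dir)? {
        let actual = run_cli(&fs::read_to_string(&path)?);
        // A missing snapshot file is treated as an empty expectation.
        let expected = fs::read_to_string(path.with_extension("snap")).unwrap_or_default();
        if actual != expected {
            failures.push(path.display().to_string());
        }
    }
    Ok(failures)
}
```

Adding a test case then means adding a file rather than writing Rust code, and any expensive setup can happen once before the loop.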
I personally prefer it to be inline, but perhaps we can do that as a follow-on PR
(the way we do this in influxdb_iox is that we have a test function that returns a string and then compare the string with assert_yaml_snapshot -- that way the setup code isn't replicated and we can still have the tests inline)
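A rough sketch of that pattern, under stated assumptions: the `Suite` type and its rows are hypothetical placeholders for the real setup (creating a user, uploading a test file), and plain string comparison stands in for insta so that the example has no dependencies. The shared state is built once, and each test only calls a helper that returns a string.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for the expensive, shared test setup.
struct Suite {
    rows: Vec<(&'static str, u32)>,
}

static SUITE: OnceLock<Suite> = OnceLock::new();

/// Builds the suite on first use and reuses it for every later call.
fn suite() -> &'static Suite {
    SUITE.get_or_init(|| Suite {
        rows: vec![("a", 1), ("b", 2)],
    })
}

/// Runs one made-up "query" against the shared suite and renders the
/// result as a string that a test can compare with an inline expectation.
fn run_query(min_value: u32) -> String {
    suite()
        .rows
        .iter()
        .filter(|(_, v)| *v >= min_value)
        .map(|(k, v)| format!("{k}={v}"))
        .collect::<Vec<_>>()
        .join("\n")
}
```

A test body is then a single line such as `assert_eq!(run_query(2), "b=2");`. With insta, the same check would presumably be written with an inline snapshot, for example `insta::assert_snapshot!(run_query(2), @"b=2");`.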
```shell
cargo install cargo-insta
cargo insta review
```
we use this in influxdb_iox and I find it super useful
#[case("nd-json")]
#[case("automatic")]
#[test]
fn test_cli_format<'a>(#[case] format: &'a str) {
I was wondering why we need all the new aws-sdk-s3 dependencies, and it looks like they are needed in order to programmatically set up the bucket.
I think minio also supports just serving the contents of directories as s3 files.
Did you consider configuring minio directly rather than using the AWS S3 SDK just to create a bucket?
I know it doesn't seem like a big deal, but this PR adds many dependencies (that is why Cargo.lock is so big) and we try to keep the dependency chain down as much as possible (it is already large).
I realize it is only for datafusion-cli dev, but that dev directly impacts maintenance (
I think minio also supports just serving the contents of directories as s3 files
I believe this feature got removed - minio/minio#15496 (comment)
However, I did something similar and removed the AWS SDK. Would this work?
Thank you for checking!! I actually think it does work in the CI (hence so many commits in this PR to make it work 😀) https://github.com/apache/datafusion/actions/runs/12435578988/job/34722247404?pr=13672
Thank you @blaginin
I think this is a nice improvement in testing.
Can you please try to remove the need for the 10s of new dependencies that come in with aws-sdk-s3?
Otherwise, from my perspective this PR is ready to go.
# Conflicts: # .github/workflows/rust.yml
Sorry, was on holidays. Will remove.
Which issue does this PR close?
Related to #13456 (comment)
Rationale for this change
We currently don't test whether our external integrations actually work: as a result #13576 has happened.
What changes are included in this PR?
Integration tests for S3.
Are there any user-facing changes?
No.