Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Made imdb download (data_imdb) function atomic #14225

Merged

Conversation

Spaarsh
Copy link
Contributor

@Spaarsh Spaarsh commented Jan 21, 2025

Which issue does this PR close?

Closes #14128

Rationale for this change

Due to non-atomic downloads, the user would need to manually remove files/folders created by the script in order to not encounter the "file already exists" error.

What changes are included in this PR?

The changes primarily focus on adding size checks before initiating a download and after the completion of a download. If the size is mismatched, the partially downloaded file is then removed.

Are these changes tested?

Yes, these changes have been tested. The following are the results:

For the benchmarks/bench.sh data imdb command:

atomic_data_imdb

Are there any user-facing changes?

Only that users shall now see if the imdb.tgz has already been installed, its status (partial/full) and if there is a size mismatch. The user shall also be informed to re-run the download command in case of partial download despite no user-side interruptions.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @Spaarsh

benchmarks/bench.sh Show resolved Hide resolved
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Spaarsh

I tried this out locally and found that now the conversion code was not triggered.

I pushed a fix to your branch locally (moving where the fi was) as this PR has been hanging out for a while.

:~/Software/datafusion/benchmarks$ ./bench.sh data imdb
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: imdb
DATA_DIR: /Users/andrewlamb/Software/datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
Looking for imdb.tgz... found
Checking size... OK 1263193115
andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion/benchmarks$ echo "foo" > data/imdb/imdb.tgz
andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion/benchmarks$ ./bench.sh data imdb
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: imdb
DATA_DIR: /Users/andrewlamb/Software/datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
Looking for imdb.tgz... found
Checking size... MISMATCH
Size less than expected: 4 found, 1263193115 required
Downloading IMDB dataset...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

@Spaarsh
Copy link
Contributor Author

Spaarsh commented Jan 23, 2025

Thank you @Spaarsh

I tried this out locally and found that now the conversion code was not triggered.

I pushed a fix to your branch locally (moving where the fi was) as this PR has been hanging out for a while.

:~/Software/datafusion/benchmarks$ ./bench.sh data imdb
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: imdb
DATA_DIR: /Users/andrewlamb/Software/datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
Looking for imdb.tgz... found
Checking size... OK 1263193115
andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion/benchmarks$ echo "foo" > data/imdb/imdb.tgz
andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion/benchmarks$ ./bench.sh data imdb
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: imdb
DATA_DIR: /Users/andrewlamb/Software/datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
Looking for imdb.tgz... found
Checking size... MISMATCH
Size less than expected: 4 found, 1263193115 required
Downloading IMDB dataset...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

Oh thanks! Sorry for the oversight. I will be more careful from now on. Thanks again! You've been extremely patient!!

@alamb
Copy link
Contributor

alamb commented Jan 23, 2025

Oh thanks! Sorry for the oversight. I will be more careful from now on. Thanks again! You've been extremely patient!!

No worries -- I am sorry we have had so many back and forths. I think this is looking nice now and thank you again

@alamb alamb merged commit 72c0df4 into apache:main Jan 23, 2025
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Making Downloads Atomic in bench.sh
2 participants