-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Made imdb download (data_imdb) function atomic #14225
Made imdb download (data_imdb) function atomic #14225
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this contribution @Spaarsh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Spaarsh
I tried this out locally and found that now the conversion code was not triggered.
I pushed a fix to your branch locally (moving where the fi
was) as this PR has been hanging out for a while.
:~/Software/datafusion/benchmarks$ ./bench.sh data imdb
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: imdb
DATA_DIR: /Users/andrewlamb/Software/datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
Looking for imdb.tgz... found
Checking size... OK 1263193115
andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion/benchmarks$ echo "foo" > data/imdb/imdb.tgz
andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion/benchmarks$ ./bench.sh data imdb
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: imdb
DATA_DIR: /Users/andrewlamb/Software/datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
Looking for imdb.tgz... found
Checking size... MISMATCH
Size less than expected: 4 found, 1263193115 required
Downloading IMDB dataset...
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
Oh thanks! Sorry for the oversight. I will be more careful from now on. Thanks again! You've been extremely patient!! |
No worries -- I am sorry we have had so many back and forths. I think this is looking nice now and thank you again |
Which issue does this PR close?
Closes #14128
Rationale for this change
Due to non-atomic downloads, the user would need to manually remove files/folders created by the script in order to not encounter the "file already exists" error.
What changes are included in this PR?
The changes primarily focus on adding size checks before initiating a download and after the completion of a download. If the size is mismatched, the partially downloaded file is then removed.
Are these changes tested?
Yes, these changes have been tested. The following are the results:
For the
benchmarks/bench.sh data imdb
command:Are there any user-facing changes?
Only that users shall now see if the imdb.tgz has already been installed, its status (partial/full) and if there is a size mismatch. The user shall also be informed to re-run the download command in case of partial download despite no user-side interruptions.