Making the data_imdb and clickbench_1 functions atomic. #14129
Conversation
Thank you, I have tried and it's working as expected.
I have just a nitpicking suggestion here: it would be good to also remove the dataset's parent directory
data/
-- imdb/ # This folder is not deleted
---- dataset.tgz # Now only this file is cleaned up
Even I was thinking the same! But I wanted to ensure first that the same directory isn't being used to store other files. I also planned to make the cleanup function a util function, so that it can be used wherever atomicity is needed. Would like your thoughts on this!
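For illustration, a minimal sketch of what such a reusable cleanup helper could look like (the name clean_up_dataset_dir and the way it is registered are my assumptions, not part of this PR):

# Hypothetical helper: roll back a partially created dataset directory.
clean_up_dataset_dir() {
    local dataset_dir="$1"
    echo "Cleaning up ${dataset_dir}..."
    rm -rf "${dataset_dir}"
}

# Assumed usage inside a data_* function, registered via trap:
# trap 'clean_up_dataset_dir "${imdb_dir}"' INT TERM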
I think the convention followed now is that the same subdirectory under
Thank you very much @Spaarsh and @2010YOUY01
I tested this locally and it seems to work for CTRL-C
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ ./benchmarks/bench.sh data clickbench_1
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: clickbench_1
DATA_DIR: /Users/andrewlamb/Software/datafusion2/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
Checking hits.parquet...... downloading https://datasets.clickhouse.com/hits_compatible/hits.parquet (14GB) ... --2025-01-17 12:59:34-- https://datasets.clickhouse.com/hits_compatible/hits.parquet
Resolving datasets.clickhouse.com (datasets.clickhouse.com)... 2606:4700:3108::ac42:2b07, 2606:4700:3108::ac42:28f9, 172.66.43.7, ...
Connecting to datasets.clickhouse.com (datasets.clickhouse.com)|2606:4700:3108::ac42:2b07|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14779976446 (14G) [binary/octet-stream]
Saving to: ‘hits.parquet’
hits.parquet 1%[> ] 145.14M 39.8MB/s eta 6m 11s ^C
Cleaning up downloaded files...
However, I think this solution has the downside that it will only fix partial downloads when the shell script is interrupted (i.e. hit Ctrl-C). It would still leave around invalid data if the script was SIGKILL'd (or maybe if the download timed out, I didn't check)
I am also worried about the complexity introduced by adding trap
I also think checking for partial download of clickbench is already done (by checking file size). Perhaps we can add a similar check for file size for the imdb.tgz file too
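A sketch of what that could look like, mirroring the existing hits.parquet check and using the imdb_dir / imdb_url variables from data_imdb (the expected size here is a placeholder, not a verified value):

# Sketch: verify imdb.tgz by size instead of just checking existence.
EXPECTED_IMDB_SIZE="<known imdb.tgz size in bytes>"   # placeholder; the real byte count would be hard coded
OUTPUT_SIZE=$(wc -c "${imdb_temp_gz}" 2>/dev/null | awk '{print $1}')
if test "${OUTPUT_SIZE}" = "${EXPECTED_IMDB_SIZE}"; then
    echo " imdb.tgz already downloaded"
else
    # Missing or partial file: resume the download into the imdb directory
    mkdir -p "${imdb_dir}"
    (cd "${imdb_dir}" && wget --continue "${imdb_url}")
fi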
wget --continue ${URL}
fi
echo " Done"
if ! wget --continue ${URL}; then
I think the check above tests for the file size and already detects partial / failed previous downloads
if test "${OUTPUT_SIZE}" = "14779976446"; then
So I am not sure this is necessary
# Downloads the csv.gz files IMDB datasets from Peter Boncz's homepage(one of the JOB paper authors)
# https://event.cwi.nl/da/job/imdb.tgz
data_imdb() {
    local imdb_dir="${DATA_DIR}/imdb"
    local imdb_temp_gz="${imdb_dir}/imdb.tgz"
    local imdb_url="https://event.cwi.nl/da/job/imdb.tgz"

    # Set trap with parameter
Perhaps we can add a check for file size of the imdb.tgz file rather than just checking for its existence
if [ ! -f "${imdb_dir}/imdb.tgz" ]; then
@alamb your suggestions do make sense. Perhaps all that is needed is a size checking function. My addition would be to dynamically check for the expected size of the file to be transferred by running the following command first:
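A minimal sketch of one way to query the size up front, assuming the value is read from the Content-Length header with curl (this exact command is my assumption):

# Sketch: ask the server for the expected size before downloading.
expected_size=$(curl -sI "${imdb_url}" | tr -d '\r' | awk 'tolower($1) == "content-length:" {print $2}')
echo "Server reports ${expected_size} bytes for imdb.tgz"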
And for cases where a 404 response is returned, we can check for a 404-type status and, again, purge the downloaded file. This can apply to other 4xx and 5xx responses as well. The above two points combined solve the problem of partial downloads (without relying on statically entered values) and of erroneous responses as well.
Since the file isn't expected to change, I would suggest simply hard coding the expected size in the bash script. As for 404 being returned, I don't understand why that would need special handling 🤔 If we had the hard-coded expected size and a 404 was returned, then the output file wouldn't be the right size, so running the script again would ensure it was correct 🤔
The special handling for 404 was for the case where we went with the dynamic size approach 😅. My only concern is that we've already faced one issue due to the website owners changing the URL. If they happen to change the dataset size as well, the static size will lead to unexpected behavior again. I'm just trying to cover all the bases so that no issue along these lines occurs again. Or perhaps I'm focusing too much on a simple script 😅. But if we go for the static size, the solution is pretty straightforward. Should I go forward with it then?
That is my suggestion. I agree it won't cover all the cases, but I think it will catch the common problems and it is quite simple.
BTW thank you very much for working on this
Agreed! I'll go forward with your suggested approach. Thanks for your patience!!
The suggested changes have been implemented and committed via another PR, #14225. Hence I am closing this PR.
Which issue does this PR close?
Closes #14128
Rationale for this change
Due to non-atomic downloads, the user would need to manually remove the files/folders created by the script in order to avoid the "file already exists" error on a subsequent run.
What changes are included in this PR?
The changes primarily add traps that catch keyboard interrupts and download failures, and revert the partial changes if either is encountered.
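A rough sketch of the trap-based approach, using the variable names from the diff above (the cleanup body is simplified and assumed, not the exact code in this PR):

data_imdb() {
    local imdb_dir="${DATA_DIR}/imdb"
    local imdb_temp_gz="${imdb_dir}/imdb.tgz"
    local imdb_url="https://event.cwi.nl/da/job/imdb.tgz"

    # Roll back partial state if the script is interrupted mid-download.
    trap 'echo "Cleaning up downloaded files..."; rm -f "${imdb_temp_gz}"' INT TERM

    mkdir -p "${imdb_dir}"
    if ! (cd "${imdb_dir}" && wget --continue "${imdb_url}"); then
        echo "Download failed, removing partial file"
        rm -f "${imdb_temp_gz}"
        return 1
    fi

    # Download succeeded: clear the trap so a normal exit keeps the file.
    trap - INT TERM
}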
Are these changes tested?
Yes, these changes have been tested. The following are the results:
For the benchmarks/bench.sh data imdb command:
For the benchmarks/bench.sh data clickbench_1 command:
Are there any user-facing changes?