
Making the data_imdb and clickbench_1 functions atomic. #14129

Conversation

@Spaarsh (Contributor) commented Jan 14, 2025

Which issue does this PR close?

Closes #14128

Rationale for this change

Because the downloads are not atomic, an interrupted run leaves partial files behind, and the user has to manually remove the files/folders created by the script in order to avoid the "file already exists" error.

What changes are included in this PR?

The changes add traps that catch keyboard interrupts and failures during process completion, and revert the partially created files if either is encountered.
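Roughly, the mechanism looks like the following minimal sketch (illustrative, not the exact diff; DATA_DIR is the script's existing data-directory variable):

    # Minimal sketch of the trap-based cleanup (illustrative, not the exact diff)
    data_imdb() {
        local imdb_dir="${DATA_DIR}/imdb"
        local imdb_temp_gz="${imdb_dir}/imdb.tgz"
        local imdb_url="https://event.cwi.nl/da/job/imdb.tgz"

        # On Ctrl-C (INT), termination (TERM), or a failing command (ERR),
        # remove the partially downloaded archive before exiting.
        trap 'echo "Cleaning up downloaded files..."; rm -f "${imdb_temp_gz}"; exit 1' INT TERM ERR

        mkdir -p "${imdb_dir}"
        wget --continue "${imdb_url}" -O "${imdb_temp_gz}"

        # Success: disarm the trap so a normal exit leaves the file in place.
        trap - INT TERM ERR
    }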

Are these changes tested?

Yes, these changes have been tested. The following are the results:

For the benchmarks/bench.sh data imdb command:

20241230_012510

For the benchmarks/bench.sh data clickbench_1 command:

20241230_012240

Are there any user-facing changes?

@2010YOUY01 (Contributor) left a comment


Thank you, I have tried and it's working as expected.

I have just one nitpicking suggestion here: it would be good to also remove the dataset's parent directory.

data/
-- imdb/ # This folder is not deleted
---- dataset.tgz # Now only this file is cleaned up

@Spaarsh (Contributor, Author) commented Jan 15, 2025

I was thinking the same! But I first wanted to ensure that the same directory isn't being used to store other files. Since I would be using the rm -rf command here, that data would have been removed too, without any warning. There was also the concern of making it easier for subsequent developers to work on this: in case the imdb directory finds a new use case, the function would have to be changed at that point.

I also planned to make the cleanup function a util function, so that it can be used wherever atomicity is needed in bench.sh; hence I wanted to make it as general as possible. Something like the sketch below.
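For example, such a hypothetical helper could look roughly like this (names are illustrative only; nothing here was committed):

    # Hypothetical reusable helper: register paths to remove if the
    # surrounding download is interrupted or fails.
    cleanup_on_failure() {
        # "$*" is expanded now (double quotes), so the trap body does not
        # depend on function locals; paths with spaces would need more care.
        trap "echo 'Cleaning up downloaded files...'; rm -rf $*; exit 1" INT TERM ERR
    }

    # Usage inside any data_* function:
    #   cleanup_on_failure "${imdb_temp_gz}"
    #   wget --continue "${imdb_url}" -O "${imdb_temp_gz}"
    #   trap - INT TERM ERR   # disarm once the download succeeds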

Would like your thoughts on this!
Thanks!

@2010YOUY01 (Contributor) commented:

> I was thinking the same! But I first wanted to ensure that the same directory isn't being used to store other files.

I think the convention followed now is that the same subdirectory under data/ can't hold different datasets.
However, I agree it’s safer to address this later, when we can ensure all datasets follow this rule.

@alamb (Contributor) left a comment


Thank you very much @Spaarsh and @2010YOUY01

I tested this locally and it seems to work for CTRL-C:

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ ./benchmarks/bench.sh data clickbench_1
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: clickbench_1
DATA_DIR: /Users/andrewlamb/Software/datafusion2/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
Checking hits.parquet...... downloading https://datasets.clickhouse.com/hits_compatible/hits.parquet (14GB) ... --2025-01-17 12:59:34--  https://datasets.clickhouse.com/hits_compatible/hits.parquet
Resolving datasets.clickhouse.com (datasets.clickhouse.com)... 2606:4700:3108::ac42:2b07, 2606:4700:3108::ac42:28f9, 172.66.43.7, ...
Connecting to datasets.clickhouse.com (datasets.clickhouse.com)|2606:4700:3108::ac42:2b07|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14779976446 (14G) [binary/octet-stream]
Saving to: ‘hits.parquet’

hits.parquet                                                 1%[>                                                                                                                                     ] 145.14M  39.8MB/s    eta 6m 11s ^C
Cleaning up downloaded files...

However, I think this solution has the downside that it will only fix partial downloads when the shell script is interrupted (i.e. hit Ctrl-C). It would still leave invalid data around if the script were SIGKILL'd (or maybe if the download timed out; I didn't check).

I am also worried about the complexity introduced by adding trap.

I also think checking for a partial download of clickbench is already done (by checking the file size). Perhaps we can add a similar file-size check for the imdb.tgz file too.

wget --continue ${URL}
fi
echo " Done"
if ! wget --continue ${URL}; then
Review comment (on the diff above):

I think the check above tests for the file size and already detects partial / failed previous downloads

    if test "${OUTPUT_SIZE}" = "14779976446"; then

So I am not sure this is necessary

# Downloads the csv.gz IMDB dataset files from Peter Boncz's homepage (one of the JOB paper authors)
# https://event.cwi.nl/da/job/imdb.tgz
data_imdb() {
    local imdb_dir="${DATA_DIR}/imdb"
    local imdb_temp_gz="${imdb_dir}/imdb.tgz"
    local imdb_url="https://event.cwi.nl/da/job/imdb.tgz"

    # Set trap with parameter
Review comment (on the diff above):

Perhaps we can add a check for the file size of the imdb.tgz file, rather than just checking for its existence:

        if [ ! -f "${imdb_dir}/imdb.tgz" ]; then
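For illustration, the existence check could become a size check roughly like this (a sketch; EXPECTED_IMDB_SIZE is a hypothetical constant, and how to obtain it is discussed below):

    # Sketch: re-download unless the file exists with the expected size.
    actual_size=$(wc -c 2>/dev/null < "${imdb_dir}/imdb.tgz" | tr -d ' ')
    if [ "${actual_size}" != "${EXPECTED_IMDB_SIZE}" ]; then
        # Missing, truncated, or corrupt archive: fetch it again.
        rm -f "${imdb_dir}/imdb.tgz"
        wget "${imdb_url}" -O "${imdb_dir}/imdb.tgz"
    fi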

@Spaarsh (Contributor, Author) commented Jan 18, 2025

@alamb your suggestions do make sense. Perhaps all that is needed is a size-checking function. My addition would be to dynamically check the expected size of the file to be transferred by running the following command first:

datafusion$ curl -sI https://event.cwi.nl/da/job/imdb.tgz | grep -i content-length
Content-Length: 1263193115

And for cases where a 404 response is returned, we can check for the 404-type status and, again, purge the downloaded file. This can apply to other 4xx and 5xx responses as well.

The above two points combined solve the problem of partial downloads (without relying on statically entered values) as well as erroneous responses. A sketch of what I mean is below.
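A sketch of that dynamic variant (illustrative only; this approach was not merged):

    # Ask the server for the expected size; strip the trailing CR from the header.
    expected_size=$(curl -sI "${imdb_url}" | grep -i '^content-length:' | awk '{print $2}' | tr -d '\r')
    actual_size=$(wc -c 2>/dev/null < "${imdb_temp_gz}" | tr -d ' ')
    if [ -z "${expected_size}" ] || [ "${actual_size}" != "${expected_size}" ]; then
        # No Content-Length (e.g. an error response) or a size mismatch:
        # purge the file and download it again.
        rm -f "${imdb_temp_gz}"
        wget "${imdb_url}" -O "${imdb_temp_gz}"
    fi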

@alamb (Contributor) commented Jan 18, 2025

> @alamb your suggestions do make sense. Perhaps all that is needed is a size-checking function. My addition would be to dynamically check the expected size of the file to be transferred by running the following command first:

Since the file isn't expected to change, I would suggest simply hard-coding the expected size in the bash script.

As for 404 being returned, I don't understand why that would need special handling 🤔 If we had the hard-coded expected size and a 404 was returned, then the output file wouldn't be the right size, so running the script again would ensure it was correct 🤔
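In other words, the hard-coded variant mirrors the existing clickbench check. As a sketch (1263193115 is the Content-Length reported earlier in this thread):

    OUTPUT_SIZE=$(wc -c 2>/dev/null < "${imdb_temp_gz}" | tr -d ' ')
    if test "${OUTPUT_SIZE}" = "1263193115"; then
        echo " imdb.tgz already downloaded"
    else
        # A 404/5xx body or a partial file cannot have this exact size,
        # so re-running the download fixes it; no special-casing needed.
        rm -f "${imdb_temp_gz}"
        wget "${imdb_url}" -O "${imdb_temp_gz}"
    fi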

@Spaarsh (Contributor, Author) commented Jan 18, 2025

The special handling for 404 was for the case where we went with the dynamic size approach 😅. My only concern is that we've already faced one issue due to the website owners changing the URL. If they happen to change the dataset size as well, the static size will lead to unexpected behavior again. I'm just trying to cover all the bases so that no issue along these lines occurs again. Or perhaps I'm focusing too much on a simple script 😅.

But if we go for the static size, the solution is pretty straightforward. Should I go forward with it then?

@alamb (Contributor) commented Jan 18, 2025

> But if we go for the static size, the solution is pretty straightforward. Should I go forward with it then?

That is my suggestion. I agree it won't cover all the cases, but I think it will catch the common problems, and it is quite simple.

@alamb (Contributor) commented Jan 18, 2025

BTW thank you very much for working on this

@Spaarsh (Contributor, Author) commented Jan 18, 2025

> But if we go for the static size, the solution is pretty straightforward. Should I go forward with it then?

> That is my suggestion. I agree it won't cover all the cases, but I think it will catch the common problems, and it is quite simple.

Agreed! I'll go forward with your suggested approach. Thanks for your patience!!

@Spaarsh (Contributor, Author) commented Jan 21, 2025

The suggested changes have been implemented and committed via another PR, #14225. Hence I am closing this PR.

@Spaarsh closed this Jan 21, 2025