Making the data_imdb and clickbench_1 functions atomic. #14129
Conversation
Thank you, I have tried and it's working as expected.
I have just a nitpicking suggestion here: it would be good to also remove the dataset's parent directory
data/
-- imdb/ # This folder is not deleted
---- dataset.tgz # Now only this file is cleaned up
Even I was thinking the same! But I wanted to ensure first that the same directory isn't being used to store other files. I also planned to make the cleanup function a util function, so that it can be used wherever atomicity is needed. Would like your thoughts on this!
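For illustration, a minimal sketch of what such a reusable cleanup helper could look like (the name clean_up_dataset_dir and the way it is registered are my assumptions, not part of this PR):

# Hypothetical helper: roll back a partially created dataset directory.
clean_up_dataset_dir() {
    local dataset_dir="$1"
    echo "Cleaning up ${dataset_dir}..."
    rm -rf "${dataset_dir}"
}

# Assumed usage inside a data_* function, registered via trap:
# trap 'clean_up_dataset_dir "${imdb_dir}"' INT TERM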
I think the convention followed now is that the same subdirectory under
Thank you very much @Spaarsh and @2010YOUY01
I tested this locally and it seems to work for CTRL-C
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion2$ ./benchmarks/bench.sh data clickbench_1
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: clickbench_1
DATA_DIR: /Users/andrewlamb/Software/datafusion2/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
Checking hits.parquet...... downloading https://datasets.clickhouse.com/hits_compatible/hits.parquet (14GB) ... --2025-01-17 12:59:34-- https://datasets.clickhouse.com/hits_compatible/hits.parquet
Resolving datasets.clickhouse.com (datasets.clickhouse.com)... 2606:4700:3108::ac42:2b07, 2606:4700:3108::ac42:28f9, 172.66.43.7, ...
Connecting to datasets.clickhouse.com (datasets.clickhouse.com)|2606:4700:3108::ac42:2b07|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14779976446 (14G) [binary/octet-stream]
Saving to: ‘hits.parquet’
hits.parquet 1%[> ] 145.14M 39.8MB/s eta 6m 11s ^C
Cleaning up downloaded files...
However, I think this solution has the downside that it will only fix partial downloads when the shell script is interrupted (i.e. hit Ctrl-C). It would still leave around invalid data if the script was SIGKILL'd (or maybe if the download timed out, I didn't check)
I am also worried about the complexity introduced by adding trap
I also think checking for partial download of clickbench is already done (by checking file size). Perhaps we can add a similar check for file size for the imdb.tgz file too
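A sketch of what that could look like, mirroring the existing hits.parquet check and using the imdb_dir / imdb_url variables from data_imdb (the expected size here is a placeholder, not a verified value):

# Sketch: verify imdb.tgz by size instead of just checking existence.
EXPECTED_IMDB_SIZE="<known imdb.tgz size in bytes>"   # placeholder; the real byte count would be hard coded
OUTPUT_SIZE=$(wc -c "${imdb_temp_gz}" 2>/dev/null | awk '{print $1}')
if test "${OUTPUT_SIZE}" = "${EXPECTED_IMDB_SIZE}"; then
    echo " imdb.tgz already downloaded"
else
    # Missing or partial file: resume the download into the imdb directory
    mkdir -p "${imdb_dir}"
    (cd "${imdb_dir}" && wget --continue "${imdb_url}")
fi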
wget --continue ${URL}
fi
echo " Done"
if ! wget --continue ${URL}; then
I think the check above tests for the file size and already detects partial / failed previous downloads
if test "${OUTPUT_SIZE}" = "14779976446"; then
So I am not sure this is necessary
# Downloads the csv.gz files IMDB datasets from Peter Boncz's homepage(one of the JOB paper authors)
# https://event.cwi.nl/da/job/imdb.tgz
data_imdb() {
    local imdb_dir="${DATA_DIR}/imdb"
    local imdb_temp_gz="${imdb_dir}/imdb.tgz"
    local imdb_url="https://event.cwi.nl/da/job/imdb.tgz"

    # Set trap with parameter
Perhaps we can add a check for file size of the imdb.tgz file rather than just checking for its existence
if [ ! -f "${imdb_dir}/imdb.tgz" ]; then
@alamb your suggestions do make sense. Perhaps all that is needed is a size checking function. My addition would be to dynamically check for the expected size of the file to be transferred by running the following command first:
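A minimal sketch of one way to query the size up front, assuming the value is read from the Content-Length header with curl (this exact command is my assumption):

# Sketch: ask the server for the expected size before downloading.
expected_size=$(curl -sI "${imdb_url}" | tr -d '\r' | awk 'tolower($1) == "content-length:" {print $2}')
echo "Server reports ${expected_size} bytes for imdb.tgz"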
And for cases where a 404 response is returned, we can check for a 404-type status and, again, purge the downloaded file. This can apply to other 4xx and 5xx responses as well. The above two points combined solve the problem of partial downloads (without relying on statically entered values) and of erroneous responses as well.
Since the file isn't expected to change, I would suggest simply hard coding the expected size in the bash script. As for 404 being returned, I don't understand why that would need special handling 🤔 If we had the hard-coded expected size and a 404 was returned, then the output file wouldn't be the right size, so running the script again would ensure it was correct 🤔
The special handling for 404 was for the case where we went with the dynamic size approach 😅. My only concern is that we've already faced one issue due to the website owners changing the URL. If they happen to change the dataset size as well, the static size will lead to unexpected behavior again. I'm just trying to cover all the bases so that no issue along these lines occurs again. Or perhaps I'm focusing too much on a simple script 😅. But if we go for the static size, the solution is pretty straightforward. Should I go forward with it then?
That is my suggestion. I agree it won't cover all the cases, but I think it will catch the common problems and it is quite simple.
BTW thank you very much for working on this
Agreed! I'll go forward with your suggested approach. Thanks for your patience!!
The suggested changes have been implemented and committed via another PR, #14225. Hence I am closing this PR.
Which issue does this PR close?
Closes #14128
Rationale for this change
Due to non-atomic downloads, the user would need to manually remove the files/folders created by the script in order to avoid the "file already exists" error on a subsequent run.
What changes are included in this PR?
The changes primarily add traps that catch keyboard interrupts and download failures, and revert the partial changes if either is encountered.
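A rough sketch of the trap-based approach, using the variable names from the diff above (the cleanup body is simplified and assumed, not the exact code in this PR):

data_imdb() {
    local imdb_dir="${DATA_DIR}/imdb"
    local imdb_temp_gz="${imdb_dir}/imdb.tgz"
    local imdb_url="https://event.cwi.nl/da/job/imdb.tgz"

    # Roll back partial state if the script is interrupted mid-download.
    trap 'echo "Cleaning up downloaded files..."; rm -f "${imdb_temp_gz}"' INT TERM

    mkdir -p "${imdb_dir}"
    if ! (cd "${imdb_dir}" && wget --continue "${imdb_url}"); then
        echo "Download failed, removing partial file"
        rm -f "${imdb_temp_gz}"
        return 1
    fi

    # Download succeeded: clear the trap so a normal exit keeps the file.
    trap - INT TERM
}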
Are these changes tested?
Yes, these changes have been tested. The following are the results:
For the benchmarks/bench.sh data imdb command:
For the benchmarks/bench.sh data clickbench_1 command:
Are there any user-facing changes?