
Polars: rechunk and write string views #79

Merged · 3 commits · Jun 27, 2024

Conversation

@ritchie46 ritchie46 commented Jun 17, 2024

Polars has changed, and the current benchmark setup no longer runs it correctly. On every benchmark run, the memory map triggers a full rechunk and a deserialization from Arrow large-string to string-view, which isn't what is meant to be benchmarked here.

Can you re-run Polars with these changes and latest release?

P.S. I noticed that DuckDB creates the tables with the Float data type, so I updated Polars to use Float32 instead of Float64 to keep the comparison apples to apples. I think we should ensure that all solutions use similar data types.

@szarnyasg szarnyasg requested a review from Tmonster June 22, 2024 08:52
@Tmonster Tmonster left a comment


Hi Ritchie,

Thank you for the PR. Apologies for testing Polars incorrectly.
I also noticed that the .ipc files were being written not to the mounted drive but to the EBS-backed instance storage, so I fixed that as well.

Unfortunately, my introduction of the $MOUNT_POINT env variable has caused a minor conflict. Do you mind fixing it? Then I will merge and start another run once #83 is also merged.

@ritchie46

Ah, yes. I will do a rebase. Thanks for setting the correct writing locations.

On another note: I know this benchmark was inherited and has always had somewhat vague startup allowances (e.g. writing to tables/disk, converting string columns to enums/categoricals). I am curious whether all solutions use the tables as-is, or whether some kind of metadata is used along the way.

@Tmonster Tmonster merged commit 02159fb into duckdblabs:master Jun 27, 2024
15 checks passed
@Tmonster

I've looked at a lot of these startup scripts, and a number of them do seem to take the data and convert it to some internal format. Once in that internal format, I'm sure some of these solutions use statistics. What I want to prevent, however, is code designed to make the system run faster for specific queries, for example enabling some flag for the q7 groupby and then disabling it again for q8.

@ritchie46

@Tmonster any update on a rerun?


Tmonster commented Jul 2, 2024

I should have time to set it up and run it tonight, and should be able to update the results by the end of the week.


Tmonster commented Jul 4, 2024

Hi Ritchie, I re-ran the benchmark on Polars v1.0.0 and now I am seeing q3 and q5 produce incorrect results compared to previous versions. You can check out my branch here to look at the results; this line filters out the Polars results for groupby q3 and q5 so the report can run.

I wrote a little script to produce some of the differing chk values. There is a tolerance, but these results fall outside it.

Again, it seems to happen only with datasets that have null values.

I don't really have the bandwidth to find the breaking change in the recent versions and fix it. Is there someone on your team (or you) who could look into this? Again, just groupby q3 and q5 are affected.

┌────────────────┬───────────────────────────────────────────┬─────────────────────────────────────────┬───────────────────────┬─────────┬─────────┐
│      data      │                   chk_1                   │                  chk_2                  │       question        │   v1    │   v2    │
│    varchar     │                  varchar                  │                 varchar                 │        varchar        │ varchar │ varchar │
├────────────────┼───────────────────────────────────────────┼─────────────────────────────────────────┼───────────────────────┼─────────┼─────────┤
│ G1_1e9_1e2_5_0 │ 2849922064.0;474988853.389                │ 2849922064.0;475124704.0                │ sum v1 mean v3 by id3 │ 0.20.31 │ 1.0.0   │
│ G1_1e9_1e2_5_0 │ 2849922064.0;7600000111.0;47498842805.648 │ 2849922064.0;7600000111.0;47271714816.0 │ sum v1:v3 by id6      │ 0.20.31 │ 1.0.0   │
│ G1_1e9_1e2_5_0 │ 2849922064.0;7600000111.0;47498842805.648 │ 2849922064.0;7600000111.0;47271718912.0 │ sum v1:v3 by id6      │ 0.20.31 │ 1.0.0   │
│ G1_1e9_1e2_5_0 │ 2849922064.0;474988853.389                │ 2849922064.0;474947168.0                │ sum v1 mean v3 by id3 │ 0.20.31 │ 1.0.0   │
│ G1_1e7_1e2_5_0 │ 28498857.0;4749467.632                    │ 28498857.0;4749467.5                    │ sum v1 mean v3 by id3 │ 0.20.31 │ 1.0.0   │
│ G1_1e9_1e2_5_0 │ 2849922064.0;7600000111.0;47498842805.648 │ 2849922064.0;7600000111.0;47271714816.0 │ sum v1:v3 by id6      │ 0.19.8  │ 1.0.0   │
│ G1_1e8_1e2_5_0 │ 284994735.0;47500172.658                  │ 284994735.0;47500520.0                  │ sum v1 mean v3 by id3 │ 0.19.8  │ 1.0.0   │
│ G1_1e9_1e2_5_0 │ 2849922064.0;474988853.389                │ 2849922064.0;474947168.0                │ sum v1 mean v3 by id3 │ 0.19.8  │ 1.0.0   │
│ G1_1e8_1e2_5_0 │ 284994735.0;759971497.0;4750083909.4      │ 284994735.0;759971497.0;4749809152.0    │ sum v1:v3 by id6      │ 0.20.31 │ 1.0.0   │
│ G1_1e7_1e2_5_0 │ 28498857.0;4749467.632                    │ 28498857.0;4749467.5                    │ sum v1 mean v3 by id3 │ 0.19.8  │ 1.0.0   │
│ G1_1e7_1e2_5_0 │ 28498857.0;4749467.632                    │ 28498857.0;4749468.0                    │ sum v1 mean v3 by id3 │ 0.19.8  │ 1.0.0   │
│ G1_1e7_1e2_5_0 │ 28498857.0;75988394.0;474969574.048       │ 28498857.0;75988394.0;474969856.0       │ sum v1:v3 by id6      │ 0.19.8  │ 1.0.0   │
│ G1_1e8_1e2_5_0 │ 284994735.0;47500172.658                  │ 284994735.0;47500004.0                  │ sum v1 mean v3 by id3 │ 0.19.8  │ 1.0.0   │
│ G1_1e9_1e2_5_0 │ 2849922064.0;7600000111.0;47498842805.648 │ 2849922064.0;7600000111.0;47271718912.0 │ sum v1:v3 by id6      │ 0.19.8  │ 1.0.0   │
│ G1_1e7_1e2_5_0 │ 28498857.0;4749467.632                    │ 28498857.0;4749468.0                    │ sum v1 mean v3 by id3 │ 0.20.31 │ 1.0.0   │
│ G1_1e8_1e2_5_0 │ 284994735.0;47500172.658                  │ 284994735.0;47500004.0                  │ sum v1 mean v3 by id3 │ 0.20.31 │ 1.0.0   │
│ G1_1e7_1e2_5_0 │ 28498857.0;75988394.0;474969574.048       │ 28498857.0;75988394.0;474969856.0       │ sum v1:v3 by id6      │ 0.20.31 │ 1.0.0   │
│ G1_1e8_1e2_5_0 │ 284994735.0;47500172.658                  │ 284994735.0;47500520.0                  │ sum v1 mean v3 by id3 │ 0.20.31 │ 1.0.0   │
│ G1_1e8_1e2_5_0 │ 284994735.0;759971497.0;4750083909.4      │ 284994735.0;759971497.0;4749809152.0    │ sum v1:v3 by id6      │ 0.19.8  │ 1.0.0   │
│ G1_1e9_1e2_5_0 │ 2849922064.0;474988853.389                │ 2849922064.0;475124704.0                │ sum v1 mean v3 by id3 │ 0.19.8  │ 1.0.0   │
├────────────────┴───────────────────────────────────────────┴─────────────────────────────────────────┴───────────────────────┴─────────┴─────────┤
│ 20 rows                                                                                                                                6 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
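The tolerance comparison over chk values like those in the table above could be sketched roughly as below; the relative tolerance of 1e-7 is an illustrative assumption, not the harness's actual setting:

```python
def chk_within_tolerance(chk_1: str, chk_2: str, rel_tol: float = 1e-7) -> bool:
    """Compare two semicolon-separated chk strings field by field."""
    a = [float(x) for x in chk_1.split(";")]
    b = [float(x) for x in chk_2.split(";")]
    if len(a) != len(b):
        return False
    # Relative comparison, written without division to tolerate zero fields.
    return all(abs(x - y) <= rel_tol * max(abs(x), abs(y)) for x, y in zip(a, b))

# Two rows from the table above: the first is close, the second is far off.
print(chk_within_tolerance("28498857.0;4749467.632", "28498857.0;4749467.5"))        # True
print(chk_within_tolerance("2849922064.0;474988853.389", "2849922064.0;475124704.0"))  # False
```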

@ritchie46

I am not sure the previous versions are more "correct". Is there an exact correct answer? This might be due to associativity reasons.
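Floating-point summation is not associative, so chunk ordering alone can change an aggregate. A minimal illustration, emulating float32 accumulation by round-tripping through `struct` (the values are made up for the demo):

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32."""
    return struct.unpack("f", struct.pack("f", x))[0]

def f32_sum(values):
    acc = 0.0
    for v in values:
        acc = f32(acc + v)
    return acc

values = [1e8] + [1.0] * 64
print(f32_sum(values))            # 100000000.0: each +1.0 is lost next to 1e8
print(f32_sum(reversed(values)))  # 100000064.0: the ones accumulate first
```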

@ritchie46

@Tmonster we switched from Float64 to Float32. I think this is the reason for the differences you see.
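This is consistent with the table above: near 4.7e10 the spacing between adjacent float32 values is 4096, which is why the v1.0.0 chk values come out as round integers, and near 4.7e6 the spacing is 0.5, matching results like 4749467.5. A quick check on one chk value from the table, again emulating float32 with `struct`:

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32."""
    return struct.unpack("f", struct.pack("f", x))[0]

chk = 47498842805.648    # chk value from the v0.20.31 (float64) column
print(f32(chk))          # nearest float32: an integer, fractional part gone
print(f32(chk) - chk)    # rounding error from the cast alone
```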


Tmonster commented Jul 5, 2024

I'm running the benchmark again because I couldn't reproduce the issue locally. I will update in an hour or so.


Tmonster commented Jul 5, 2024

> I am not sure the previous versions are more "correct". Is there an exact correct answer? This might be due to associativity reasons.

There isn't an exact correct answer, since with floating-point arithmetic we allow for some tolerance. I think most of these results are within the tolerance.

I re-ran the benchmark but am still getting invalid results. I'll try to narrow down the exact question and dataset further, since the most recent results don't produce the same table as above. I can publish partial results, but I imagine you want to wait until all results are in?

@ritchie46

> I can publish partial results

It at least would make more sense than the current results. :P

We test against these queries in our CI, so I am very surprised that anything differs beyond the floating-point change. Did you run the new Float32 code on older versions as well, or do you compare against older data?
