
Polars: rechunk and write string views #79

Merged · 3 commits · Jun 27, 2024

Conversation

@ritchie46 ritchie46 commented Jun 17, 2024

Polars has changed, and the current benchmark setup no longer runs it correctly. On every benchmark run, the memory map triggers a full rechunk and a deserialization from Arrow large-string to string-view, which isn't what is meant to be benchmarked here.

Can you re-run Polars with these changes and latest release?

P.S. I noticed that DuckDB creates the tables with the Float data type, so I updated Polars to use Float32 instead of Float64 to keep the comparison apples to apples. I think we should ensure that all solutions use similar data types.

@szarnyasg szarnyasg requested a review from Tmonster June 22, 2024 08:52
@Tmonster Tmonster left a comment


Hi Ritchie,

Thank you for the PR. Apologies for testing Polars incorrectly.
I also noticed that the .ipc files were being written not to the mounted drive but to the EBS-backed instance storage, so I fixed that as well.

Unfortunately, my introduction of the $MOUNT_POINT env variable has caused a minor conflict. Do you mind fixing it? Then I will merge and start another run once #83 is also merged.

@ritchie46

Ah, yes. I will do a rebase. Thanks for setting the correct writing locations.

On another note: I know this benchmark was inherited and has always had somewhat vague startup allowances (e.g. writing to tables/disk, converting string columns to enums/categoricals). I am curious whether all solutions use the tables as-is, or whether some kind of metadata is used along the way.

@Tmonster Tmonster merged commit 02159fb into duckdblabs:master Jun 27, 2024
15 checks passed
@Tmonster

I've looked at a lot of these startup scripts, and a number of them do seem to take the data and convert it to some internal format. Once in that internal format, I'm sure some of these solutions use statistics. What I want to prevent, however, is code designed to make the system run faster for specific queries, for example enabling some flag for the q7 groupby and then disabling it again for q8.

@ritchie46

@Tmonster any update on a rerun?


Tmonster commented Jul 2, 2024

I should have time to set it up and run it tonight, and should be able to update the results by the end of the week.


Tmonster commented Jul 4, 2024

Hi Ritchie, I re-ran the benchmark on Polars v1.0.0 and now I am seeing q3 and q5 produce incorrect results compared to previous versions. You can check out my branch here to look at the results; this line filters out the Polars results for groupby q3 and q5 so the report can run.

I wrote a little script to produce some of the differing chk values. There is a tolerance, but these results fall outside it.

Again, it seems to happen only with datasets that have null values.

I don't really have the bandwidth to find the breaking change in the recent versions and fix it. Is there someone on your team (or you) who could look into this? Again, just groupby q3 and q5 are affected.

┌────────────────┬───────────────────────────────────────────┬─────────────────────────────────────────┬───────────────────────┬─────────┬─────────┐
│      data      │                   chk_1                   │                  chk_2                  │       question        │   v1    │   v2    │
│    varchar     │                  varchar                  │                 varchar                 │        varchar        │ varchar │ varchar │
├────────────────┼───────────────────────────────────────────┼─────────────────────────────────────────┼───────────────────────┼─────────┼─────────┤
│ G1_1e9_1e2_5_0 │ 2849922064.0;474988853.389                │ 2849922064.0;475124704.0                │ sum v1 mean v3 by id3 │ 0.20.31 │ 1.0.0   │
│ G1_1e9_1e2_5_0 │ 2849922064.0;7600000111.0;47498842805.648 │ 2849922064.0;7600000111.0;47271714816.0 │ sum v1:v3 by id6      │ 0.20.31 │ 1.0.0   │
│ G1_1e9_1e2_5_0 │ 2849922064.0;7600000111.0;47498842805.648 │ 2849922064.0;7600000111.0;47271718912.0 │ sum v1:v3 by id6      │ 0.20.31 │ 1.0.0   │
│ G1_1e9_1e2_5_0 │ 2849922064.0;474988853.389                │ 2849922064.0;474947168.0                │ sum v1 mean v3 by id3 │ 0.20.31 │ 1.0.0   │
│ G1_1e7_1e2_5_0 │ 28498857.0;4749467.632                    │ 28498857.0;4749467.5                    │ sum v1 mean v3 by id3 │ 0.20.31 │ 1.0.0   │
│ G1_1e9_1e2_5_0 │ 2849922064.0;7600000111.0;47498842805.648 │ 2849922064.0;7600000111.0;47271714816.0 │ sum v1:v3 by id6      │ 0.19.8  │ 1.0.0   │
│ G1_1e8_1e2_5_0 │ 284994735.0;47500172.658                  │ 284994735.0;47500520.0                  │ sum v1 mean v3 by id3 │ 0.19.8  │ 1.0.0   │
│ G1_1e9_1e2_5_0 │ 2849922064.0;474988853.389                │ 2849922064.0;474947168.0                │ sum v1 mean v3 by id3 │ 0.19.8  │ 1.0.0   │
│ G1_1e8_1e2_5_0 │ 284994735.0;759971497.0;4750083909.4      │ 284994735.0;759971497.0;4749809152.0    │ sum v1:v3 by id6      │ 0.20.31 │ 1.0.0   │
│ G1_1e7_1e2_5_0 │ 28498857.0;4749467.632                    │ 28498857.0;4749467.5                    │ sum v1 mean v3 by id3 │ 0.19.8  │ 1.0.0   │
│ G1_1e7_1e2_5_0 │ 28498857.0;4749467.632                    │ 28498857.0;4749468.0                    │ sum v1 mean v3 by id3 │ 0.19.8  │ 1.0.0   │
│ G1_1e7_1e2_5_0 │ 28498857.0;75988394.0;474969574.048       │ 28498857.0;75988394.0;474969856.0       │ sum v1:v3 by id6      │ 0.19.8  │ 1.0.0   │
│ G1_1e8_1e2_5_0 │ 284994735.0;47500172.658                  │ 284994735.0;47500004.0                  │ sum v1 mean v3 by id3 │ 0.19.8  │ 1.0.0   │
│ G1_1e9_1e2_5_0 │ 2849922064.0;7600000111.0;47498842805.648 │ 2849922064.0;7600000111.0;47271718912.0 │ sum v1:v3 by id6      │ 0.19.8  │ 1.0.0   │
│ G1_1e7_1e2_5_0 │ 28498857.0;4749467.632                    │ 28498857.0;4749468.0                    │ sum v1 mean v3 by id3 │ 0.20.31 │ 1.0.0   │
│ G1_1e8_1e2_5_0 │ 284994735.0;47500172.658                  │ 284994735.0;47500004.0                  │ sum v1 mean v3 by id3 │ 0.20.31 │ 1.0.0   │
│ G1_1e7_1e2_5_0 │ 28498857.0;75988394.0;474969574.048       │ 28498857.0;75988394.0;474969856.0       │ sum v1:v3 by id6      │ 0.20.31 │ 1.0.0   │
│ G1_1e8_1e2_5_0 │ 284994735.0;47500172.658                  │ 284994735.0;47500520.0                  │ sum v1 mean v3 by id3 │ 0.20.31 │ 1.0.0   │
│ G1_1e8_1e2_5_0 │ 284994735.0;759971497.0;4750083909.4      │ 284994735.0;759971497.0;4749809152.0    │ sum v1:v3 by id6      │ 0.19.8  │ 1.0.0   │
│ G1_1e9_1e2_5_0 │ 2849922064.0;474988853.389                │ 2849922064.0;475124704.0                │ sum v1 mean v3 by id3 │ 0.19.8  │ 1.0.0   │
├────────────────┴───────────────────────────────────────────┴─────────────────────────────────────────┴───────────────────────┴─────────┴─────────┤
│ 20 rows                                                                                                                                6 columns │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
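The tolerance comparison over chk values like those in the table above could be sketched roughly as below; the relative tolerance of 1e-7 is an illustrative assumption, not the harness's actual setting:

```python
def chk_within_tolerance(chk_1: str, chk_2: str, rel_tol: float = 1e-7) -> bool:
    """Compare two semicolon-separated chk strings field by field."""
    a = [float(x) for x in chk_1.split(";")]
    b = [float(x) for x in chk_2.split(";")]
    if len(a) != len(b):
        return False
    # Relative comparison, written without division to tolerate zero fields.
    return all(abs(x - y) <= rel_tol * max(abs(x), abs(y)) for x, y in zip(a, b))

# Two rows from the table above: the first is close, the second is far off.
print(chk_within_tolerance("28498857.0;4749467.632", "28498857.0;4749467.5"))        # True
print(chk_within_tolerance("2849922064.0;474988853.389", "2849922064.0;475124704.0"))  # False
```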

@ritchie46

I am not sure the previous versions are more "correct". Is there an exact correct answer? This might be due to associativity reasons.
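Floating-point summation is not associative, so chunk ordering alone can change an aggregate. A minimal illustration, emulating float32 accumulation by round-tripping through `struct` (the values are made up for the demo):

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32."""
    return struct.unpack("f", struct.pack("f", x))[0]

def f32_sum(values):
    acc = 0.0
    for v in values:
        acc = f32(acc + v)
    return acc

values = [1e8] + [1.0] * 64
print(f32_sum(values))            # 100000000.0: each +1.0 is lost next to 1e8
print(f32_sum(reversed(values)))  # 100000064.0: the ones accumulate first
```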

@ritchie46

@Tmonster we switched from Float64 to Float32. I think this is the reason for the differences you see.
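This is consistent with the table above: near 4.7e10 the spacing between adjacent float32 values is 4096, which is why the v1.0.0 chk values come out as round integers, and near 4.7e6 the spacing is 0.5, matching results like 4749467.5. A quick check on one chk value from the table, again emulating float32 with `struct`:

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32."""
    return struct.unpack("f", struct.pack("f", x))[0]

chk = 47498842805.648    # chk value from the v0.20.31 (float64) column
print(f32(chk))          # nearest float32: an integer, fractional part gone
print(f32(chk) - chk)    # rounding error from the cast alone
```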


Tmonster commented Jul 5, 2024

I'm running the benchmark again because I couldn't reproduce the issue locally. I will update in an hour or so.


Tmonster commented Jul 5, 2024

> I am not sure the previous versions are more "correct". Is there an exact correct answer? This might be due to associativity reasons.

There isn't an exact correct answer, since with floating-point arithmetic we allow for some tolerance. I think most of these results are within the tolerance.

I re-ran the benchmark but am still getting invalid results. I'll try to narrow down the exact question and dataset further, since the most recent results don't produce the same table as above. I can publish partial results, but I imagine you want to wait until all results are in?

@ritchie46

> I can publish partial results

It at least would make more sense than the current results. :P

We test against these queries in our CI, so I am very surprised that anything differs beyond the floating-point change. Did you run the new Float32 code on older versions as well, or do you compare against older data?
