-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Polars: rechunk and write string views #79
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Ritchie,
Thank you for the PR. Apologies for incorrectly testing polars.
I noticed also that the .ipc
files were not being writing to the mounted drive, but rather to the EBS backed instance, so I fixed that as well.
Unfortunately my introduction of the $MOUNT_POINT env variable has caused a minor conflict. Do you mind fixing it? Then I will merge and start another run once #83 is also merged
Ah, yes. I will do a rebase. Thanks for setting the correct writing locations. On another note. I know this benchmark was inherited and always has a bit of vague startup allowances. (E.g. writing to tables/disks, converting string columns to enum/categories). I am curious if all solutions use the tables as is, or that some kind of metadata is used along the way. |
I've looked at a lot of these startup scripts, and a number of them do seem to take the data and convert it to some internal format. Once in the internal format, I'm sure some of these solutions use stats. What I want to prevent, however, is code designed to make the system run faster for specific queries. For example, setting some flag setting for q7 groupby, then disabling it again for q8. |
@Tmonster any update on a rerun? |
Should have time to set it up and run it tonight. Should be able to update the results by the end of the week |
Hi Ritchie, I re-ran the benchmark on polars v1.0.0 and now I am having issues with q3 and q5 producing incorrect results compared to different versions. You can check out my branch here to look at the results, but this line filters out polars results for group by q3 and q5 so the report can run. I wrote a little script to produce some of the different chk values. There is a tolerance, but these results are outside the tolerance Again it seems to happen only with datasets that have null values I don't really have the bandwidth to see what breaking change there was in the recent versions and try to fix it. Is there someone on your team (or you) that could look into this? Again just groupby q3 and q5 are affected.
|
I am not sure the previous versions are more "correct". Is there an exact correct answer? This might be due to associativity reasons. |
@Tmonster we switched from float64 to float32. I think this the reason for the differences you see. |
I'm running the benchmark again because I couldn't manage to reproduce the issue locally. Will update in an hour or so |
I re-ran the benchmark, but am still getting invalid results. I'll try to narrow down more the exact question and dataset since the most recent results don't produce the same table as above. I can publish partial results, but I imagine you want to wait until all results are in? |
It at least would make more sense than the current results. :P We test against these queries in our CI, so I am very surprised that there is something different, except for the floating point change. Did you run the new float32 on older versions as well? Or do you compare against older data? |
Polars has changed and the current benchmark run doesn't run correctly anymore. On every benchmark run the memory map triggers a full rechunk and deserialization from arrow large string to string view, which isn't what is benchmarked here.
Can you re-run Polars with these changes and latest release?
P.S. I noted that duckb creates the tables with
Float
data type, so I updated Polars to useFloat32
instead ofFloat64
to make it apples to apples. I think we should ensure that all solutions use similar data types.