-
Notifications
You must be signed in to change notification settings - Fork 168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix polars #270
Fix polars #270
Conversation
), | ||
( | ||
"Q2", | ||
"SELECT SUM(AdvEngineID), COUNT(*), AVG(ResolutionWidth) FROM hits;", | ||
lambda x: (x["AdvEngineID"].sum(), x.shape[0], x["ResolutionWidth"].mean()), | ||
lambda x: (x["AdvEngineID"].sum(), x.height, x["ResolutionWidth"].mean()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are 2 queries, where in SQL it is in a single query.
), | ||
( | ||
"Q3", | ||
"SELECT AVG(UserID) FROM hits;", | ||
lambda x: x["UserID"].mean(), | ||
lambda x: x["UserID"].mean(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not lazy suitable. Forces a single engine. It is meant for notebooks.
@@ -318,15 +318,13 @@ | |||
.group_by("CounterID") # GROUP BY CounterID | |||
.agg( | |||
[ | |||
pl.col("URL") | |||
.map_elements(lambda y: len(y), return_dtype=pl.Int64) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was wrong. Didn't compute the AVG
and it uses a python function, where we have native expressions for this.
@@ -352,18 +350,14 @@ | |||
.group_by("k") | |||
.agg( | |||
[ | |||
pl.col("Referer").map_elements( | |||
lambda y: len(y), return_dtype=pl.Int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was wrong. Didn't compute the AVG
and it uses a python function, where we have native expressions for this.
stop = timeit.default_timer() | ||
load_time = stop - start | ||
|
||
# 0: No., 1: SQL, 2: Pandas, 3: Polars | ||
queries = queries = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't know why this was.
This comment was marked as resolved.
This comment was marked as resolved.
@rschu1ze Yeah, that's a pandas requirement. I don't know why there is pandas in the Polars benchmark anyway. 😅 I went ahead and removed the pandas code here. We can load parquet with Polars directly without an extra pandas or pyarrow requirement. I verified I can run (on a sliced dataset) locally. The total dataset is too big for my machine. If there is interest I can follow up with a PR where we query directly from parquet/csv or other sources. |
import json | ||
|
||
hits = pd.read_parquet("hits.parquet") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can load parquet with Polars. No need to go via pandas. I also think that load time should reflect loading from disk instead of loading from pandas. Loading from pandas is mostly zero-copy.
28fbc0b
to
d12bfb6
Compare
) | ||
.head(25) | ||
), | ||
"SELECT REGEXP_REPLACE(Referer, '(?-u)^https?://(?:www\\.)?([^/]+)/.*$', '\\1') AS k, AVG(STRLEN(Referer)) AS l, COUNT(*) AS c, MIN(Referer) FROM hits WHERE Referer <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Polars uses to rust Regex which defaults to unicode supported regex, whereas most engines don't. This can have huge search cost differences:
Yes! and if you can run the benchmark c6a.4xlarge, 500gb gp2 (like in datafusion) it would be great!
|
This relates to #264. I actually tried to run polars/benchmark.sh on my corporate c6a.4xlarge machine (using the version in this PR, now that the setup is fixed) ... and it froze my machine so badly that I was unable to connect via SSH or ping it for 2.5 hours, after which I killed the instance forcefully. Perhaps someone else has more luck. |
Yes, this version loads all data in memory which doesn't fit that machine. This was already the case, and I didn't adapt that in the PR. I can also follow up with a PR where we load data from parquet. That doesn't materialize all data and is much less memory intensive. |
That would be nice. Actually, we should try to do changes to the benchmark scripts / queries and the results themselves in the same PR. Otherwise, one of them will be outdated. So I guess my preference is if we could update all polars variants within this PR (and if it helps, I can setup myself a |
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
@rschu1ze I have added parquet as a source as well. Made it so that we create 2 output jsons, indicating the source used. Note that running from parquet has drastically reduced memory usage. The DataFrame case is just very intensive as we are forced to load all data in memory. |
@ritchie46 Thanks ... I'll re-run measurements in the coming days! |
Great, thanks! |
I both dataframes and parquet versions on c6a.4xlarge again but the system hang. Didn't check deeper why, instead I repeated the measurements on a c6a.metal machine. Pushed the measurements into this PR. |
@rschu1ze, the server's freeze means OOM. Please run it with Some queries will be shown as failed, but the results are still good to publish (and then they can be improved). |
Done: #279 |
Hey, I am from the Polars project and was surprised Polars was added to ClickBench. When I looked at the implementation I saw that Polars was incorrectly used (#268). and sometimes even python functions were called where we have a native API for the operations.
This ensures that
map_elements
with python udfsResolves: #268