[FYI] Filtering Benchmark #138
Comments
More benchmarks:
In [49]: pk_cols = ['data_provider', 'weather_station', 'weather_variable', 'issue_date_utc', 'value_date_utc']
In [50]: pandas_df = pandas_df.set_index(pk_cols)
In [51]: fletcher_df = fletcher_df.set_index(pk_cols)
In [56]: %timeit pandas_df['value'].mean()
36.3 ms ± 557 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [57]: %timeit fletcher_df['value'].mean()
45 ms ± 330 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Interestingly, pandas has some sort of performance bug for sum!
In [54]: %timeit pandas_df['value'].sum()
267 ms ± 3.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [55]: %timeit fletcher_df['value'].sum()
44.8 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [60]: %timeit (pandas_df['value']*pandas_df['value'])
57.8 ms ± 1.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [61]: %timeit (fletcher_df['value']*fletcher_df['value'])
127 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
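For anyone who wants to reproduce timings like these outside IPython, here is a minimal sketch using only the standard library. It approximates the "mean ± std. dev. of 7 runs, 10 loops each" report; the helper name `time_per_loop` and the stand-in workload are assumptions for illustration, not part of the original session.

```python
import statistics
import timeit


def time_per_loop(func, repeat=7, number=10):
    """Return (mean, std dev) of the per-loop time in seconds.

    Hypothetical helper: `repeat` corresponds to "runs" and `number`
    to "loops each" in the %timeit output.
    """
    # timeit.repeat returns one total duration per run; each run calls func `number` times.
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_loop = [total / number for total in totals]
    # Population standard deviation across the runs.
    return statistics.mean(per_loop), statistics.pstdev(per_loop)


if __name__ == "__main__":
    # Stand-in workload; swap in e.g. lambda: pandas_df['value'].sum()
    values = [float(i) for i in range(100_000)]
    mean, std = time_per_loop(lambda: sum(values))
    print(f"{mean * 1e3:.3g} ms ± {std * 1e3:.3g} ms per loop")
```

Using the same `repeat` and `number` for both the pandas and the fletcher call keeps the two measurements comparable.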
These are of interest, though it would be better to open several issues instead of one collector issue (you can keep this one as a meta-issue). Then we can work through them one by one and gradually make things faster. Most of the behaviour is expected, but not all:
I noticed earlier that Arrow's take is slower than numpy's take, so that might be related (and Wes is speeding up take right now).
Or compare with a nullable-integer-dtyped column, where we got rid of this NaN-handling penalty (although this is only in master).
I might do that once I've written a script to generate some dummy data so others can repro it. I do actually have
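A dummy-data script along these lines might look like the sketch below. It uses only the standard library and reuses the primary-key column names from the benchmarks above; the provider names, variable names, sizes, and value distribution are all made-up assumptions and say nothing about the real dataset.

```python
import random
from datetime import datetime, timedelta, timezone

# Column names taken from the benchmark session in this thread.
PK_COLS = ["data_provider", "weather_station", "weather_variable",
           "issue_date_utc", "value_date_utc"]


def make_dummy_rows(n_stations=3, n_issues=4, horizon_hours=6, seed=0):
    """Build a list of row dicts with a unique key per row (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so others get the same data
    start = datetime(2020, 1, 1, tzinfo=timezone.utc)
    rows = []
    for provider in ("provider_a", "provider_b"):  # made-up names
        for s in range(n_stations):
            for variable in ("temperature", "wind_speed"):  # made-up names
                for i in range(n_issues):
                    issue = start + timedelta(days=i)
                    for h in range(horizon_hours):
                        rows.append({
                            "data_provider": provider,
                            "weather_station": f"station_{s:03d}",
                            "weather_variable": variable,
                            "issue_date_utc": issue,
                            "value_date_utc": issue + timedelta(hours=h),
                            "value": rng.gauss(10.0, 5.0),
                        })
    return rows
```

The resulting list of dicts can be passed to `pandas.DataFrame(rows)` and indexed with `set_index(PK_COLS)` to get a frame shaped like the one benchmarked here; the parameters would need to be scaled up considerably to reach timings in the millisecond range.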
This project has been archived as development has ceased around 2021.
If I convert a pa.Table to a pandas DataFrame I pay for the cost of conversion up front, but then it seems operations such as filtering are 2x faster than on a fletcher DataFrame:
Just posting here in case benchmarks on real data are of interest.