-
-
Notifications
You must be signed in to change notification settings - Fork 182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: use public pandas API in dataframe.empty #571
Conversation
Thanks for taking an interest and helping out here! Unfortunately, the existence of this function and the convoluted code it contains, is because of pandas' poor performance in forming dataframes to be assigned into. Benchmarking, I find: # main branch
In [2]: %timeit out = fastparquet.dataframe.empty('i4,u2,f4,f2,f4,M8,category', 1000000, cats={6: 5})
903 µs ± 86.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# this branch
In [3]: %timeit out = fastparquet.dataframe.empty('i4,u2,f4,f2,f4,M8,category', 1000000, cats={6: 5})
43.6 ms ± 668 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) Fastparquet does not want this factor of 50 slowdown! I really do hope that Pandas itself creates this functionality, so that we don't have to. |
Yep, thats a pretty big slowdown. I'll see what I can do upstream, and take another try at this in the interim. |
Thanks, @jbrockmendel ! Interestingly, I wrote about this all the way back in 2017. @jreback helped write this initially, and @TomAugspurger has helped keep this module in sync with pandas changes. |
Updated with an implementation that slightly outperforms the status quo
This still accesses internals in a way that is not officially supported, but should be at least future-proof to pandas-dev/pandas#40149 (though not ArrayManager) |
Perfect - that persuades me. I'll leave this a day or so to see if there are other ocmments. |
Thanks @jbrockmendel ! |
Some of what I'm working on in pandas would be simplified if this made it into a released version of fastparquet. Is that on the horizon? |
Not in the next day or two, but yes. |
new_block = block.make_block_same_class(values=values) | ||
# Note: this will break on any ExtensionDtype other than | ||
# Categorical and DatetimeTZ | ||
values = np.empty(shape=shape, dtype=bvalues.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the fix to pandas-dev/pandas#41366 (comment) is to add here
if not isinstance(bvalues, np.ndarray):
# e.g. DatetimeLikeBlock backed by DatetimeArray/TimedeltaArray
values = type(bvalues)._from_sequence(values)
i'll try to reproduce the problem locally after some caffeine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into it. Happy to push a point release to fix pandas' CI, if needed.
I hope the line above doesn't make a copy!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if its a 3rd-party EA then there's no telling, but for the dt64 case this should be copy-free
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reproduced the failure locally and the edit above fixes 2 of the 8 tests. the others look pytz-related.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It it's a copy, fastparquet will be filling out an array that's no longer connected to the actual dataframe storage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It it's a copy, fastparquet will be filling out an array that's no longer connected to the actual dataframe storage?
No, correctness wouldn't be affected.
The current usage of non-public API is breaking the docbuild on a pandas PR: https://github.com/pandas-dev/pandas/pull/40149/checks?check_run_id=2102847175#step:5:152
See also: pandas-dev/pandas#40226
A slightly more performant implementation may become possible following pandas-dev/pandas#39776