-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pandas Summarize needs to validate each result #138
Comments
Was pairing w/ @mariekers, who hit this issue, so probably time to fix. Her specific issue was that if you do a elementwise operation that just so happens to work like an aggregation, summarize breaks. For example, import pandas as pd
from siuba import _, summarize
# note that all groups only have 1 row
df = pd.DataFrame({'g': ['a', 'b', 'c'], 'x': [1, 2, 3], 'y': [4,5,6]})
summarize(df.groupby('g'), res = _.x / _.y) Returns
which is pretty embarrassing. Note that fast_summarize raises an informative error, which is way better. For slow summarize, we should just use the result's underlying array (to avoid the assumption that the result's index is [0]). |
I think ultimately the main challenge here is performance. Within each grouping, we might calculate three columns...
Then need to construct a DataFrame
But this will produce crazy results
One way out is to strip out the underlying arrays...
All this finagling is not great for speed, but is exactly why the fast_summarize function exists: it plans out how to handle operations beforehand, so you can plan once per summarize argument, rather than post-hoc instance checking. So might as well instance check here, and prioritize getting fast summarize out of experimental + user testing and documenting it. (plus, similar issues will arise and are likely dealt with in pandas transform etc..) |
Currently, if an expression passed to summarize produces a Series of length 3, only the first entry will be kept. This is because summarize creates a DataFrame with index = [0], and pandas uses this index subset the Series result.
Could do a simple length check on the Series.
Once this is complete, letting Summarize refer back to just-defined columns seems doable... (#20)
Example
The text was updated successfully, but these errors were encountered: