Releases: machow/siuba
Experimental Symbolic autocompletion
Fix lhs ops, support kwargs in sql count
Small fix for summarize, w/ Series results
See issue #138. This release ensures summarize...
- validates results are scalar or length 1.
- uses a Series result's underlying array, to avoid issues around Series indexes in DataFrame construction.
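The index issue can be reproduced in plain pandas. The sketch below is not siuba's implementation; it only illustrates, under the assumption that the result is a length-1 Series with an unrelated index label, why passing the underlying array sidesteps index alignment. The names `res`, `aligned`, and `from_array` are made up for the example.

```python
import pandas as pd

# A length-1 Series whose index label (5) does not match the target row label (0).
res = pd.Series([1.5], index=[5])

# Passing the Series directly: pandas aligns on index labels, so row 0 has no match.
aligned = pd.DataFrame({"avg": res}, index=[0])

# Passing the underlying NumPy array: the value is placed positionally, with no alignment.
from_array = pd.DataFrame({"avg": res.to_numpy()}, index=[0])

print(aligned["avg"].isna().all())   # True: alignment produced a missing value
print(from_array["avg"].iloc[0])     # 1.5
```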
Small update for docs: Call.map_replace and cars data
This is a small release, designed to support the new siuba documentation.
Features
- added Call.map_replace method, which is like map_subcall but replaces subcalls with the result
- added to siuba.data: cars, cars_sql
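To make the "replaces subcalls with the result" idea concrete, here is a toy sketch on a made-up tree type. `Node`, its two methods, and `rename_x` are hypothetical stand-ins written for this example; siuba's real `Call` class and method signatures may differ.

```python
from dataclasses import dataclass
from typing import Callable, Tuple, Union


# Hypothetical stand-in for a call node; not siuba's Call class.
@dataclass(frozen=True)
class Node:
    func: str
    args: Tuple[Union["Node", int, str], ...] = ()

    def map_subcall(self, f: Callable[["Node"], object]) -> tuple:
        """Apply f to each child Node and return the results; self is unchanged."""
        return tuple(f(a) if isinstance(a, Node) else a for a in self.args)

    def map_replace(self, f: Callable[["Node"], object]) -> "Node":
        """Return a copy of this node whose child Nodes are replaced by f(child)."""
        return Node(self.func, self.map_subcall(f))


def rename_x(node: Node) -> Node:
    """Rewrite references to column 'x' as column 'y', recursing into children."""
    if node.func == "getattr" and node.args == ("x",):
        return Node("getattr", ("y",))
    return node.map_replace(rename_x)


expr = Node("add", (Node("getattr", ("x",)), 1))   # roughly: _.x + 1
new_expr = expr.map_replace(rename_x)              # roughly: _.y + 1
print(new_expr)
```

In this sketch `map_subcall` only hands back the transformed children, while `map_replace` rebuilds the parent around them, which is what makes recursive rewrites short to write.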
top_n, floor_date, custom sql joins, and full method spec
Fixes
- filter now preserves column order, rather than moving grouping columns to left (#205)
- symbolic representations now correctly align on keywords (#222)
Features
- sql supports custom join conditions via sql_on (#202)
- siuba.series.spec now includes all Series methods, even unsupported ones (#209)
- the spec is now also derived from the file siuba/series/spec.yml (#211)
- siu Symbolic is no longer falsey (#210)
- added new verb top_n (#222)
- added vector functions ceil_date and floor_date to siuba.experimental.datetime (#222)
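As a rough illustration of what flooring a date means, the sketch below rounds a single standard-library `datetime` down to the start of a unit. It is not the siuba implementation, which operates on pandas data; the helper name `floor_datetime` and its `unit` values are assumptions made for this example.

```python
from datetime import datetime


def floor_datetime(ts: datetime, unit: str) -> datetime:
    """Round ts down to the start of the given unit (hypothetical helper)."""
    if unit == "year":
        return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    if unit == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if unit == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {unit!r}")


stamp = datetime(2020, 3, 17, 14, 45, 30)
print(floor_datetime(stamp, "month"))   # 2020-03-01 00:00:00
print(floor_datetime(stamp, "hour"))    # 2020-03-17 14:00:00
```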
QA
- re-enabled testing of example jupyter notebooks (#206)
Add fct_lump prop argument, fix fast grouped summarize
fix if_else, remove psycopg2 dependency
Fix nest function to support pandas v1.0.0
Support for user defined functions (UDFs)
New Feature: support user defined functions (#146)
- Support for user defined functions (UDFs). Note that these require annotating the return type. For more on the theory behind these see ADR-003.
from siuba.siu import symbolic_dispatch
from pandas.core.groupby import SeriesGroupBy, GroupBy
from pandas import Series

@symbolic_dispatch(cls = Series)
def cummean(x):
    """Return a same-length array, containing the cumulative mean."""
    return x.expanding().mean()

@cummean.register(SeriesGroupBy)
def _cummean_grouped(x) -> SeriesGroupBy:
    grouper = x.grouper
    n_entries = x.obj.notna().groupby(grouper).cumsum()
    res = x.cumsum() / n_entries
    return res.groupby(grouper)

from siuba import _, mutate
from siuba.data import mtcars

# a pandas DataFrameGroupBy object
g_cyl = mtcars.groupby("cyl")

mutate(g_cyl, cumul_mean = cummean(_.mpg))
- Support for many methods in vector.py, using UDFs (#158)
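The register-by-type pattern in the UDF example resembles the standard library's `functools.singledispatch`. As an analogy only (this is not how siuba itself is written, and it ignores the return-type annotation requirement), the sketch below dispatches a cumulative mean on plain Python types. `Grouped` is a hypothetical stand-in for a grouped series.

```python
from functools import singledispatch
from itertools import accumulate


class Grouped:
    """Hypothetical stand-in for a grouped series: values plus parallel group keys."""

    def __init__(self, values, keys):
        self.values = list(values)
        self.keys = list(keys)


@singledispatch
def cummean(x):
    # Fallback for types with no registered implementation.
    raise NotImplementedError(f"cummean not implemented for {type(x).__name__}")


@cummean.register(list)
def _cummean_list(x):
    # Running total divided by the number of entries seen so far.
    return [total / n for n, total in enumerate(accumulate(x), start=1)]


@cummean.register(Grouped)
def _cummean_grouped(x):
    # Keep a separate running sum and count for each group key.
    sums, counts, out = {}, {}, []
    for value, key in zip(x.values, x.keys):
        sums[key] = sums.get(key, 0) + value
        counts[key] = counts.get(key, 0) + 1
        out.append(sums[key] / counts[key])
    return out


print(cummean([2, 4, 6]))                                    # [2.0, 3.0, 4.0]
print(cummean(Grouped([1, 3, 10, 20], ["a", "a", "b", "b"])))  # [1.0, 2.0, 10.0, 15.0]
```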
Bug Fixes
- Fix regression where .str wasn't being removed when processing siu expressions for SQL (#159)
- Grouped filter now preserves order
- Verbs now tested to preserve original index (d938ab3)
Tests
- Add many more versions of python and pandas to travis CI test matrix (#161)
Opt-in speedy support for grouped pandas
Features
- Implementation of fast mutate, filter, and summarize using CallTreeLocal (#134). For even just a couple thousand groups, the fast methods are close to optimal hand-written pandas, and the slow versions are almost 1000x slower :o.
- fixed current grouped pandas mutate to preserve row order (#139)
- laid down tests of all supported series methods, currently skipping SQL backends (but ready to go!)
- put up some very basic documentation (#145)
- wrote an ADR on the rationale for fast groupby (#135)
Note that CallTreeLocal has new options, allowing it to look up based on chained attributes (e.g. look for an entry named "dt.year"), and to override custom function calls.
I still need to finish support for user defined operations and some light siu refactoring.
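The chained-attribute lookup mentioned in the note can be pictured with a toy translator. Everything here is hypothetical (the `TRANSLATIONS` table, the `translate` function, and the SQL-like strings it emits); it shows only the idea of matching an entry such as "dt.year" as a whole, so that `dt` is consumed along with `year`, and does not reflect CallTreeLocal's actual API.

```python
# Hypothetical lookup table keyed by chained attribute names.
TRANSLATIONS = {
    "dt.year": lambda col: f"EXTRACT(year FROM {col})",
    "str.upper": lambda col: f"UPPER({col})",
    "mean": lambda col: f"AVG({col})",
}


def translate(col: str, attr_chain: list) -> str:
    """Translate attribute accesses on col, preferring the longest chained match."""
    expr = col
    i = 0
    while i < len(attr_chain):
        # Try the longest remaining chain first, so "dt.year" is matched as a unit.
        for j in range(len(attr_chain), i, -1):
            key = ".".join(attr_chain[i:j])
            if key in TRANSLATIONS:
                expr = TRANSLATIONS[key](expr)
                i = j
                break
        else:
            raise KeyError(f"no translation for {attr_chain[i]!r}")
    return expr


print(translate("date", ["dt", "year"]))   # EXTRACT(year FROM date)
print(translate("hp", ["mean"]))           # AVG(hp)
```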
Breaking changes
- Removed the rm_attr argument from CallTreeLocal, since converting subattrs like dt.year will consume dt anyway (can't imagine a situation where we'd want to keep it, and couldn't do that in the translator function)
Demo
from siuba.experimental.pd_groups import fast_mutate, fast_filter, fast_summarize
from siuba import *
from siuba.data import mtcars
g_cars = mtcars.groupby(['cyl', 'gear'])
fast_mutate(g_cars, hp_centered = _.hp - _.hp.mean())
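For comparison, the same group-wise centering can be hand-written in pandas with `transform`. This is a sketch of the kind of code the fast methods are measured against, not the code siuba generates; `cars` is a small made-up table standing in for mtcars, and `hp_centered` is an illustrative name.

```python
import pandas as pd

# Small made-up stand-in for mtcars, so the example runs without siuba.
cars = pd.DataFrame({
    "cyl": [4, 4, 6, 6],
    "gear": [4, 4, 3, 3],
    "hp": [90, 110, 100, 120],
})
g_cars = cars.groupby(["cyl", "gear"])

# transform("mean") broadcasts each group's mean back onto the original rows,
# so the subtraction lines up row by row and preserves row order.
hp_centered = cars["hp"] - g_cars["hp"].transform("mean")
print(hp_centered.tolist())   # [-10.0, 10.0, -10.0, 10.0]
```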