Rust Polars 0.45.0

@github-actions github-actions released this 08 Dec 11:16
· 207 commits to main since this release
58a38af

💥 Breaking changes

  • Remove dedicated sink_(parquet/ipc)_cloud functions (#20164)
  • Experimental cloud write support (#20129)

🚀 Performance improvements

  • Add fast paths for series.arg_sort and dataframe.sort (#19872)
  • Utilize the RangedUniqueKernel for Enum/Categorical (#20150)
  • Reduce memory copy when scanning from Python objects (#20142)
  • Don't instantiate validity mask when unneeded in Parquet (#20149)
  • Expand more filters (#20022)
  • Cache the DataFrame schema in get_column_index (#20021)
  • Reduce the size of row encoding UTF-8 (#19911)
  • Memoize duplicates in rolling-gb-dyn (#19939)
  • More efficient row encoding for pl.List (#19907)
  • Halve the size of Booleans in row encoding (#19927)
  • Rolling 'iter_lookbehind' breeze through duplicates (#19922)
  • Initially trim leading and trailing filtered rows (#19850)
  • Increase default async thread count for low core count systems (#19829)
  • Move row group decode off async thread for local streaming parquet scan (#19828)
  • Support use of Duration in to_string, ergonomic/perf improvement, tz-aware Datetime bugfix (#19697)
  • Improve DataFrame.sort().limit/top_k performance (#19731)
  • Improve cloud scan performance (#19728)
  • Fix quadratic 'with_columns' behavior (#19701)
  • Improve hive partition pruning with datetime predicates from SQL (#19680)
  • Allow for arbitrary skips in Parquet Dictionary Decoding (#19649)
  • Reorder conditions in is_leap_year (#19602)
  • Rechunk in DataFrame.rows if needed (#19628)
  • Dispatch Parquet Primitive PLAIN decoding to faster kernels when possible (#19611)
  • Use faster iteration in 'starts_with'/'ends_with' (#19583)
  • Branchless Parquet Prefiltering (#19190)
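
The `is_leap_year` entry (#19602) does not say which ordering was adopted. As a rough illustration only, and not the Polars implementation, the sketch below shows one common ordering of the Gregorian leap-year conditions: the divisible-by-4 test comes first, so roughly three quarters of all years are rejected after a single check. The function name and signature here are assumptions for this example.

```rust
/// Gregorian leap-year check (illustrative sketch, not the Polars source).
///
/// The `% 4` test runs first: it is false for about 75% of years, so
/// short-circuiting skips the century checks in the common case.
fn is_leap_year(year: i32) -> bool {
    year % 4 == 0 && (year % 100 != 0 || year % 400 == 0)
}
```

Whatever the ordering, the result must be the same for every year; only the amount of work in the common case changes.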

✨ Enhancements

  • Retry with reloaded credentials on cloud error (#20185)
  • Support reading Enum dtype from csv (#20188)
  • Allow sorting of lists and arrays (#20169)
  • Add maintain_order parameter to joins (#20026)
  • Allow for to_datetime / strftime to automatically parse dates with single-digit hour/minute/second (#20144)
  • Experimental cloud write support (#20129)
  • Allow setting and reading custom schema-level IPC metadata (#20066)
  • Add optimized row encoding for Decimals (#20050)
  • Add drop_nans method to DataFrame and LazyFrame (#20029)
  • Catch use of 'polars' in to_string for non-Duration dtypes and raise an informative error (#19977)
  • Add AhoCorasick backed 'find_many' (#19952)
  • Speed up starts_with for small prefixes (#19904)
  • Auto-enable hive partitioning if hive_schema was given (#19902)
  • Add pl.concat_arr to concatenate columns into an Array column (#19881)
  • Support both "iso" and "iso:strict" format options for dt.to_string (#19840)
  • Add rounding for Decimal type (#19760)
  • Improved array arithmetic support (#19837)
  • Raise informative error on Unknown unnest (#19830)
  • Support use of Duration in to_string, ergonomic/perf improvement, tz-aware Datetime bugfix (#19697)
  • Allow specification of chunk_size on LazyCsvReader.read_options (#19819)
  • Add an is_literal method to expression meta namespace (#19773)
  • A different approach to warning users of fork() issues with Polars (#19197)
  • Add dylib (#19759)
  • Add IPC source node for new streaming engine (#19454)
  • Implement max/min methods for dtypes (#19494)
  • Improve hive partition pruning with datetime predicates from SQL (#19680)
  • Parallel IPC sink for the new streaming engine (#19622)
  • Add SQL support for RIGHT JOIN, fix an issue with wildcard aliasing (#19626)
  • Add show_graph to display a GraphViz plot for expressions (#19365)

🐞 Bug fixes

  • Don't trigger length check in array construction (#20205)
  • Allow row encoding for 32-bit architectures (e.g. WASM) (#20186)
  • Properly project unordered column in parquet prefiltered (#20189)
  • Stop the CSV SIMD cache if an EOL char is hit (#20199)
  • Estimated size for object (#20191)
  • Respect parallel argument in parquet (#20187)
  • Only validate UTF-8 for selected items when all below len 128 (#20183)
  • Serialize categories of Enum in arrow metadata (#20181)
  • Don't use RLE encoding for Parquet Boolean (#20172)
  • Invalid bitwise_xor for ScalarColumn (#20140)
  • Add temporal feature gate in is_elementwise_top_level (#20177)
  • Column name mismatch or not found in Parquet scan with filter (#20178)
  • Raise if apply returns different types (#20168)
  • Deal with masked out list elements (#20161)
  • Fix index out of bounds in uniform_hist_count (#20133)
  • Implement arg_sort for Null series (#20135)
  • Handle slice pushdown in PythonUDF GroupBy (#20132)
  • Check shape for *_horizontal functions (#20130)
  • Properly coerce types in lists (#20126)
  • Incorrect aggregation of empty groups after slice (#20127)
  • DataFrame .get_column after drop_in_place (#20120)
  • Subtraction with underflow on empty FixedSizeBinaryArray (#20109)
  • Materialize smallest dyn ints to use feature gate for i8/i16 (#20108)
  • Return null instead of 0. for rolling_std when window contains a single element and ddof=1 and there are nulls elsewhere in the Series (#20077)
  • Only slice after sort when slice is smaller than frame length (#20084)
  • Preserve Series name in __rpow__ operation (#20072)
  • Allow nested is_in() in when()/then() for full-streaming (#20052)
  • Fix datetime cast behavior for pre-epoch times (#19949)
  • Improve hist binning around breakpoints (#20054)
  • Fix invalid len due to projection pushdown selection of scalar (#20049)
  • Fix empty scalar agg type (#20051)
  • Improve binning in Series.hist with bin_count when all values are the same (#20034)
  • Less intrusive forking warnings (#20032)
  • Reading nullable sliced / masked Categoricals from Parquet (#20024)
  • Regression in hist panicking on out of bounds index (#20016)
  • Fix starts_with out of bounds (#20006)
  • Fix incorrect column order for parquet scan with hive columns in file (#19996)
  • Incorrectly gave list.len() for masked-out rows (#19999)
  • Bug fix in existing fast path for sorted series (#20004)
  • Incorrect collect_schema() for fill_null() after an aggregation expression in group-by context (#19993)
  • Fix Decimal type fill_null (#19981)
  • Fix panic on schema merge for prefiltering (#19972)
  • Fix lazy frame join expression (#19974)
  • Fix gather_every for Scalar (#19964)
  • Toggle 'fast_unique' on new_from_index (#19956)
  • Raise proper error message when too small interval is passed to datetime_range (#19955)
  • Fix scalar object (#19940)
  • Raise InvalidOperationError for invalid float to decimal casts (e.g. Inf, NaN) (#19938)
  • Fix panic with combination of hive and parquet prefiltering (#19905)
  • Fix panic when joining with empty frame (debug only) (#19896)
  • Fix incorrect result from inequality filter after join on LazyFrame (#19898)
  • Misleading ShapeError error message on dataframe creation (#19901)
  • Fix panic with empty delta scan, or empty parquet scan with a provided schema (#19884)
  • Ensure type object of inputs for cached any-value conversion functions are kept alive (#19866)
  • Fix panic using scan_parquet().with_row_index() with hive partitioning enabled (#19865)
  • Improve histogram bin logic (#18761)
  • Raise informative error instead of panicking for list arithmetic on some invalid dtypes (#19841)
  • Properly handle Zero-Field Structs in row encoding (#19846)
  • Incorrect explode schema for LazyFrame.explode() (#19860)
  • Ensure List element truncation ellipses respect ASCII* table formats (#19835)
  • Validate subnodes in validate IR (#19831)
  • Raise if merge non-global categoricals in unpivot (#19826)
  • Type hints for window_size incorrectly included timedelta in some rolling functions (#19827)
  • Don't panic if column not found (#19824)
  • Fix gather of Scalar null + idx w/ validity (#19823)
  • Fix object chunked gather (#19811)
  • Fix inconsistency between code and comment (#19810)
  • Fix filter scalar nulls (#19786)
  • Altair tooltip was being incorrectly applied to plots which did not accept it (#19789)
  • Fix scanning google cloud with service account credentials file (#19782)
  • Fix incorrect filter after right-join on LazyFrame (#19775)
  • Fix incorrect lazy schema for explode on array columns (#19776)
  • Fix incorrect lazy schema for aggregations (#19753)
  • Fix validation for inner and left join when join_nulls is unflagged (#19698)
  • SQL ELSE clause should be implicitly NULL when omitted (#19714)
  • In group_by_dynamic, period and every were getting applied in reverse order for the window upper boundary (#19706)
  • Only allow list.to_struct to be elementwise when width is fixed (#19688)
  • Make Array arithmetic ops fully elementwise (#19682)
  • Update line-splitting logic in batched CSV reader (#19508)
  • Fix incorrect lazy schema for explode() in agg() (#19629)
  • Fix filter incorrectly pushed past struct unnest when unnested column name matches upper column name (#19638)
  • Ensure mean_horizontal raises on non-numeric input (#19648)
  • Reorder conditions in is_leap_year (#19602)
  • Copy height in .vstack() for empty dataframes (#19641) (#19642)
  • Run join type coercion with correct schemas active (#19625)
  • Correct wildcard and input expansion for some more functions (#19588)
  • Allow .struct.with_fields inside list.eval (#19617)
  • Sortedness was incorrectly being preserved in dt.offset_by when offsetting by non-constant durations in the timezone-naive case (#19616)
  • Fix incorrect scan_parquet().with_row_index() with non-zero slice or with streaming collect (#19609)
  • Fix mask and validity confusion in Parquet String decoding (#19614)
  • Parquet decoding of nested dictionary values (#19605)
  • Do not attempt to load default credentials when credential_provider is given (#19589)
  • Fix gather len in group-by state (#19586)
  • Added input validation for explode operation in the array namespace (#19163)
  • Improve error message (#19546)
  • Fix predicate pushdown into inequality joins (#19582)
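
The `rolling_std` fix (#20077) rests on a general statistical point: with `ddof=1` the divisor is `n - 1`, so a window holding a single non-null value has no defined sample standard deviation. The sketch below is a minimal standalone illustration using only the standard library. The function `window_std` and its use of `Option<f64>` to model nulls are assumptions made for this example, not Polars APIs.

```rust
/// Standard deviation of the non-null values in one window, with `ddof`
/// delta degrees of freedom (hypothetical helper, not a Polars function).
///
/// Returns `None` (null) when the divisor `n - ddof` would not be positive,
/// e.g. a single non-null value with `ddof = 1`.
fn window_std(window: &[Option<f64>], ddof: usize) -> Option<f64> {
    // Keep only the non-null values of the window.
    let values: Vec<f64> = window.iter().flatten().copied().collect();
    let n = values.len();
    if n <= ddof {
        return None;
    }
    let mean = values.iter().sum::<f64>() / n as f64;
    let sum_sq: f64 = values.iter().map(|v| (v - mean).powi(2)).sum();
    Some((sum_sq / (n - ddof) as f64).sqrt())
}
```

Under these assumptions, `window_std(&[Some(3.0), None], 1)` yields `None` rather than `Some(0.0)`, while the same window with `ddof = 0` yields `Some(0.0)`.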

📖 Documentation

  • Add more Rust examples to User Guide (#20194)
  • Expand plotting docs (#19719)
  • Fix Rust examples in user guide (#20075)
  • Update by param description for rolling_*_by functions (#19715)
  • Fix inconsistency between code and comment (#20070)
  • Correct supported compression formats (#20085)
  • Specify strictness in cast (#20067)
  • Fix broken links to user guide (#19989)
  • Minor doc fixes and cleanup (#19935)
  • Complete parameters description and add an example for clip() (#19875)
  • Fix some warnings during docs build (#19848)
  • Change dprint config (#19747)
  • Fix formatting of nested list (#19746)
  • Add meta.is_column to API docs (#19744)
  • Fix join API reference links (#19745)
  • Revise and rework user-guide/expressions (#19360)
  • Update Excel page of user guide to refer to fastexcel as the default engine (#19691)
  • Alter examples for round_sig_figs to make behaviour clearer (#19667)
  • Assorted fixes to Rust API docs (#19664)
  • Improve replace and replace_all docstring explanation of the "$" character with reference to capture groups (vs use as a literal) (#19529)

📦 Build system

  • Upgrade sqlparser-rs from version 0.49 to 0.52 (#20110)
  • Bump memmap2 to version 0.9 (#20105)
  • Bump object_store to version 0.11 (#20102)
  • Bump fs4 to version 0.12 (#20101)
  • Fix path to polars-dylib crate in workspace (#20103)
  • Bump thiserror to version 2 (#20097)
  • Bump atoi_simd to version 0.16 (#20098)
  • Bump chrono-tz to 0.10 (#20094)
  • Update Rust dependency ndarray to 0.16 (#20093)
  • Bump Rust toolchain to nightly-2024-11-28 (#20064)
  • Pin maturin (#20063)
  • Use public windows runners in python release (#19982)
  • Add windows-aarch64 to python binaries (#19966)

🛠️ Other improvements

  • Deprecate ddof parameter for correlation coefficient (#20197)
  • Move Bitwise aggregations to FunctionExpr (#20193)
  • Add ragged lines test (#20182)
  • Remove dedicated sink_(parquet/ipc)_cloud functions (#20164)
  • Move new-streaming parquet and CSV sources to under io_sources/ (#20160)
  • Move horizontal methods to polars-ops (#20134)
  • Remove useless SeriesTrait::get implementations (#20136)
  • Add a bunch more automated row encoding sortedness tests (#20056)
  • Replace custom PushNode trait with Extend (#20107)
  • Update AWS doc dependencies (#20095)
  • Move cast from polars-arrow to polars-compute (#19967)
  • Implement nested row encoding / decoding (#19874)
  • Remove use of cast in ArrowArray::new (#19899)
  • Switch back to PyO3 0.22 (#19851)
  • Make chunked gathers generic over chunk bit width (#19856)
  • Add proper tests for row encoding (#19843)
  • Add ToField context for common args (#19833)
  • Add new streaming CSV source (#19694)
  • Add BytesIndexMap and use in RowEncodedHashGrouper (#19817)
  • Use HashKeys abstraction (#19785)
  • Migrate polars-expr AggregationContext to use Column (#19736)
  • Add InMemoryJoin to new-streaming engine (#19741)
  • Use Column for the {try,}_apply_columns{_par,} functions on DataFrame (#19683)
  • Remove more @scalar-opt (#19666)
  • Move Series bitops to std::ops::Bit... (#19673)
  • Mark test_parquet.py test_dict_slices as slow (#19675)
  • Get Column into polars-expr (#19660)
  • Remove unused file (#19661)
  • Delegate feature flags for polars-stream (#19659)
  • Streamline internal SQL join condition processing (#19658)
  • Factor out logic for re-use by new streaming CSV source (#19637)
  • Configure grouped Dependabot updates (#19604)
  • Share source token between all sender tasks of source nodes in new-streaming engine (#19593)
  • Fix PyO3 error in CI (#19545)
  • Update nightly compiler version (#19590)
  • Added input validation for explode operation in the array namespace (#19163)
  • Remove MutableStructArray (#19587)
  • Fix lint (#19584)
  • Add a Column::Partitioned variant (#19557)
  • Move to fast-float2 (#19578)
  • Only run remote bench on rust changes (#19581)

Thank you to all our contributors for making this release possible!
@3tilley, @DzenanJupic, @MarcoGorelli, @TNieuwdorp, @YichiZhang0613, @alexander-beedie, @barak1412, @braaannigan, @cmdlineluser, @coastalwhite, @corwinjoy, @dependabot, @dependabot[bot], @eitsupi, @engylemure, @etiennebacher, @flowlight0, @gab23r, @henryharbeck, @iharthi, @iliya-malecki, @ion-elgreco, @itamarst, @jackxxu, @janpipek, @jqnatividad, @letkemann, @lukapeschke, @lukemanley, @max-muoto, @mcrumiller, @mhogervo, @nameexhaustion, @orlp, @ptiza, @ritchie46, @rodrigogiraoserrao, @siddharth-vi, @sn0rkmaiden, @stijnherfst, @stinodego, @wence- and @wsyxbcl