The polars-row subcrate provides the row encoding that is used by Polars. It is currently in a very bare state and should be improved to provide better possibilities in sorting, joins and the new streaming engine.
Here is a list of improvements that I would like to look into:
Improve pl.List encoding with continuation tokens instead of variable length encoding. This allows empty child encoding, removes the need for an intermediate buffer, and massively reduces the amount of space needed by the row encoding. I do need to verify that it fully works 😅. (perf: More efficient row encoding for pl.List #19907)
Properly implement Dictionary encoding (to be used by pl.Enum and pl.Categorical), using the method described by the arrow-row people. We need to investigate how to roundtrip this efficiently (e.g. with a bidirectional HashMap). We implemented our own encoding that does not rely on the tricks discovered by the arrow people.
Implement optimizations for BinaryView so that it is encoded similarly to Dictionary when the cardinality of views is low. We need to find a way to estimate this cardinality. It might also be worth considering the average length of a view. For instance, this is probably not worth it if all views are inlinable anyway.
Column encoding so that ScalarColumn and PartedColumn encoding becomes much cheaper.
Reduce Boolean encoding to at most one byte. (perf: Half the size of Booleans in row encoding #19927)
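To make the continuation-token idea for pl.List more concrete, here is a minimal sketch. It is not the code from #19907: the token values `LIST_END` and `LIST_CONTINUE`, the function names, and the restriction to non-null `u32` children are all assumptions made for illustration. Each element is preceded by a continuation byte and the list is closed by a terminator byte, so no length prefix or intermediate buffer is needed, and an empty list costs a single byte.

```rust
/// Hypothetical token values, chosen for this sketch (not taken from polars-row).
/// The terminator is smaller than the continuation token so that a list that is
/// a prefix of another list sorts first under bytewise comparison.
const LIST_END: u8 = 0x00;
const LIST_CONTINUE: u8 = 0x01;

/// Append the encoding of one list of `u32` values directly to `out`.
/// Children are written as fixed-width big-endian bytes, which keeps
/// bytewise order equal to numeric order.
fn encode_list(values: &[u32], out: &mut Vec<u8>) {
    for v in values {
        out.push(LIST_CONTINUE);
        out.extend_from_slice(&v.to_be_bytes());
    }
    out.push(LIST_END);
}

/// Decode one list from the front of `row`.
/// Returns the values and the number of bytes consumed, or `None` if the
/// input is truncated or contains an unknown token.
fn decode_list(row: &[u8]) -> Option<(Vec<u32>, usize)> {
    let mut values = Vec::new();
    let mut pos = 0;
    loop {
        match *row.get(pos)? {
            LIST_END => return Some((values, pos + 1)),
            LIST_CONTINUE => {
                let mut bytes = [0u8; 4];
                bytes.copy_from_slice(row.get(pos + 1..pos + 5)?);
                values.push(u32::from_be_bytes(bytes));
                pos += 5;
            }
            _ => return None,
        }
    }
}
```

A real implementation would also have to handle nulls, descending order and nested child encodings, which this sketch leaves out.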
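For the Dictionary item, the roundtrip question can be illustrated with a small bidirectional mapping. This is only a sketch under stated assumptions: `BiDict` and its methods are hypothetical names, keys are assigned in insertion order, and the key is written as four big-endian bytes. It is neither the arrow-row method nor the encoding that ended up in Polars.

```rust
use std::collections::HashMap;

/// Hypothetical bidirectional dictionary: value -> key through a HashMap,
/// key -> value through a Vec indexed by key.
#[derive(Default)]
struct BiDict {
    to_key: HashMap<String, u32>,
    to_value: Vec<String>,
}

impl BiDict {
    /// Return the encoded key for `value`, assigning a new key on first sight.
    fn encode(&mut self, value: &str) -> [u8; 4] {
        if let Some(&key) = self.to_key.get(value) {
            return key.to_be_bytes();
        }
        let key = self.to_value.len() as u32;
        self.to_key.insert(value.to_owned(), key);
        self.to_value.push(value.to_owned());
        key.to_be_bytes()
    }

    /// Map encoded bytes back to the original value, if the key is known.
    fn decode(&self, bytes: [u8; 4]) -> Option<&str> {
        let key = u32::from_be_bytes(bytes) as usize;
        self.to_value.get(key).map(String::as_str)
    }
}
```

Note that keys assigned this way sort by first appearance, not by the lexical order of the values, so this shape only fits orderings that are defined on the keys themselves.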
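One possible way to estimate the cardinality of a BinaryView column is to sample it. The sketch below is an assumption-laden illustration, not a proposal for the actual heuristic: `ViewStats`, `sample_stats`, `prefer_dictionary`, the stride-based sampling and the `max_ratio` threshold are all hypothetical. The only external fact it relies on is that Arrow views store values of up to 12 bytes inline.

```rust
use std::collections::HashSet;

/// Values of at most this many bytes are stored inline in an Arrow view.
const MAX_INLINE_LEN: usize = 12;

/// Hypothetical summary of a sampled column.
struct ViewStats {
    /// Distinct sampled values divided by the number of sampled values.
    distinct_ratio: f64,
    /// True if every sampled value would fit inline in its view.
    all_inlinable: bool,
}

/// Inspect every `step`-th value and collect rough statistics.
/// Returns `None` for an empty column.
fn sample_stats(values: &[&[u8]], step: usize) -> Option<ViewStats> {
    let mut seen = HashSet::new();
    let mut sampled = 0usize;
    let mut all_inlinable = true;
    for v in values.iter().step_by(step.max(1)) {
        seen.insert(*v);
        sampled += 1;
        all_inlinable &= v.len() <= MAX_INLINE_LEN;
    }
    if sampled == 0 {
        return None;
    }
    Some(ViewStats {
        distinct_ratio: seen.len() as f64 / sampled as f64,
        all_inlinable,
    })
}

/// Prefer a dictionary-like encoding only when values repeat often
/// and at least some of them are too long to be inlined.
fn prefer_dictionary(stats: &ViewStats, max_ratio: f64) -> bool {
    !stats.all_inlinable && stats.distinct_ratio <= max_ratio
}
```

A strided sample is only a rough estimate, and it can be badly misled by sorted or clustered data, so the threshold would need to be validated on real workloads.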