[Epic]: Complete ROW Format (Missing features) #1861
Labels
datafusion
Changes in the datafusion crate
development-process
Related to development process of DataFusion
enhancement
New feature or request
performance
Make DataFusion faster
Goal: a complete row implementation, fully used in pipeline breaker operators when possible.
Summary
TLDR: The key focus of this work is to speed up fundamentally row-oriented operations such as hash table lookups and comparisons (e.g. #2427).
Background
DataFusion, like many Arrow systems, is a classic "vectorized computation engine", which works quite well for many common operations. The following paper gives a good treatment of the various tradeoffs between vectorization and JIT compilation of query plans: https://db.in.tum.de/~kersten/vectorization_vs_compilation.pdf?lang=de
As mentioned in the paper, there are some fundamentally "row-oriented" operations in a database that are not typically amenable to vectorization. The "classics" are hash table updates in joins and hash aggregates, as well as comparing tuples in sort.
When operating with a row-based format, the per-tuple type dispatch overhead becomes quite important, so such operations are typically implemented using just-in-time (JIT) compilation or other unsafe mechanisms to minimize the overhead.
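To make the dispatch overhead concrete, here is a minimal sketch in safe Rust. It is not DataFusion's row format: `Scalar`, `tuples_equal`, `encode_row` and `count_groups` are hypothetical names chosen for illustration, and the encoding assumes a fixed schema with no nulls and no variable-length fields. It contrasts comparing tuples field by field (a type match per field, per tuple) with encoding each tuple once into a byte row, after which hashing and equality are plain byte operations.

```rust
use std::collections::HashMap;

/// Hypothetical dynamically typed value, standing in for one field of a tuple.
#[derive(Debug, Clone, PartialEq)]
pub enum Scalar {
    Int64(i64),
    Int32(i32),
    Boolean(bool),
}

/// Field-by-field comparison: every field of every tuple pays for a type match.
pub fn tuples_equal(a: &[Scalar], b: &[Scalar]) -> bool {
    a.len() == b.len()
        && a.iter().zip(b).all(|(x, y)| match (x, y) {
            (Scalar::Int64(l), Scalar::Int64(r)) => l == r,
            (Scalar::Int32(l), Scalar::Int32(r)) => l == r,
            (Scalar::Boolean(l), Scalar::Boolean(r)) => l == r,
            _ => false,
        })
}

/// Encode a tuple into a contiguous byte row (assumption: all tuples share
/// one schema, so equal rows imply equal tuples). The type dispatch happens
/// once here, not on every later comparison or hash.
pub fn encode_row(tuple: &[Scalar]) -> Vec<u8> {
    let mut row = Vec::new();
    for value in tuple {
        match value {
            Scalar::Int64(x) => row.extend_from_slice(&x.to_le_bytes()),
            Scalar::Int32(x) => row.extend_from_slice(&x.to_le_bytes()),
            Scalar::Boolean(x) => row.push(*x as u8),
        }
    }
    row
}

/// Group-by-style counting keyed on the encoded row: hash table lookups
/// hash and compare raw bytes without looking at column types.
pub fn count_groups(tuples: &[Vec<Scalar>]) -> HashMap<Vec<u8>, usize> {
    let mut counts = HashMap::new();
    for tuple in tuples {
        *counts.entry(encode_row(tuple)).or_insert(0) += 1;
    }
    counts
}
```

A production row format would also need null handling, variable-length data and, for sorting, an order-preserving encoding. The sketch only shows why moving the type dispatch out of the per-tuple comparison path helps.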
@yjshen added initial support for JIT'ing in #1849, and it currently lives in https://github.com/apache/arrow-datafusion/tree/master/datafusion/jit. He also added partial support for aggregates in #2375.
This ticket tracks the remaining work to fully support row formats, including JIT'ing.
Getters and setters
date64 as an example.
Formats
Hook into execution (mainly the pipeline-breakers)
row_hash.rs and hash.rs (remove duplication) #2723
Cleanups
JIT