-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimized RecordBatch
for constant columns
#1248
Comments
I'm for this approach. In fact we can allow several other types that are memory efficient:
basically |
Another thing to contemplate is support for RLE (aka store runs of repeated values) A --> (A, 3) |
Tangentially related, since I have been investigating something related: the I think that the two requirements for something to be consistent with the arrow format are:
I do not think RLE and other encodings fulfill these requirements (not sure whether this is an issue, though). |
👍 I agree and I think keeping a separation between what is generally useful for users of Arrow libraries and what is more likely to be more narrowly useful to DataFusion is important. For example, I think it would be fine to add specialized constant column or RLE support in DataFusion, but that likely doesn't belong in an arrow implementation |
maybe something like this just to see how much change is needed |
I had a further thought on this and believe having
|
I think instead of returning RLE can be something we use to further compress arrow record batches, but I think it is an overkill for constant columns because this use-case can already be covered by |
Sequence is one example |
Recently I've tried to make a stab at this. As |
I would indeed expect it to be a substantial change across the codebase. I don't have any magic suggestions to reduce the potential impact |
I think we have ColumnarValue and it seems to work well -- I am not sure this idea is worth pursuing anymore so closing for now |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Certain record batches have columns that entirely the same value -- for example the partitioning column in #1141
In that case the
ListingProvider
adds a column with a constant value based on the path in the file systemSo a file in a path like
/foo/date=2021-11-5/bar.parquet
would end up producing one or more RecordBatches with adate
column that had the value2021-11-5
.At the moment, to represent such a column using an Arrow array we have to repeat the value
2021-11-5
over and over again which is inefficient both in space storage as well as in processing time.Describe the solution you'd like
It seems that having a sort of RecordBatch that can contain constant columns represented by a scalar value (like the C++ ExecBatch) is a pretty nice abstraction that will help adding tons of optimizations. I didn't find an issue summarizing this discussion and conclusion, should we create one?
According to @westonpace , in C++ there is
ExecBatch
in addition toRecordBatch
for this purpose.ExecBatch
is used within the exec plan and it's pretty much identical to record batch except:Note DataFusion already has a Array/Scalar notion in
ColumnarValue
https://github.com/apache/arrow-datafusion/blob/a1c794cec233f7fe34f34c7d64f529625a507669/datafusion/src/physical_plan/mod.rs#L388-L395 but its use is confined to expression evaluation at the moment.Perhaps we could build something like
DFRecordBatch
akin toDFSchema
(aka wrap the arrowRecordBatch
with methods that allow the columns to beColumnarValue
)Describe alternatives you've considered
We could use a
Dictionary(Int8, Utf8)
array to do better (where each repeated string value only requires a singlei8
(8 bytes) but it still requires ~ 8 bytes * the number of total rows in theRecordBatch
Another classic approach I have seen for this kind of problem is to use RLE encoding so that these “almost entirely constant columns” become a very small number of RLE runs.
I think RLE has been proposed for Arrow several times before, but it doesn’t have the “access array[i]” in constant time” property required for arrays
Additional context
(this is my transcription of some information in an ASF slack thread: https://the-asf.slack.com/archives/C01QUFS30TD/p1634810772007000)
cc @houqp @rdettai @Dandandan
The text was updated successfully, but these errors were encountered: