Add a RecordBatch::split to split large batches into a set of smaller batches #343
Comments
We also have |
Indeed -- I think that is how @tustvold implemented it:

    use arrow::record_batch::RecordBatch;

    fn split_batch(sorted: &RecordBatch, batch_size: usize) -> Vec<RecordBatch> {
        // Number of output batches, rounded up so a final partial batch is kept
        let batches = (sorted.num_rows() + batch_size - 1) / batch_size;
        // Split the sorted RecordBatch into multiple smaller batches
        (0..batches)
            .map(|batch_idx| {
                let offset = batch_idx * batch_size;
                // The last batch may hold fewer than batch_size rows
                let length = batch_size.min(sorted.num_rows() - offset);
                let columns = (0..sorted.num_columns())
                    .map(|column_idx| sorted.column(column_idx).slice(offset, length))
                    .collect();
                RecordBatch::try_new(sorted.schema(), columns).unwrap()
            })
            .collect()
    }
Reminder that slice currently doesn't work (or doesn't work correctly) for lists, so we have to be careful with how we use it. Sadly, it's a limitation that @jorgecarleitao and I have encountered previously.
#460 may be a better plan (RecordBatch::slice())
Given we now have slice in #460, I don't think this adds much anymore.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Sometimes it is advantageous to split one large RecordBatch into smaller batches for processing (for example, processing the multiple smaller RecordBatches in parallel).
So instead of 1 RecordBatch with 1M rows, we could have 100 RecordBatches with 10,000 rows each that could be processed in parallel.
@tustvold implemented such a function in https://github.com/apache/arrow-datafusion/pull/379/files
Describe the solution you'd like
Port the split_batch function into RecordBatch::split(batch_size) or something similar and add appropriate tests.
cc @jorgecarleitao @nevi-me
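For illustration, here is a minimal sketch of just the offset arithmetic behind such a split, using only the standard library. The helper name `split_ranges` is hypothetical and not part of arrow. Each `(offset, length)` pair it returns is the kind of argument one would presumably pass to a slice call (per column as in `split_batch`, or to the `RecordBatch::slice` from #460).

```rust
/// Compute the `(offset, length)` pair of every chunk when `num_rows` rows
/// are split into chunks of at most `batch_size` rows.
///
/// Hypothetical helper for illustration only; it is not an arrow API.
fn split_ranges(num_rows: usize, batch_size: usize) -> Vec<(usize, usize)> {
    // A zero batch size would never make progress, so reject it up front.
    assert!(batch_size > 0, "batch_size must be non-zero");
    (0..num_rows)
        .step_by(batch_size)
        // The final chunk is shortened to whatever rows remain.
        .map(|offset| (offset, batch_size.min(num_rows - offset)))
        .collect()
}
```

With the numbers from the description, `split_ranges(1_000_000, 10_000)` yields 100 ranges of 10,000 rows each. An empty input yields no ranges, and a row count that is not a multiple of `batch_size` yields one shorter final range.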