Implement RecordBatch::concat #461
Comments
Wouldn't it make more sense to add this to |
@jorgecarleitao I think that comes down to "is concat a construction operation or a compute operation"? It could probably be classified as either. I don't have a strong opinion about where the "concat RecordBatch" function goes. The only other prior art of using We could perhaps add a |
I'd lean towards Similar to a I think this would aid discoverability, in the sense that a person asking "what can I do with a record batch?" would look at |
I am convinced by this argument 👍
Let me try to explain my reasoning atm. All methods exposed on
Generally, I consider non-parallel iterations over a record to be an anti-pattern, since parallelism over columns is one of the hallmarks of columnar formats. Imo the decision of how to iterate over columns does not belong to We do have some methods in compute that The reasoning to have it in |
I think a lean arrow-core crate would be beneficial Wouldn't the same be achieved by |
I think there is a tradeoff between "making the arrow crate accessible (easily usable) for newcomers" and "making it hard for users to write non optimal code" From where I sit there is nothing about adding Splitting |
BTW |
@alamb will it be available in pyarrow? Right now it's kinda PITA to concat RecordBatch instances there. It goes like: from batches -> to table -> to batches again.
Hi @pkit 👋 pyarrow is developed as part of the Arrow monorepo (https://github.com/apache/arrow) and uses the C++ implementation of Arrow, not the Rust implementation that lives in this repo. I recommend asking there.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
When implementing an operator that needed to check for partitions across RecordBatch boundaries (in https://github.com/influxdata/influxdb_iox/pull/1733), I found myself wanting to concatenate record batches. In addition, DataFusion also needed this code: https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/coalesce_batches.rs#L232
Describe the solution you'd like
Propose adding a function like
Porting the implementation from DataFusion is probably a good place to start and then adding tests / comments.
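To make the requested behaviour concrete, here is a minimal, standard-library-only sketch of column-wise batch concatenation. It is not the arrow crate's API and not the DataFusion code: `ToyBatch` and this `concat_batches` function are hypothetical stand-ins, and the choice to pass the schema explicitly (so that an empty input still produces a well-formed, zero-row result) is an assumption made for illustration.

```rust
/// Toy stand-in for a record batch: named columns of equal length.
/// Hypothetical type for illustration only; the real `RecordBatch`
/// holds typed Arrow arrays and a shared schema.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyBatch {
    pub schema: Vec<String>,
    pub columns: Vec<Vec<i64>>,
}

impl ToyBatch {
    /// Number of rows, taken from the first column (0 if there are no columns).
    pub fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |c| c.len())
    }
}

/// Concatenate `batches` row-wise into a single batch with the given schema.
/// Returns an error if any batch does not match `schema`.
pub fn concat_batches(schema: &[String], batches: &[ToyBatch]) -> Result<ToyBatch, String> {
    // Pre-size each output column to avoid repeated reallocation.
    let total_rows: usize = batches.iter().map(|b| b.num_rows()).sum();
    let mut columns: Vec<Vec<i64>> = schema
        .iter()
        .map(|_| Vec::with_capacity(total_rows))
        .collect();

    for (i, batch) in batches.iter().enumerate() {
        // Every input must carry the same column names and column count.
        if batch.schema.as_slice() != schema || batch.columns.len() != schema.len() {
            return Err(format!("batch {} does not match the expected schema", i));
        }
        // Append column by column; each column is independent of the others.
        for (out, col) in columns.iter_mut().zip(&batch.columns) {
            out.extend_from_slice(col);
        }
    }

    Ok(ToyBatch {
        schema: schema.to_vec(),
        columns,
    })
}
```

Because each output column is built only from the matching input columns, the inner loop could in principle be run per column in parallel, which touches on the construction-versus-compute question raised in the comments.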