Split up Arrow Crate #2594
Comments
I like this idea, for what it is worth. 👍 |
I've started work on this with #2693. I think the final split will likely end up being different from what was initially proposed, based on what components can easily be separated. I next plan to split out array data, followed by the arrays themselves. This should then allow splitting out some of the heavier kernels, e.g. sort, compare, cast, etc. |
If the history with the datafusion split is any indication, this work is likely to end up generating lots of PRs. You can see how @jimexist broke that down and we tracked it in apache/datafusion#1750 -- perhaps something similar could be applied here. I am happy to try and review these mechanical PRs quickly so we can get the project done sooner. |
How feasible would it be to move the type definitions (such as |
I don't see an obvious reason why that would not be possible, I'm not sure how generally useful the types will be without the array definitions though... |
The |
@jorgecarleitao Would |
Hey, Thanks for the ping! I think it would not benefit arrow2 directly right now as it has different declarations for With that said, imo it is still a good design - there are systems that only require I think that datafusion's logical plans could also only depend on types, but I could be wrong (it depends on how List scalars are represented there?). |
List scalars are represented as |
I've created #2711 which splits out the schema definitions into a crate called arrow-schema. I thought this was more clearly the logical types than something called arrow-types. PTAL 😄 |
As a downstream user of Arrow, one of the things we find is that we need to fork Arrow-ecosystem crates to quickly integrate patches for missing features or bugs. One thing I'm dreading is having to do that with an exploded Arrow crate: having to fork half a dozen Arrow packages, Parquet, all of the Datafusion crates, ..., seems like it'll be a royal pain. |
Hi @maxburke, the intention is to follow the work that was already performed for DataFusion, and it would not involve splitting the repository. So I think you shouldn't have to maintain any more or fewer forks following this? The [patch.crates-io] directive would need an entry for every crate, though. |
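For illustration, a downstream Cargo.toml might look roughly like the fragment below. This is only a sketch: the fork URL and branch name are hypothetical, and the exact set of crates to patch depends on which sub-crates end up in the dependency graph. Because the repository itself is not split, every entry can point at the same fork.

```toml
# Sketch only: the URL and branch below are hypothetical placeholders.
# One entry per crate that is pulled from crates.io and needs the fix;
# all entries point at the same (single) fork of the repository.
[patch.crates-io]
arrow = { git = "https://github.com/example-org/arrow-rs", branch = "my-fixes" }
arrow-schema = { git = "https://github.com/example-org/arrow-rs", branch = "my-fixes" }
parquet = { git = "https://github.com/example-org/arrow-rs", branch = "my-fixes" }
```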
I wonder is it done yet 🙏 |
arrow-row and arrow-arith then yes, will likely do tomorrow |
TL;DR: rather than fighting entropy, let's just brute-force compilation
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The arrow crate is getting rather large, and is starting to show up as a non-trivial bottleneck when compiling code, see #2170. There have been some efforts to reduce the amount of generated code, see #1858, but this is going to be a perpetual losing battle against new feature additions.
I think there are a couple of problems currently:
parquet depending on compute kernels
All these conspire to often result in an arrow-shaped hole in compilation, where CPUs are left idle.
Some numbers from my local machine
The vast majority of the time all bar a single core is idle.
Describe the solution you'd like
I would like to propose we split up the arrow crate into a number of sub-crates that are then re-exported by the top-level arrow crate. Users can then choose to depend on the batteries-included arrow crate, or on more granular crates.
Initially I would propose the following split:
There is definitely scope for splitting up the crates further after this; in particular, the comparison kernels might be a good candidate to live on their own. But I think we should start small and go from there. I suspect there is a fair amount of disentangling that will be necessary to achieve this.
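The re-export idea can be sketched in a single file, using modules as stand-ins for crates. This is a simplified illustration and not the actual layout: `arrow_array`, the trimmed-down `DataType` and `Field`, and the `Int32Array` struct here are assumptions made for the example. In a real workspace each stand-in module would be its own crate, and the facade would re-export it with `pub use`.

```rust
// Stand-in for a schema sub-crate (simplified, hypothetical contents).
// In a real workspace this would be a separate crate compiled as its own unit.
mod arrow_schema {
    #[derive(Debug, Clone, PartialEq)]
    pub enum DataType {
        Int32,
        Utf8,
    }

    #[derive(Debug, Clone, PartialEq)]
    pub struct Field {
        pub name: String,
        pub data_type: DataType,
    }
}

// Stand-in for a hypothetical array sub-crate; it depends only on the schema crate.
mod arrow_array {
    use crate::arrow_schema::DataType;

    pub struct Int32Array {
        pub values: Vec<i32>,
    }

    impl Int32Array {
        pub fn data_type(&self) -> DataType {
            DataType::Int32
        }

        pub fn len(&self) -> usize {
            self.values.len()
        }
    }
}

// Stand-in for the top-level facade crate: it defines nothing itself and
// only re-exports the sub-crates under the paths users already know.
mod arrow {
    pub mod datatypes {
        pub use crate::arrow_schema::{DataType, Field};
    }
    pub mod array {
        pub use crate::arrow_array::Int32Array;
    }
}

fn main() {
    // Code written against the facade paths...
    let column = arrow::array::Int32Array { values: vec![1, 2, 3] };
    let field = arrow::datatypes::Field {
        name: "id".to_string(),
        data_type: column.data_type(),
    };

    // ...sees exactly the same types as code depending on a sub-crate directly.
    let direct: arrow_schema::DataType = field.data_type.clone();
    assert_eq!(direct, arrow::datatypes::DataType::Int32);
    assert_ne!(direct, arrow::datatypes::DataType::Utf8);

    println!("{} has {} values of type {:?}", field.name, column.len(), direct);
}
```

Because the facade only re-exports, a type reached through `arrow::datatypes` and the same type reached through the sub-crate are identical, so users of the batteries-included crate and users of the granular crates can interoperate.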
Describe alternatives you've considered
Feature flags are another way this can be handled, however, they have a couple of limitations:
parquet crate to wait for this to compile before it can start compiling
Additional context
@jimexist recently drove an initiative to do something similar to DataFusion which has worked very well - apache/datafusion#1750
FYI @alamb @jhorstmann @nevi-me