-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce datafusion-objectstore-hdfs as optional features in the datafusion core #2111
Conversation
…tional features in the datafusion core
With the hdfs feature enabled, there's a testing case in datafusion-contrib/datafusion-objectstore-hdfs@17b68ff for verification |
Hi @alamb, @matthewmturner, @yjshen, @houqp, do you think whether it's possible to introduce the object store extensions as optional features like this way? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The potential issue I see with this approach is that there will be a cyclic dependency:
arrow-datafusion --> arrow-datafusion-hdfs
but also
arrow-datafusion <-- arrow-datafusion-hdfs
Is the usecase to bundle (optional) hdfs support with ballista? If so I wonder if it is time to consider splitting ballista into its own repo (so it can rely on datafusion and datafusion-contrib stuff).
Or if HDFS is important to the broader community, we can consider bringing the hdfs plugin into the core datafusion repo 🤔 (but that would also put it under the apache governance / release process)
@@ -60,9 +62,10 @@ async-trait = "0.1.41" | |||
avro-rs = { version = "0.13", features = ["snappy"], optional = true } | |||
chrono = { version = "0.4", default-features = false } | |||
datafusion-common = { path = "../common", version = "7.0.0", features = ["parquet"] } | |||
datafusion-data-access = { path = "../../data-access", version = "1.0.0" } | |||
datafusion-data-access = { git = "https://github.com/apache/arrow-datafusion.git", rev = "41b4e491663029f653e491b110d0b5e74d08a0b6" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change seems a little suspicious to me (as the data access crate should be picked up locally)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @alamb, with the datafusion-data-access module spitted out, now the dependencies are as follows:
ballista ⬇️ datafusion ➡️ objectstore-hdfs ⬇️ ⬇️ datafusion-data-access
Therefore, there's no cyclic dependency. Maybe it's better to split the datafusion-data-access into a separate repository and make a first version of release. Then both the datafusion and objectstore-hdfs can depend on the specified version of datafusion-data-access rather than leveraging the rev way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no circular dependency in the cargo dependency graph, but there is a circular dependency on how these crates gets released. The data access crate needs to be released together with datafusion right now, which is what's causing the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Since the data access crate is relatively stable, I'm not sure whether it's necessary to release it together with datafusion. How about splitting it to another repository, for example to the https://github.com/datafusion-contrib?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, i am leaning towards moving the data access crate into contrib as well. this can further reduce datafusion release overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't datafusion-data-access
a core part of datfusion though. The ObjectStore
trait is baked in everywhere. It seems a bit weird to move that into a contrib module.
Would it be possible to leverage something like #1881 to load contrib ObjectStores
as plugins?
That certainly sounds like a good idea to me I almost wonder if it is time to think about "ballista the codebase" and "ballista the packaged system" as different things -- for example, maybe we could contemplate a script / dockerfile that created ballista packages with whatever optional features were desired and pulled the code from its various sources 🤔 |
Yeah, if we move it out, it would be treated similar to how we use and maintain the sql parser and arrow-rs crate. Being able to load custom object stores as plugins at runtime seems like a good feature to have in the long run so folks can load proprietary object store plugins as well. I also think these two routes are not conflicting with each other so perhaps we can do both? |
I do agree with @houqp to move the data-access out of datafusion and utilize it the same way as we use and maintain the sql parser and arrow-rs crate. |
Marking as draft to make it clear this PR isn't waiting on review -- please feel free to mark it as ready for review if that analysis is not correct or if it is ready to review again |
Which issue does this PR close?
Closes #2072.
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?