Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce datafusion-objectstore-hdfs as optional features in the datafusion core #2111

Closed
wants to merge 2 commits into from
Closed

Conversation

yahoNanJing
Copy link
Contributor

Which issue does this PR close?

Closes #2072.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added ballista datafusion Changes in the datafusion crate labels Mar 29, 2022
@yahoNanJing
Copy link
Contributor Author

With the hdfs feature enabled, there's a testing case in datafusion-contrib/datafusion-objectstore-hdfs@17b68ff for verification

@yahoNanJing
Copy link
Contributor Author

Hi @alamb, @matthewmturner, @yjshen, @houqp, do you think whether it's possible to introduce the object store extensions as optional features like this way?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The potential issue I see with this approach is that there will be a cyclic dependency:

arrow-datafusion --> arrow-datafusion-hdfs

but also

arrow-datafusion <-- arrow-datafusion-hdfs

Is the usecase to bundle (optional) hdfs support with ballista? If so I wonder if it is time to consider splitting ballista into its own repo (so it can rely on datafusion and datafusion-contrib stuff).

Or if HDFS is important to the broader community, we can consider bringing the hdfs plugin into the core datafusion repo 🤔 (but that would also put it under the apache governance / release process)

@@ -60,9 +62,10 @@ async-trait = "0.1.41"
avro-rs = { version = "0.13", features = ["snappy"], optional = true }
chrono = { version = "0.4", default-features = false }
datafusion-common = { path = "../common", version = "7.0.0", features = ["parquet"] }
datafusion-data-access = { path = "../../data-access", version = "1.0.0" }
datafusion-data-access = { git = "https://github.com/apache/arrow-datafusion.git", rev = "41b4e491663029f653e491b110d0b5e74d08a0b6" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems a little suspicious to me (as the data access crate should be picked up locally)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @alamb, with the datafusion-data-access module spitted out, now the dependencies are as follows:

                      ballista
                           ⬇️ 
                    datafusion   ➡️   objectstore-hdfs
                           ⬇️             ⬇️ 
                       datafusion-data-access

Therefore, there's no cyclic dependency. Maybe it's better to split the datafusion-data-access into a separate repository and make a first version of release. Then both the datafusion and objectstore-hdfs can depend on the specified version of datafusion-data-access rather than leveraging the rev way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no circular dependency in the cargo dependency graph, but there is a circular dependency on how these crates gets released. The data access crate needs to be released together with datafusion right now, which is what's causing the problem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Since the data access crate is relatively stable, I'm not sure whether it's necessary to release it together with datafusion. How about splitting it to another repository, for example to the https://github.com/datafusion-contrib?

Copy link
Member

@houqp houqp Mar 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i am leaning towards moving the data access crate into contrib as well. this can further reduce datafusion release overhead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't datafusion-data-access a core part of datfusion though. The ObjectStore trait is baked in everywhere. It seems a bit weird to move that into a contrib module.

Would it be possible to leverage something like #1881 to load contrib ObjectStores as plugins?

@alamb
Copy link
Contributor

alamb commented Mar 31, 2022

Would it be possible to leverage something like #1881 to load contrib ObjectStores as plugins?

That certainly sounds like a good idea to me

I almost wonder if it is time to think about "ballista the codebase" and "ballista the packaged system" as different things -- for example, maybe we could contemplate a script / dockerfile that created ballista packages with whatever optional features were desired and pulled the code from its various sources 🤔

@houqp
Copy link
Member

houqp commented Apr 2, 2022

Isn't datafusion-data-access a core part of datfusion though. The ObjectStore trait is baked in everywhere. It seems a bit weird to move that into a contrib module.

Yeah, if we move it out, it would be treated similar to how we use and maintain the sql parser and arrow-rs crate.

Being able to load custom object stores as plugins at runtime seems like a good feature to have in the long run so folks can load proprietary object store plugins as well.

I also think these two routes are not conflicting with each other so perhaps we can do both?

@yahoNanJing
Copy link
Contributor Author

I do agree with @houqp to move the data-access out of datafusion and utilize it the same way as we use and maintain the sql parser and arrow-rs crate.

@alamb
Copy link
Contributor

alamb commented Apr 15, 2022

Marking as draft to make it clear this PR isn't waiting on review -- please feel free to mark it as ready for review if that analysis is not correct or if it is ready to review again

@alamb alamb marked this pull request as draft April 15, 2022 15:34
@andygrove andygrove removed the datafusion Changes in the datafusion crate label Jun 3, 2022
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Introduce the object stores in datafusion-contrib as optional features in the datafusion core
6 participants