Why not use DataFusion as the backend engine (rather than rewriting it all from scratch)? #3319

alamb · 2024-11-13T17:19:24Z

Is your feature request related to a problem?

I read this blog post. Very cool (and thanks, @MrPowers, for pointing it out):

https://dataengineeringcentral.substack.com/p/introduction-to-daft-vs-polars

The experience of reading from S3, handling globs, etc. looks pretty amazing with Daft 💯 Nice work!

I have a question: if your focus is on an amazing experience working with objects on a remote object store, why also build your own Rust / Arrow based vectorized query engine (yet again) when there are already existing engines?

I can't help but notice there is a substantial amount of similarity between the current structure of your code and DataFusion (Arrow, LogicalPlan, PhysicalPlan, etc) -- likely as a consequence of it being fairly well understood how to build columnar query engines. It is

I also saw several issues that have started enumerating features that are missing in Daft that are already in DataFusion (along with tests, etc)

Describe the solution you'd like

I propose (very selfishly, of course, as a PMC member) that you build on Apache DataFusion and, better yet, help us extend it to make it even better.

You would likely get a full-featured, fast engine, and instead of reinventing all the standard, low-level processing operations you could focus on making the DataFrame experience of processing files remotely on object stores amazing.

I can tell you from experience that building and maturing an execution engine takes a lot of effort, and it is great to do it with a big and established community.

Describe alternatives you've considered

I would also love to know if you have thought about DataFusion and chose a different approach anyway, and any thoughts you were willing to share about why you made the choice (so we can see if we can make it easier for future startups to choose to build with DataFusion).

Additional Context

Happy hacking!

Would you like to implement a fix?

No

maruschin · 2024-11-13T17:26:34Z

Yeah, looks like a reinvention of DataFusion

MrPowers · 2024-11-13T19:08:06Z

Speaking from experience, the DataFusion and Daft communities are both awesome to work with!

We have a weekly DataFusion call. Perhaps the Daft devs could join one week to brainstorm!

samster25 · 2024-11-13T23:35:21Z

Maintainer

Hey @alamb! Really exciting to hear from you :) Thanks for making this thread. Your kind words mean a lot to us and the team.

We’re huge fans of the DataFusion project and all the extremely exciting and inspiring work that’s been going on there. A lot of our work has been very obviously inspired by work in DataFusion.

> any thoughts you were willing to share about why you made the choice

When we first started transitioning Daft to Rust about 2 years ago, the ecosystem was a lot more fragmented and it wasn’t yet clear what we should adopt. At the time, DataFusion and arrow-rs were very young projects and didn’t support some of the requirements that we had.

The key things here were:

- Support for our own type system, covering our own logical types like images and tensors as well as out-of-band types like Python objects.
- We were originally executing kernels in a non-async environment.
- Strict requirements on being able to serialize physical plans for a distributed environment.
- We wanted a transmute-free library for our in-memory representation. (I was burned by arrow-cpp lol)
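
To make the first requirement a bit more concrete: a logical type such as a tensor can be thought of as an interpretation layered over plain physical storage. The sketch below only illustrates that idea under simplifying assumptions; `TensorType` and `TensorColumn` are hypothetical names and are not taken from Daft, arrow2, or DataFusion.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TensorType:
    """Hypothetical logical type: a fixed-shape tensor of floats."""
    shape: Tuple[int, ...]

    def size(self) -> int:
        # Number of scalar values stored per row.
        n = 1
        for dim in self.shape:
            n *= dim
        return n


@dataclass
class TensorColumn:
    """Logical tensor column backed by one flat list of floats."""
    dtype: TensorType
    values: List[float]  # physical storage: all rows concatenated, row-major

    def __len__(self) -> int:
        return len(self.values) // self.dtype.size()

    def row(self, i: int) -> List[float]:
        # Slice the flat physical buffer to recover the i-th logical value.
        if not 0 <= i < len(self):
            raise IndexError(i)
        n = self.dtype.size()
        return self.values[i * n:(i + 1) * n]


col = TensorColumn(TensorType(shape=(2, 2)), [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
second = col.row(1)
```

The engine's kernels only ever see the flat physical buffer; the logical type carries the metadata (here, the shape) needed to interpret it, which is the kind of extension point a pluggable type system has to expose.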

We ultimately decided on using arrow2 for the in-memory representation and kernels, but all the key query engine components were still in Python. As time went on, we moved more and more from Python land into Rust, such as our DSL, Logical Plan and Physical Plan.

> (so we can see if we can make it easier for future startups to choose to build with DataFusion)

Today that is less clear. DataFusion has an amazing community and has made big leaps within the last 2 years. However, we’ve also invested heavily in a lot of our own stack, which makes a migration both costly and difficult to justify. If we were to start a new project today, I think our vote would have been unequivocally to start off with DataFusion. DataFusion makes it incredibly easy to get a project off the ground.

In reality however, we’re a small team with big distributed ambitions and unfortunately DataFusion integration is more of a nice-to-have right now rather than a core accelerant.

That being said, there are a few items on our roadmap which we are actively exploring, which should bring us much closer to the rest of the community.

1. Migration from arrow2 to arrow-rs: arrow2 has been pretty much abandoned by the community. We would love to move Daft to arrow-rs and contribute back to it.
2. Leverage DataFusion's Logical Plan and Optimizer: as you mentioned, DataFusion already has a ton of optimizer rules and is quite robust. This is not really an area we feel the need to innovate on. The key requirement there for us is being able to plug in our own type system, logical types and expressions.
3. Data Catalog and Table Format integrations: we’re very excited about this and have been on the forefront of integrating with a lot of these technologies (outside of the JVM). Our current integrations are in Python, but we do want to transition to more native Rust integrations and are in talks with the relevant Rust projects here.
4. Pluggable engines: we are just about to finish up with our new local execution engine, and at some point will be adding other engines such as a newly designed distributed engine, or even a proprietary one. Plugging in DataFusion here as a POC could be interesting, and we could even try that on a distributed runner.
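
For readers less familiar with the optimizer rules mentioned in the roadmap above: a rule is essentially a function that rewrites one logical plan into an equivalent, cheaper one. The toy sketch below pushes a filter beneath a column-selecting projection. It is an illustration only, not DataFusion's or Daft's optimizer API; `Scan`, `Project`, `Filter`, and `push_filter_below_project` are hypothetical names, and the filter is assumed to be a simple `column > min_value` test.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Scan:
    table: str


@dataclass
class Project:
    child: "Plan"
    columns: List[str]  # plain column selection, no renaming or expressions


@dataclass
class Filter:
    child: "Plan"
    column: str
    min_value: int  # keeps rows where column > min_value


Plan = Union[Scan, Project, Filter]


def push_filter_below_project(plan: Plan) -> Plan:
    """Rewrite Filter(Project(x)) into Project(Filter(x)) where that is safe."""
    if isinstance(plan, Scan):
        return plan
    if isinstance(plan, Project):
        return Project(push_filter_below_project(plan.child), plan.columns)
    # plan is a Filter: optimize its input first, then try to push it down.
    child = push_filter_below_project(plan.child)
    if isinstance(child, Project) and plan.column in child.columns:
        # The projection keeps the filtered column unchanged, so the filter
        # can run on the projection's input instead.
        pushed = push_filter_below_project(
            Filter(child.child, plan.column, plan.min_value)
        )
        return Project(pushed, child.columns)
    return Filter(child, plan.column, plan.min_value)


plan = Filter(Project(Scan("events"), ["a", "b"]), "a", 10)
optimized = push_filter_below_project(plan)
```

A real optimizer applies many such rules repeatedly, and each rule has to understand every expression and type it encounters, which is why being able to plug in custom types and expressions is the key requirement here.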

I think (4) is potentially the most interesting story, but will obviously require a lot of work and investment from both Daft and DataFusion to make happen. We’ve spent a lot of time, effort, money, blood, sweat, tears, etc... into thinking about distributed and cloud computing and making Daft work at Terabyte -> Petabyte scale. I know there has been interest within the DataFusion community to extend into that space (e.g. datafusion-ray, ballista etc).

Happy to explore what a collaboration there would look like and maybe lay out what the benefits of adopting DataFusion would be for the Daft project.

Would love to attend the DataFusion call btw :)

alamb · 2024-11-14T06:59:33Z

Author

Thanks for the response @samster25 🤗

> When we first started transitioning Daft to Rust about 2 years ago, the ecosystem was a lot more fragmented and it wasn’t yet clear what we should adopt. At the time, DataFusion and arrow-rs were very young projects and didn’t support some of the requirements that we had.

This makes total sense 👍

Also makes sense. For anyone following along, here are some feature updates from the DataFusion side (I am absolutely trying to convince anyone reading this to build with DataFusion :) )

> Support for our own type system for our own logical types like images and tensors as well as out of band types like Python objects.

This is still an area where DataFusion is not as flexible as we might like. Thankfully @notfilippo, @jayzhan211, @findepi and others are working on this (see apache/datafusion#12853 and related).

> Strict requirements on being able to serialize physical plans for a distributed environment.

This is now fully supported via datafusion-proto.
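
For anyone unfamiliar with why plan serialization matters: a distributed scheduler has to ship a physical plan to workers as bytes and rebuild it there. The toy round trip below illustrates only that idea. It does not use the datafusion-proto API (which is protobuf-based); `PlanNode`, `plan_to_bytes`, and `plan_from_bytes` are hypothetical names, and JSON stands in for the wire format.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PlanNode:
    """Hypothetical physical plan node: operator name, parameters, children."""
    op: str
    params: Dict[str, Any] = field(default_factory=dict)
    children: List["PlanNode"] = field(default_factory=list)


def plan_to_bytes(node: PlanNode) -> bytes:
    """Encode a plan tree as UTF-8 JSON so it can be sent to a worker."""
    def encode(n: PlanNode) -> Dict[str, Any]:
        return {
            "op": n.op,
            "params": n.params,
            "children": [encode(c) for c in n.children],
        }
    return json.dumps(encode(node), sort_keys=True).encode("utf-8")


def plan_from_bytes(data: bytes) -> PlanNode:
    """Rebuild the plan tree on the receiving side."""
    def decode(d: Dict[str, Any]) -> PlanNode:
        return PlanNode(d["op"], d["params"], [decode(c) for c in d["children"]])
    return decode(json.loads(data.decode("utf-8")))


plan = PlanNode(
    "filter",
    {"predicate": "x > 10"},
    [PlanNode("scan", {"path": "s3://bucket/data/*.parquet"})],
)
restored = plan_from_bytes(plan_to_bytes(plan))
```

The hard part in a real engine is that every operator, expression, and user-defined type needs a stable encoding, which is why custom logical types and Python objects make this requirement strict.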

> We wanted a transmute free library for our in-memory representation. (I was burned by arrow-cpp lol)

I think strictly speaking arrow-rs has one use of transmute left (but it is used to create a floating point constant, so likely not the horror 😱 show it might sound like)

Over time we (well, really largely @tustvold) worked hard to bring the best of the design patterns from arrow2 into arrow-rs, and for what it is worth we haven't had any safety issues / crashes reported in over 2 years to my knowledge. There is still unsafe code (as there is in most performance critical code, including arrow2) but it is used carefully

> (so we can see if we can make it easier for future startups to choose to build with DataFusion)

> Today that is less clear. DataFusion has an amazing community and has made big leaps within the last 2 years. However, we’ve also invested heavily in a lot of our own stack, which makes a migration both costly and difficult to justify. If we were to start a new project today, I think our vote would have been unequivocally to start off with DataFusion. DataFusion makes it incredibly easy to get a project off the ground.

❤️

> In reality however, we’re a small team with big distributed ambitions and unfortunately DataFusion integration is more of a nice-to-have right now rather than a core accelerant.

Indeed -- rewriting / replacing an already working system is not something to be done lightly. As you say, the core story would have to be that it accelerates your new features somehow.

> That being said, there are a few items on our roadmap which we are actively exploring, which should bring us much closer to the rest of the community.

> 1. Migration from arrow2 to arrow-rs: arrow2 has been pretty much abandoned by the community. We would love to move Daft to arrow-rs and contribute back to it.

> 2. Leverage DataFusion's Logical Plan and Optimizer: as you mentioned, DataFusion already has a ton of optimizer rules and is quite robust. This is not really an area we feel the need to innovate on. The key requirement there for us is being able to plug in our own type system, logical types and expressions.

Maybe https://substrait.io/ or something similar could help here 🤔 (DataFusion supports Substrait)

> 3. Data Catalog and Table Format integrations: we’re very excited about this and have been on the forefront of integrating with a lot of these technologies (outside of the JVM). Our current integrations are in Python, but we do want to transition to more native Rust integrations and are in talks with the relevant Rust projects here.

This is an area I want to make easier for DataFusion too (Iceberg and Delta Lake specifically). I think it is still too hard to integrate these pieces into a system that "just works" and reads Iceberg tables, etc.

We have some cool stuff coming up with the new FFI bindings from @timsaucer (see apache/datafusion#13175 / https://crates.io/crates/datafusion-ffi). It is still early

> 4. Pluggable engines: we are just about to finish up with our new local execution engine, and at some point will be adding other engines such as a newly designed distributed engine, or even a proprietary one. Plugging in DataFusion here as a POC could be interesting, and we could even try that on a distributed runner.

> I think (4) is potentially the most interesting story, but will obviously require a lot of work and investment from both Daft and DataFusion to make happen. We’ve spent a lot of time, effort, money, blood, sweat, tears, etc... into thinking about distributed and cloud computing and making Daft work at Terabyte -> Petabyte scale. I know there has been interest within the DataFusion community to extend into that space (e.g. datafusion-ray, ballista etc).

> Happy to explore what a collaboration there would look like and maybe lay out what the benefits of adopting DataFusion would be for the Daft project.

I think the basic story would be "you don't have to go reimplement the guts of the query engine" -- there are also several other people working on DataFusion based distributed engines, so I expect that supporting infrastructure for that will improve over time.

> Would love to attend the DataFusion call btw :)

We would love to have you!
