Replies: 4 comments
-
Yeah, looks like a reinvention of DataFusion.
-
Speaking from experience, the DataFusion and Daft communities are both awesome to work with! We have a weekly DataFusion call. Perhaps the Daft devs could join one week to brainstorm!
-
Hey @alamb! Really exciting to hear from you :) Thanks for making this thread. Your kind words mean a lot to us and the team. We’re huge fans of the DataFusion project and all the exciting and inspiring work that’s been going on there. A lot of our work has very obviously been inspired by DataFusion.
When we first started transitioning Daft to Rust about 2 years ago, the ecosystem was a lot more fragmented and it wasn’t yet clear what we should adopt. At the time, DataFusion and arrow-rs were very young projects and didn’t support some of the requirements that we had. The key things here were:
We ultimately decided on using arrow2 for the in-memory representation and kernels, but all the key query engine components were still in Python. As time went on, we moved more and more from Python land into Rust, including our DSL, Logical Plan and Physical Plan.
Today that is less clear. DataFusion has an amazing community and has made big leaps within the last 2 years. However, we’ve also invested heavily in our own stack, which makes a migration both costly and difficult to justify. If we were to start a new project today, I think our vote would unequivocally be to start off with DataFusion. DataFusion makes it incredibly easy to get a project off the ground. In reality, however, we’re a small team with big distributed ambitions, and unfortunately DataFusion integration is more of a nice-to-have right now than a core accelerant. That being said, there are a few items on our roadmap which we are actively exploring and which should bring us much closer to the rest of the community.
I think (4) is potentially the most interesting story, but will obviously require a lot of work and investment from both Daft and DataFusion to make happen. We’ve put a lot of time, effort, money, blood, sweat, tears, etc. into thinking about distributed and cloud computing and making Daft work at terabyte to petabyte scale. I know there has been interest within the DataFusion community in extending into that space (e.g. datafusion-ray, ballista etc). Happy to explore what a collaboration there would look like and maybe lay out what the benefits of adopting DataFusion would be for the Daft project. Would love to attend the DataFusion call btw :)
-
Thanks for the response @samster25 🤗
This makes total sense 👍 Also makes sense. For anyone following along, here are some feature updates from the DataFusion side (I am absolutely trying to convince anyone reading this to build with DataFusion :) )
This is still an area where DataFusion is not as flexible as we might like. Thankfully @notfilippo, @jayzhan211, @findepi and others are working on this (see apache/datafusion#12853 and related)
This is now fully supported via datafusion-proto
I think strictly speaking Over time we (well, really largely @tustvold) worked hard to bring the best of the design patterns from arrow2 into arrow-rs, and for what it is worth we haven't had any safety issues / crashes reported in over 2 years to my knowledge. There is still
❤️
Indeed -- rewriting / replacing an already working system is not something to be done lightly. As you say, the core story would have to be that it accelerates your new features somehow.
Maybe https://substrait.io/ or something similar could help here 🤔 (DataFusion supports substrait)
This is also an area I want to make easier for DataFusion too (iceberg and deltalake specifically). I think it is still too hard to integrate these pieces into a system that "just works" and reads iceberg tables, etc. We have some cool stuff coming up with the new FFI bindings from @timsaucer (see apache/datafusion#13175 / https://crates.io/crates/datafusion-ffi). It is still early
I think the basic story would be "you don't have to go reimplement the guts of the query engine" -- there are also several other people working on DataFusion based distributed engines, so I expect that supporting infrastructure for that will improve over time.
We would love to have you
-
Is your feature request related to a problem?
I read this blog. Very cool (and thanks @MrPowers for pointing it out)
https://dataengineeringcentral.substack.com/p/introduction-to-daft-vs-polars
The experience of reading from S3 and handling globs, etc looks pretty amazing with Daft 💯 nice work
I have a question: If your focus is on an amazing experience working with objects on remote object store, why also build your own Rust / Arrow based vectorized query engine (yet again) when there are already existing engines?
I can't help but notice there is a substantial amount of similarity between the current structure of your code and DataFusion (Arrow, LogicalPlan, PhysicalPlan, etc) -- likely as a consequence of it being fairly well understood how to build columnar query engines. I also saw several issues that have started enumerating features that are missing in Daft that are already in DataFusion (along with tests, etc)
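For anyone following along who has not seen that shared structure before, here is a toy Python sketch of the usual split: a logical plan that says what to compute, and a physical plan that says how to run it over columnar batches. This is not code from Daft or DataFusion. Every class and function name below is hypothetical, and a plain dict of lists stands in for an Arrow record batch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Toy stand-in for an Arrow record batch: column name -> list of values.
Batch = Dict[str, List]


# --- Logical plan: WHAT to compute (hypothetical node types) ---
@dataclass
class Scan:
    table: str


@dataclass
class Filter:
    input: object
    column: str
    predicate: Callable[[object], bool]


@dataclass
class Project:
    input: object
    columns: List[str]


# --- Physical plan: HOW to compute it (hypothetical operators) ---
class ScanExec:
    def __init__(self, batch: Batch):
        self.batch = batch

    def execute(self) -> Batch:
        return self.batch


class FilterExec:
    def __init__(self, child, column: str, predicate):
        self.child, self.column, self.predicate = child, column, predicate

    def execute(self) -> Batch:
        batch = self.child.execute()
        # Evaluate the predicate once per row, then apply the mask to every column.
        mask = [self.predicate(v) for v in batch[self.column]]
        return {
            name: [v for v, keep in zip(values, mask) if keep]
            for name, values in batch.items()
        }


class ProjectExec:
    def __init__(self, child, columns: List[str]):
        self.child, self.columns = child, columns

    def execute(self) -> Batch:
        batch = self.child.execute()
        return {name: batch[name] for name in self.columns}


def plan(node, catalog: Dict[str, Batch]):
    """Translate a logical plan tree into a tree of physical operators."""
    if isinstance(node, Scan):
        return ScanExec(catalog[node.table])
    if isinstance(node, Filter):
        return FilterExec(plan(node.input, catalog), node.column, node.predicate)
    if isinstance(node, Project):
        return ProjectExec(plan(node.input, catalog), node.columns)
    raise TypeError(f"unknown logical node: {node!r}")


# Made-up in-memory table, purely for illustration.
catalog = {"events": {"user": ["a", "b", "c"], "bytes": [10, 250, 900]}}
logical = Project(Filter(Scan("events"), "bytes", lambda b: b > 100), ["user"])
result = plan(logical, catalog).execute()
print(result)  # {'user': ['b', 'c']}
```

A real engine adds an optimizer between the two layers, typed expressions, and streaming or partitioned execution, which is where most of the engineering effort goes.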
Describe the solution you'd like
I propose (very selfishly, of course, as a PMC member) that you build on Apache DataFusion and, better yet, help us extend it to make it even better.
You would likely get a full-featured, fast engine, and instead of reinventing all the standard, low-level operations for processing, you could focus on making the DataFrame experience of processing files remotely on object store amazing.
I can tell you from experience that building and maturing an execution engine takes a lot of effort, and it is great to do it with a big and established community.
Describe alternatives you've considered
I would also love to know if you have thought about DataFusion and chose a different approach anyway, along with any thoughts you are willing to share about why you made that choice (so we can see if we can make it easier for future startups to choose to build with DataFusion)
Additional Context
Happy hacking!
Would you like to implement a fix?
No