Support for a custom table provider #941

Closed
westonpace opened this issue Nov 6, 2024 · 2 comments · Fixed by #921
Labels
enhancement New feature or request

Comments

@westonpace
Member

This is similar to #920 but maybe more specific. Lance (https://github.com/lancedb/lance) has a custom table provider and I was interested in using datafusion-python with this table provider. However, I'm not sure there is an easy solution.

I was hoping, in Lance's python bindings, I could just do something like...

use datafusion_python::context::PySessionContext;
use pyo3::prelude::*;

// Exposed as a free function, which is how the Python snippet calls it.
#[pyfunction]
pub fn register_datafusion(ctx: &PySessionContext, tbl_name: String, ds_uri: String) -> PyResult<()> {
    // ...
}

Then use this in python as:

from datafusion import SessionContext
from .lance import register_datafusion

ctx = SessionContext()
register_datafusion(ctx.ctx, "my_tbl", "some_uri")

Unfortunately, this leads to:

TypeError: argument 'ctx': 'SessionContext' object cannot be converted to 'SessionContext'

I suspect the problem is that the SessionContext linked into lance's python module is different from the SessionContext linked into datafusion_python's python module.
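The suspicion above can be illustrated with a minimal pure-Python sketch. It is an analogy only, not how PyO3 is implemented: it assumes each extension module ends up with its own copy of the class and that argument conversion checks against that copy. The module names and the helper `make_session_context_class` are hypothetical.

```python
def make_session_context_class(module_name):
    # Hypothetical stand-in for a SessionContext class compiled into one
    # extension module; each call creates a brand-new, distinct type.
    return type("SessionContext", (), {"__module__": module_name})


# Two classes with the same name, owned by two different (hypothetical) modules.
DataFusionCtx = make_session_context_class("datafusion._internal")
LanceCtx = make_session_context_class("lance._internal")


def register_datafusion(ctx, tbl_name, ds_uri):
    # Mimics a strict conversion: only the copy of the class known to
    # "this module" is accepted, regardless of the class name.
    if not isinstance(ctx, LanceCtx):
        raise TypeError(
            f"argument 'ctx': '{type(ctx).__name__}' object "
            f"cannot be converted to '{LanceCtx.__name__}'"
        )
    return tbl_name


try:
    register_datafusion(DataFusionCtx(), "my_tbl", "some_uri")
except TypeError as exc:
    message = str(exc)
    print(message)
```

Both types are named SessionContext, so the message names the same type on both sides, just like the error quoted above, even though the two classes are unrelated objects.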

Here are a few thoughts off the top of my head, though maybe there is something easier that I am missing.

  1. Add Lance to datafusion-python

A simple, but not ideal, solution is to just add lance as a dependency of datafusion-python. However, I'm assuming that the datafusion-python project doesn't want third-party dependencies.

  2. Use pyarrow dataset as a "dataset protocol"

The "dataset protocol" was never quite finished, but pyarrow datasets can more or less serve as one. This is what I've ended up using for the time being: LanceDataset already duck types as a pyarrow dataset, so passing it to register_dataset works, but it's not as flexible.

  3. Add support via datafusion-federation

I'm not entirely sure this is possible but it seems the datafusion-federation project may have a way of handling abstract table providers over Substrait. datafusion-python could add datafusion-federation as a dependency to allow a register_federated method.

@westonpace westonpace added the enhancement New feature or request label Nov 6, 2024
@timsaucer
Contributor

We will have this supported in datafusion-python 43.0.0! As soon as upstream updates, we will make a few small changes to #921 and get that merged in. Then we will be able to support this. Upstream datafusion 43.0.0 is under release review right now.

@westonpace
Member Author

A stable FFI for table providers! I had no idea. Very cool.
