Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datafusion fails to read from LanceDataset #3281

Open
dreverri opened this issue Dec 20, 2024 · 2 comments
Open

Datafusion fails to read from LanceDataset #3281

dreverri opened this issue Dec 20, 2024 · 2 comments
Labels
good first issue Good for newcomers

Comments

@dreverri
Copy link

I'm getting the following error when trying to read a LanceDB table with Datafusion:

[2024-12-20T19:10:11Z WARN  lance::dataset::write::insert] No existing dataset at /lance-dataset/data/sample-lancedb/my_table.lance, it will be created
Traceback (most recent call last):
  File "/lance-dataset/hello.py", line 23, in <module>
    main()
    ~~~~^^
  File "/lance-dataset/hello.py", line 19, in main
    df.show()
    ~~~~~~~^^
  File "/lance-dataset/.venv/lib/python3.13/site-packages/datafusion/dataframe.py", line 360, in show
    self.df.show(num)
    ~~~~~~~~~~~~^^^^^
Exception: External error: TypeError: LanceFragment.scanner() takes 1 positional argument but 2 positional arguments (and 3 keyword-only arguments) were given

I'm not sure if this is an issue with LanceDataset or Datafusion or if I am just doing something wrong.

Here is the code:

from datafusion import SessionContext
import lancedb


def main():
    uri = "data/sample-lancedb"
    db = lancedb.connect(uri)

    data = [
        {"vector": [3.1, 4.1], "item": "foo", "price": 10.0},
        {"vector": [5.9, 26.5], "item": "bar", "price": 20.0},
    ]

    tbl = db.create_table("my_table", data=data, mode="overwrite")

    ctx = SessionContext()
    ctx.register_dataset("my_table", tbl.to_lance())
    df = ctx.table("my_table")
    df.show()


if __name__ == "__main__":
    main()
@westonpace
Copy link
Contributor

Looks like you're using datafusion's pyarrow integration to read from a pyarrow dataset. Lance mimics a pyarrow dataset. This is how we are able to be queried from DuckDb. However, it seems that we don't mimic it faithfully enough 😄 and so Datafusion is getting confused.

I seem to recall digging into this a while back and Datafusion want to split up the dataset into fragments and query it that way and we didn't really flesh out the pyarrow fragment integration completely.

So there are two options we can take to fix this. First, we could fix up the python interface to more faithfully mimic pyarrow dataset but pyarrow dataset wasn't really intended to be a standard / protocol and there are a few limitations with this approach:

  • You won't get the proper parallelism on reads
  • Filters are not pushed down (or maybe they are but only a limited subset are supported)
  • Some python overhead (not sure if it is per-batch overhead or not but it might be and that could be significant for some queries)

A different approach (now that apache/datafusion-python#823 has merged) would be to do something like this: https://github.com/delta-io/delta-rs/pull/3012/files

That would be limited to newer versions of datafusion python (43.1 and above) but would overcome the above drawbacks and be easier to maintain.

@westonpace
Copy link
Contributor

(to be clear, both approaches will require changes to Lance)

@eddyxu eddyxu added the good first issue Good for newcomers label Jan 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

3 participants