Bug when specifying nearest and filters #685

Closed
cemoody opened this issue Mar 15, 2023 · 6 comments

@cemoody

cemoody commented Mar 15, 2023

Was trying to use the ANN search with prefilters:

import lance
import numpy as np
import pyarrow.compute as pc

filters = pc.greater(pc.field('price'), 100.0)
ds = lance.dataset('2.lance')
ds.to_table(filter=filters, nearest={"column": "vector", "q": np.random.random(768)})

And I get the error:

File ~/.pyenv/versions/3.8.16/envs/vector_search/lib/python3.8/site-packages/lance/dataset.py:475, in LanceScanner.to_table(self)
    471 def to_table(self) -> pa.Table:
    472     """
    473     Read the data into memory and return a pyarrow Table.
    474     """
--> 475     return self.to_reader().read_all()

File ~/.pyenv/versions/3.8.16/envs/vector_search/lib/python3.8/site-packages/pyarrow/ipc.pxi:750, in pyarrow.lib.RecordBatchReader.read_all()

File ~/.pyenv/versions/3.8.16/envs/vector_search/lib/python3.8/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: ArrowArray struct has 14 children, expected 15 for type struct<product_id: int64, title: string, handle: string, image_id: double, image_url: string, price: double, shop_id: int64, shop_slug: string, country_code: string, currency_code: string, image: string, major_centroid_id: int64, __index_level_0__: int64, vector: fixed_size_list<item: float>[768], score: float not null>
@changhiskhan
Contributor

Thanks for the report! This is a bug in the combined I/O plan. My guess is that the score column was omitted in this branch. Let me add a unit test to repro and investigate further. If you're blocked, a workaround (though very inefficient) would be to filter the results after the nearest call.

changhiskhan added a commit that referenced this issue Mar 15, 2023
changhiskhan added a commit that referenced this issue Mar 16, 2023
* failing unit test to repro #685

* fix

* address PR comments
@changhiskhan
Contributor

changhiskhan commented Mar 16, 2023

@cemoody v0.3.15 has been released with #686 . Lmk if the fix works for you (should not need to re-write data).

(pending pypi uploads here: https://github.com/eto-ai/lance/actions/runs/4434079317)

@cemoody
Author

cemoody commented Mar 16, 2023

Great! Trying it out now...

@changhiskhan
Contributor

Oh also, just to clarify: these are still post-filters, meaning they filter the ANN results. Combined filtering is on the roadmap after OPQ.
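
To make the post-filter vs. pre-filter distinction concrete, here is a rough brute-force sketch in NumPy. It does not use lance or an ANN index; the `top_k` helper and the toy data are assumptions made purely for illustration.

```python
import numpy as np


def top_k(vectors: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k vectors closest to query (exact L2 search)."""
    dists = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(dists)[:k]


# Toy data: five 1-d vectors with an associated price each.
vectors = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
prices = np.array([50.0, 60.0, 150.0, 200.0, 300.0])
query = np.array([0.0])
k = 3

# Post-filter: take the k nearest first, then drop rows failing the predicate.
nearest = top_k(vectors, query, k)
post = nearest[prices[nearest] > 100.0]

# Pre-filter: apply the predicate first, then take the k nearest survivors.
candidates = np.flatnonzero(prices > 100.0)
pre = candidates[top_k(vectors[candidates], query, k)]
```

With this data the post-filter keeps only one of the three nearest rows, while the pre-filter still returns three matching rows, which is why post-filtered queries can come back with fewer than k results.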

@cemoody
Author

cemoody commented Mar 16, 2023

Works great! About 4-5x faster than postfiltering for me :)

@changhiskhan
Contributor

Sweet!

Will mark this as resolved. As always, thanks for the feedback and bug reports!
