Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating IVF_PQ index on vector column with nulls fails if distance_type = "dot" #3402

Closed
wjones127 opened this issue Jan 21, 2025 · 1 comment · Fixed by #3404
Closed

Creating IVF_PQ index on vector column with nulls fails if distance_type = "dot" #3402

wjones127 opened this issue Jan 21, 2025 · 1 comment · Fixed by #3404
Assignees
Labels
bug Something isn't working

Comments

@wjones127
Copy link
Contributor

import pyarrow as pa
import pyarrow.compute as pc
import lance

nrows = 10_000
dims = 16
values = pc.random(nrows * dims).cast(pa.float32())
vectors = pa.FixedSizeListArray.from_arrays(values, dims)
set_null = pc.less(pc.random(nrows), 0.7) # 70% nulls
vectors = pc.if_else(set_null, pa.scalar(None), vectors)
table = pa.table({"vectors": vectors})

ds = lance.write_dataset(table, "memory://")
# Succeeds:
ds.create_index("vectors", "IVF_PQ", metric="l2", num_partitions=2, num_sub_vectors=2)

# Fails:
ds.create_index("vectors", "IVF_PQ", metric="dot", num_partitions=2, num_sub_vectors=2, replace=True)
thread '<unnamed>' panicked at /Users/runner/work/lance/lance/rust/lance-index/src/vector/pq.rs:373:9:
assertion `left == right` failed
  left: 7045
 right: 0
stack backtrace:
   0: _rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
   4: <lance_index::vector::pq::ProductQuantizer as lance_index::vector::quantizer::Quantization>::build
   5: lance::index::vector::ivf::build_ivf_pq_index::{{closure}}
   6: lance::index::vector::build_vector_index::{{closure}}::{{closure}}
   7: lance::index::<impl lance_index::traits::DatasetIndexExt for lance::dataset::Dataset>::create_index::{{closure}}::{{closure}}
   8: lance::index::<impl lance_index::traits::DatasetIndexExt for lance::dataset::Dataset>::create_index::{{closure}}
   9: lance::executor::BackgroundExecutor::result_or_interrupt::{{closure}}
  10: lance::executor::BackgroundExecutor::block_on
  11: lance::dataset::Dataset::create_index
  12: lance::dataset::Dataset::__pymethod_create_index__
  13: pyo3::impl_::trampoline::trampoline
  14: lance::dataset::<impl pyo3::impl_::pyclass::PyMethods<lance::dataset::Dataset> for pyo3::impl_::pyclass::PyClassImplCollector<lance::dataset::Dataset>>::py_methods::ITEMS::trampoline
  15: __PyEval_EvalFrameDefault
  16: _PyEval_EvalCode
  17: _run_mod
  18: _PyRun_InteractiveOneObjectEx
  19: __PyRun_InteractiveLoopObject
  20: __PyRun_AnyFileObject
  21: _PyRun_AnyFileExFlags
  22: _Py_RunMain
  23: _pymain_main
  24: _main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/willjones/mambaforge/envs/lancedb-dev/lib/python3.11/site-packages/lance/dataset.py", line 2207, in create_index
    self._ds.create_index(
pyo3_runtime.PanicException: assertion `left == right` failed
  left: 7045
 right: 0
@wjones127 wjones127 added the bug Something isn't working label Jan 21, 2025
@wjones127 wjones127 self-assigned this Jan 21, 2025
@wjones127
Copy link
Contributor Author

The root cause is that we don't ignore nulls when sampling training data. The fact that this works for L2 and Cosine appears to largely be an accident, and may be producing incorrect values.

We will change the sampling method to only sample non-null rows.

wjones127 added a commit that referenced this issue Jan 28, 2025
We were not filtering out null values when sampling. Because we often
call `array.values()` on Arrow arrays, which ignores the null bitmap, we
are often silently treating the nulls as zeros (or possibly undefined
values). Only thing that caught these nulls is an assertion. However,
residualization occurring with L2 and Cosine often meant that these
values were transformed and null information was lost before the
assertion, which is why it got past previous unit tests.

This PR adds more assertions validating there aren't nulls, and makes
sure the sampling code handles null vectors.

Closes #3402
Closes #3400
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant