Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing column when both nearest and filter are applied #686

Merged
merged 3 commits into from
Mar 16, 2023

Conversation

changhiskhan
Copy link
Contributor

@changhiskhan changhiskhan commented Mar 15, 2023

Addresses #685

  • reproduce issue
  • RCA => score column is missing due to incorrect projection when filters are present for ANN.
  • Fix

@eddyxu the fix is hacky due to the lack of formal contracts on the io exec nodes. That's why you see the weird ann_schema variable being passed to the local take node. Happy to walk through the logic with you over zoom if that's quicker.

@changhiskhan changhiskhan requested a review from eddyxu March 16, 2023 04:05
@changhiskhan changhiskhan marked this pull request as ready for review March 16, 2023 04:05
let score_schema = ArrowSchema::new(vec![score]);

let merged = self
.projections
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the differnence of vector_schema w/ self.projections? is one a superset of the other?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

=> vector_schema is just the vector and score.
=> self.projections is the user supplied projection (e.g., dataset.to_table(columns=<>, ...), if None then all columns of the dataset is the projection).

Neither is the superset of the other.

Ok(SchemaRef::new(ArrowSchema::from(&merged)))
} else {
Ok(Arc::new(ArrowSchema::from(&self.projections)))
}
}

fn scanner_output_schema(&self) -> Result<Arc<Schema>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this just be Scanner::schema?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the API aspect, Scanner::schema should just return the schema of output. So the contract can be built with other system components.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scanner::schema is the ArrowSchema and should not be used in Lance reader internals because that's why the field id's are messed up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll just make Scanner::schema call the other

let score_schema = ArrowSchema::new(vec![column, score]);
let vector_search_columns = &Schema::try_from(&score_schema)?;
let merged = self.projections.merge(vector_search_columns);
let score_schema = ArrowSchema::new(vec![score]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use self.vector_search_schema()?

Ok(Arc::new(LocalTakeExec::new(
filter_node,
self.dataset.clone(),
Arc::new(self.projections.clone()),
output_schema,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is leaking the abstraction of ann to take?

what is in output_schema nad ann_schema tho? can we just use one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

output_schema includes user supplied projections. ann_schema includes the vector/score columns.

@@ -236,12 +237,15 @@ impl LocalTake {
input: SendableRecordBatchStream,
dataset: Arc<Dataset>,
schema: Arc<Schema>,
ann_schema: Option<Arc<Schema>>, // TODO add input/output schema contract to exec nodes and remove this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File a ticket to remove this leaking abstraction?

Copy link
Contributor

@eddyxu eddyxu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending a new ticket to track the issue of cleaning ann_schema.

@changhiskhan changhiskhan merged commit 9a3364c into main Mar 16, 2023
@changhiskhan changhiskhan deleted the changhiskhan/nearest-filter branch March 16, 2023 05:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants