-
Notifications
You must be signed in to change notification settings - Fork 251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Missing column when both nearest
and filter
are applied
#686
Conversation
rust/src/dataset/scanner.rs
Outdated
let score_schema = ArrowSchema::new(vec![score]); | ||
|
||
let merged = self | ||
.projections |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the differnence of vector_schema
w/ self.projections
? is one a superset of the other?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
=> vector_schema is just the vector and score.
=> self.projections is the user supplied projection (e.g., dataset.to_table(columns=<>, ...), if None then all columns of the dataset is the projection).
Neither is the superset of the other.
Ok(SchemaRef::new(ArrowSchema::from(&merged))) | ||
} else { | ||
Ok(Arc::new(ArrowSchema::from(&self.projections))) | ||
} | ||
} | ||
|
||
fn scanner_output_schema(&self) -> Result<Arc<Schema>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this just be Scanner::schema
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the API aspect, Scanner::schema
should just return the schema of output. So the contract can be built with other system components.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scanner::schema is the ArrowSchema and should not be used in Lance reader internals because that's why the field id's are messed up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll just make Scanner::schema call the other
rust/src/dataset/scanner.rs
Outdated
let score_schema = ArrowSchema::new(vec![column, score]); | ||
let vector_search_columns = &Schema::try_from(&score_schema)?; | ||
let merged = self.projections.merge(vector_search_columns); | ||
let score_schema = ArrowSchema::new(vec![score]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use self.vector_search_schema()
?
Ok(Arc::new(LocalTakeExec::new( | ||
filter_node, | ||
self.dataset.clone(), | ||
Arc::new(self.projections.clone()), | ||
output_schema, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is leaking the abstraction of ann to take?
what is in output_schema nad ann_schema
tho? can we just use one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
output_schema includes user supplied projections. ann_schema includes the vector/score columns.
@@ -236,12 +237,15 @@ impl LocalTake { | |||
input: SendableRecordBatchStream, | |||
dataset: Arc<Dataset>, | |||
schema: Arc<Schema>, | |||
ann_schema: Option<Arc<Schema>>, // TODO add input/output schema contract to exec nodes and remove this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
File a ticket to remove this leaking abstraction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pending a new ticket to track the issue of cleaning ann_schema
.
Addresses #685
@eddyxu the fix is hacky due to the lack of formal contracts on the io exec nodes. That's why you see the weird
ann_schema
variable being passed to the local take node. Happy to walk through the logic with you over zoom if that's quicker.