-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Rust] Lance File Writer #419
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions
rust/src/datatypes.rs
Outdated
.iter() | ||
.map(|c| c.max_id()) | ||
.max() | ||
.unwrap_or(i32::MIN), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nbd but i thought I saw elsewhere -1
was the place holder? Is it too cumbersome to use Option<u32>
as field/parent id?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, right, good catch. fixed.
rust/src/datatypes.rs
Outdated
@@ -420,8 +412,8 @@ impl TryFrom<&ArrowField> for Field { | |||
name: field.name().clone(), | |||
logical_type: LogicalType::try_from(field.data_type())?, | |||
encoding: match field.data_type() { | |||
dt if is_numeric(dt) => Some(Encoding::Plain), | |||
dt if is_binary(dt) => Some(Encoding::VarBinary), | |||
dt if dt.is_numeric() || matches!(dt, DataType::Boolean) => Some(Encoding::Plain), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe move is_fixed_stride
from datatypes to arrow.rs and use that here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
@@ -29,13 +29,13 @@ pub mod object_reader; | |||
pub mod object_store; | |||
pub mod object_writer; | |||
pub mod reader; | |||
mod writer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be pub mod?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was thining just re-export writer/reader via io, and make writer/reader to be private.
it takes quite some typing to import lance::io::reader::FileReader
? wdyt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fine by me. was just curious why reader and writer were different
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
was thinking to make reader
the same as writer
later.
cursor: usize, | ||
} | ||
|
||
impl<'a> ObjectWriter<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
curious why you got rid of all of the lifetimes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems easier to "create an object" with the internal reference as copy and pass around .
could revisit it if the overhead is too big.
let num_columns = self.pages.keys().max().unwrap() + 1; | ||
let num_batches = self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
page_table.num_columns and page_table.num_batches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure wdym?
PageTable does not have these two fields.
#[derive(Debug, Default)] | ||
pub struct PageTable { | ||
pages: HashMap<i32, HashMap<i32, PageInfo>>, | ||
pages: BTreeMap<i32, BTreeMap<i32, PageInfo>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment here to clarify this is { field_id -> { batch_id -> page_info}} ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pub async fn write(&mut self, batch: &RecordBatch) -> Result<()> { | ||
for field in &self.schema.fields { | ||
let column_id = batch.schema().index_of(&field.name)?; | ||
let array = batch.column(column_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we just implement RecordBatch.column(&str)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean extend it? sure, why not :)
maybe we should have a lance::preclude::*
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lemme do it in the follow up PRs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some CI error due to cargo fmt
No description provided.