Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Rust] Lance File Writer #419

Merged
merged 20 commits into from
Jan 9, 2023
Merged

[Rust] Lance File Writer #419

merged 20 commits into from
Jan 9, 2023

Conversation

eddyxu
Copy link
Contributor

@eddyxu eddyxu commented Jan 8, 2023

No description provided.

@eddyxu eddyxu requested a review from changhiskhan January 8, 2023 23:44
@eddyxu eddyxu marked this pull request as ready for review January 8, 2023 23:44
Copy link
Contributor

@changhiskhan changhiskhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions

.iter()
.map(|c| c.max_id())
.max()
.unwrap_or(i32::MIN),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nbd but i thought I saw elsewhere -1 was the place holder? Is it too cumbersome to use Option<u32> as field/parent id?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right, good catch. fixed.

@@ -420,8 +412,8 @@ impl TryFrom<&ArrowField> for Field {
name: field.name().clone(),
logical_type: LogicalType::try_from(field.data_type())?,
encoding: match field.data_type() {
dt if is_numeric(dt) => Some(Encoding::Plain),
dt if is_binary(dt) => Some(Encoding::VarBinary),
dt if dt.is_numeric() || matches!(dt, DataType::Boolean) => Some(Encoding::Plain),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe move is_fixed_stride from datatypes to arrow.rs and use that here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@@ -29,13 +29,13 @@ pub mod object_reader;
pub mod object_store;
pub mod object_writer;
pub mod reader;
mod writer;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be pub mod?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was thining just re-export writer/reader via io, and make writer/reader to be private.

it takes quite some typing to import lance::io::reader::FileReader ? wdyt

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fine by me. was just curious why reader and writer were different

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was thinking to make reader the same as writer later.

cursor: usize,
}

impl<'a> ObjectWriter<'a> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious why you got rid of all of the lifetimes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems easier to "create an object" with the internal reference as copy and pass around .

could revisit it if the overhead is too big.

Comment on lines +88 to +89
let num_columns = self.pages.keys().max().unwrap() + 1;
let num_batches = self
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

page_table.num_columns and page_table.num_batches?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure wdym?

PageTable does not have these two fields.

#[derive(Debug, Default)]
pub struct PageTable {
pages: HashMap<i32, HashMap<i32, PageInfo>>,
pages: BTreeMap<i32, BTreeMap<i32, PageInfo>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment here to clarify this is { field_id -> { batch_id -> page_info}} ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

pub async fn write(&mut self, batch: &RecordBatch) -> Result<()> {
for field in &self.schema.fields {
let column_id = batch.schema().index_of(&field.name)?;
let array = batch.column(column_id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we just implement RecordBatch.column(&str) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean extend it? sure, why not :)

maybe we should have a lance::preclude::*

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lemme do it in the follow up PRs?

Copy link
Contributor

@changhiskhan changhiskhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some CI error due to cargo fmt

@eddyxu eddyxu merged commit e17485d into main Jan 9, 2023
@eddyxu eddyxu deleted the lei/writer branch January 9, 2023 01:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants