Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow engine data cleanup #153

Merged
merged 3 commits into from
Mar 27, 2024
Merged

Conversation

nicklan
Copy link
Collaborator

@nicklan nicklan commented Mar 27, 2024

This PR does a couple of small things:

  • Verifies types on Lists and Maps in the EngineData impl for Arrow and returns errors if it's not all Strings
  • Supports LargeList types (offsets stored as i64. probably not needed, but trivial to add)
  • Moves the GetData impls for ArrowEngineData into the src/client dir, since that's not specific to the sync client. Previous code would actually have failed to build if you tried to build only the default client without the sync one.
  • moves the try_create_from_json and try_create_from_parquet functions into the sync client files where they are used, since they are specific to that.

closes #121


use crate::client::arrow_data::ArrowEngineData;

pub(crate) struct SyncJsonHandler;

fn try_create_from_json(schema: SchemaRef, location: Url) -> DeltaResult<ArrowEngineData> {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just moved from arrow_data.rs

};

pub(crate) struct SyncParquetHandler;

fn try_create_from_parquet(schema: SchemaRef, location: Url) -> DeltaResult<ArrowEngineData> {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just moved from arrow_data.rs

@nicklan nicklan requested a review from hntd187 March 27, 2024 18:14
Copy link
Collaborator

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, just a few nits.

Comment on lines +274 to +279
if field.data_type() != &ArrowDataType::Utf8 {
return Err(Error::UnexpectedColumnType(format!(
"On {}: Only support maps of String->String",
field.name()
)));
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside: Looks like another set of use cases for #148 require! macro...

Comment on lines 9 to 14
client::{
arrow_data::ArrowEngineData,
arrow_utils::{generate_mask, get_requested_indices, reorder_record_batch},
},
schema::SchemaRef,
DeltaResult, Error, Expression, FileDataReadResultIterator, FileMeta, ParquetHandler,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: These deeply nested use {} blocks aren't super readable IMO...

    use crate::client::arrow_data::ArrowEngineData;
    use crate::client::arrow_utils::{generate_mask, get_requested_indices, reorder_record_batch};
    use crate::schema::SchemaRef;
    use crate::{DeltaResult, Error, Expression, FileDataReadResultIterator, FileMeta, ParquetHandler};

let mut reader = builder.build()?;
let data = reader
.next()
.ok_or(Error::generic("No data found reading parquet file"))?;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: avoid the cost of constructing the error string?

Suggested change
.ok_or(Error::generic("No data found reading parquet file"))?;
.ok_or_else(|| Error::generic("No data found reading parquet file"))?;

(several others, we should probably audit the code and fix them all)

(not sure if there's a way to make clippy warn about it these?)

Comment on lines 24 to 63
schema.clone(),
location,
);
let d = try_create_from_parquet(schema.clone(), location);
d.map(|d| Box::new(d) as _)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: fits comfortably on a single line?

            try_create_from_parquet(schema.clone(), location).map(|d| Box::new(d) as _)

kernel/src/client/sync/parquet.rs Show resolved Hide resolved
let file = File::open(
location
.to_file_path()
.map_err(|_| Error::generic("can only read local files"))?,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside: Wow Url::to_file_path... I've never seen Err(()) before

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah, I know @hntd187 LOVES this error type :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, Nick has heard a good amount of my strong dislike of this error signature, I had hoped they would at least move to something like Infallible or ! then it comes to stable, but yeah, still ()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think Err(()) is Infallible? It just means the error Result carries no information other than "I'm an error"?

Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Collaborator

@hntd187 hntd187 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGMT

@nicklan nicklan merged commit 2df263d into delta-io:main Mar 27, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

verify that lists and maps contain the expected types
4 participants