-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
generate parquet schema from rust struct #539
Conversation
Might be of interest to @xrl and @bryantbiggs as you were the original authors |
Codecov Report
@@ Coverage Diff @@
## master #539 +/- ##
==========================================
- Coverage 82.60% 82.45% -0.15%
==========================================
Files 167 167
Lines 45984 46084 +100
==========================================
+ Hits 37984 37999 +15
- Misses 8000 8085 +85
Continue to review full report at Codecov.
|
nice!!! 🎉 |
aa6b1b0
to
e13b9e7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like a nice improvement to me.
@@ -69,7 +64,7 @@ mod parquet_field; | |||
/// } | |||
/// ]; | |||
/// | |||
/// let schema = Arc::new(parse_message_type(schema_str).unwrap()); | |||
/// let schema = samples.as_slice().schema(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be cool to avoid having to use as_slice()
but this PR seems significantly better than previously, so 👍 for me. We can make it even better in the future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I didn't like it too, but RecordWriter
is implemented on &[Struct], so this was the least disruptive change to the trait I could make
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we could implement it for anything that implemented IntoIterator<Item=P>
where P: AsRef<Struct>
or whatever
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try that out in future PRs. The code that gets generated iterates through the slice n
times, where n
is the number of fields. It's not a true record writer in the parquet sense, but it uses the column writer internally.
I think we'll be able to improve on it when I work on nested type support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate the extra attention to this code. My team relies on the derive macro to translate simple structs to parquet, which we do very often in our data pipeline. I'd be happy to test out future, more comprehensive versions. I applaud you effort, I was too intimidated by the record writer to keep going!
use parquet::basic::*; | ||
|
||
let mut fields: Vec<TypePtr> = Vec::new(); | ||
#( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is pretty cool 🧙 👍
@@ -174,6 +174,50 @@ impl Field { | |||
} | |||
} | |||
|
|||
pub fn parquet_type(&self) -> proc_macro2::TokenStream { | |||
// TODO: Support group types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should add notes about these TODO's in the docstrings (aka "group types are not yet supported") ?
@@ -57,27 +64,32 @@ mod tests { | |||
#[test] | |||
fn test_parquet_derive_hello() { | |||
let file = get_temp_file("test_parquet_derive_hello", &[]); | |||
let schema_str = "message schema { | |||
|
|||
// The schema is not required, but this tests that the generated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
let schema = Arc::new(parse_message_type(schema_str).unwrap()); | ||
let generated_schema = drs.as_slice().schema().unwrap(); | ||
|
||
assert_eq!(&schema, &generated_schema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
I plan to merge this later today (will be included in 5.0) unless anyone else wants more time to review |
Which issue does this PR close?
None
Rationale for this change
Users of
parquet_derive
currently have to write the schema of their Rust struct by hand.This is inconvenient, and can be generated for them.
What changes are included in this PR?
Adds
parquet::record::RecordWriter::schema()
trait member, and an implementation inparquet_derive
to generate a schema for the user.Are there any user-facing changes?
Yes, new API. The breakage is probably irrelevant as I don't think that there are many users of
parquet_derive
.