-
Notifications
You must be signed in to change notification settings - Fork 839
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Minimal MapArray support #491
Conversation
82fec61
to
2d7ddd5
Compare
8be09ae
to
83d8406
Compare
Codecov Report
@@ Coverage Diff @@
## master #491 +/- ##
==========================================
+ Coverage 82.48% 82.52% +0.03%
==========================================
Files 167 168 +1
Lines 46450 47227 +777
==========================================
+ Hits 38315 38973 +658
- Misses 8135 8254 +119
Continue to review full report at Codecov.
|
@alamb @jorgecarleitao this is now ready for review |
@nevi-me I plan to review this over the next few days |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read this whole PR -- and I think it is well written, well commented, and well tested and think it is good to merge in. I had some minor feedback but nothing I think would prevent merging.
Epic work @nevi-me 🏅
Regarding the "experimental API" -- I wonder if we could avoid it. For example, if we just merged this PR after arrow 5.0.0 is released (I hope to cut the branch / make an RC tomorrow) we can freely change the APIs on the master
branch until arrow 6.0.0 is cut and not worry about backwards compatibility
.add_buffer(entry_offsets) | ||
.add_child_data(entry_struct.data().clone()) | ||
.build(); | ||
let map_array = MapArray::from(map_data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
up to here, this method seems almost identical to create_from_buffers
-- perhaps it could be reused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made some slight differences between the two functions to test null values on the one map
mutable.child_data.iter_mut().for_each(|child| { | ||
child.extend( | ||
index, | ||
array.offset() + start, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not use the array.offset()
? I don't understand this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comes from the struct slice fix, I had rebased it into the map support so it wouldn't block my progress. I've rebased now that the PR is merged.
.expect("Failed to read into array!"); | ||
|
||
for batch in record_batch_reader { | ||
batch.unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like it would be valuable to check the contents of the record batches that were read out (in addition to checking they are not errors)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit tricky @alamb, because there's no output data to compare what's read to. The parquet test data that's used by parquet-mr, parquet-cpp and parquet-rs don't have equivalent JSON files like other integration tests.
So, the next best thing has been to read the files, and confirm that the reader doesn't error out.
map_field.data_type() | ||
); | ||
} | ||
} | ||
DataType::FixedSizeList(_, _) => unimplemented!(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would be useful to test in this module as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@alamb I'm working on addressing your feedback, will update this PR in a few hours |
This commit adds the MapArray and MapBuilder. The interfaces are however incomplete at this stage.
A map is a list with some specific rules, so for equality it is the same as a list
Avoids the generic case of list > struct > [ley, value], which adds overhead
@alamb I've addressed your review, PTAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at the last two commits and they look great @nevi-me !
thank you @nevi-me ! |
@alamb, do you think it would be possible to include this in the next 5.x release? I forgot that we never added this PR to active releases, so it hasn't been released yet. |
If this introduces breaking API changes, I think the current policy is we will have to wait for 6.x? |
As @houqp mentions, I think this one can not be included into 5.x without breaking compatibility. Specifically, the introduction of |
Which issue does this PR close?
Related #395 .
Rationale for this change
To add support for the map array :)
What changes are included in this PR?
This adds minimal support for
arrow::array::MapArray
and Parquet map support.Includes
arrow::datatype::Map(Box<Field>, bool /* keys_sorted */
)arrow::array::MapArray
arrow::array::MapArrayBuilder
. Note, this API was just to make progress, so more work is neededarrow::ipc::{reader, writer}
supportarrow::json::reader
supportFuture work
See #538
Are there any user-facing changes?
Yes, new data type, enum, etc. Builds shall break