-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support writing nested lists to parquet #1746
Conversation
This actually has an issue with nulls in struct arrays with non-null slices in child list arrays... Working on fixing... |
94b30f0
to
b50ba82
Compare
b50ba82
to
9ede644
Compare
I think this is now ready for review, going to work on getting a few more esoteric tests, but I think the meat of it is ready |
|
||
#[test] | ||
fn test_calculate_array_levels_twitter_example() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm currently working through porting these tests
Codecov Report
@@ Coverage Diff @@
## master #1746 +/- ##
==========================================
+ Coverage 83.27% 83.46% +0.18%
==========================================
Files 195 196 +1
Lines 55896 55895 -1
==========================================
+ Hits 46549 46653 +104
+ Misses 9347 9242 -105
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That certainly seems to have made the code simpler, as measured by lines ;) Thank you for the efforts @tustvold
I also read all the code carefully. While I can't say I got every line I would feel much better about debugging and maintaining the code after this PR than before.
There appear to be some non trivial changes to the tests and the cases that are covered -- I wonder if this was intentional or if you are still working to port remaining tests?
Also I don't really have a sense for any end to end testing we have (like do we have tests that round trip StructArray
s to and back from parquet? That might be worth writing if we don't have them yet (as a follow on PR??)
cc @nevi-me and @helgikrs in case you have time or interest in reviewing this code
I wonder if @ahmedriza has time to try his reproducer out with the code in this PR?
parquet/src/arrow/arrow_writer.rs
Outdated
let indices = levels.filter_array_indices(); | ||
// Slice array according to computed offset and length | ||
let column = column.slice(levels.offset, levels.length); | ||
// TODO: Avoid filtering if no need |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be a follow on ticket to track? It doesn't seem like this PR is any worse in the delpartment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually a hold over from an intermediate version of this PR, you can only avoid doing the "take" operation if there are no-nulls which is relatively rare, will remove
/// i.e. a leaf array with no children | ||
fn is_leaf(data_type: &DataType) -> bool { | ||
matches!( | ||
data_type, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this would be more future proof if it were !matches!(DataType::Dictionary....
) -- aka explicitly list out the types that are handled in the rest of this module rather than trying to enumerate the converse.
I am just thinking about what would happen if someone added a new non leaf type to arrow's DataType
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow? The intent is that we can match just dictionaries where the value type is a leaf, as this can be handled transparently. We don't support complex dictionary value types (and for what it is worth neither does Arrow C++). We used to simply assume that dictionary types were primitive, this is completing a TODO I encountered in the code - https://github.com/apache/arrow-rs/pull/1746/files/b104aba76a6868e4296eb0f5a7b4fd0b8960eb62#diff-4b715628e2e0ae6f66e590227d9587cd5f2155055a59ad6b7b0dc7b1914ad8edL316
Edit: Oh I see what you're saying, I think if a new arrow type were added we would likely need additional work in the parquet writer to support it, and so I think not automatically supporting it is probably safer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit: Oh I see what you're saying, I think if a new arrow type were added we would likely need additional work in the parquet writer to support it, and so I think not automatically supporting it is probably safer.
Yes, thank you -- this is what I was trying to say 💯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't support complex dictionary value types (and for what it is worth neither does Arrow C++)
I didn't realize that. TIL!
/// A helper to construct [`LevelInfo`] from a potentially nested [`Field`] | ||
enum LevelInfoBuilder { | ||
Primitive(LevelInfo), | ||
List(Box<LevelInfoBuilder>, LevelContext), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might help to add some docstrings here about what the Box<LevelInfoBuilder>
and Vec<LevelBuilder>
represent
parquet/src/arrow/levels.rs
Outdated
for _ in 0..len { | ||
def_levels.push(ctx.def_level - 1); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this would more "idomatic" if it were written like:
def_levels
.extend(
std::iter::repeat(ctx.def_level - 1)
.take(len)
)
? Though I admit the current code seems more readable 👍
parquet/src/arrow/levels.rs
Outdated
let mut last_non_null_idx = None; | ||
let mut last_null_idx = None; | ||
|
||
// TODO: BitChunkIterator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a TODO for this PR or for a follow on?
}; | ||
let array_offsets = vec![0, 2, 2, 4, 8, 11]; | ||
let array_mask = vec![true, false, true, true, true]; | ||
let leaf_array = Int32Array::from_iter([0, 0, 2, 2, 3, 3, 3, 3, 4, 4, 4]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we seem to have lost a substantial number of verifications in this test. Is that intended?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The diff is being unhelpful, once I've ported the remaining tests it should sort itself out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will check it again when you are done porting the test
parquet/src/arrow/levels.rs
Outdated
offset: 0, | ||
length: 5, | ||
fn test_struct_mask_list() { | ||
// Test a struct array masking a list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what "masking a list" means in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reworded to hopefully make it clearer
This is fantastic. Checked my test cases against this PR and they work perfectly now. Also checked a few other combinations and they are all handled correctly now.
Amazing work. Thanks for the quick fix. |
// 2 3 [4] are 0 | ||
// 4 5 6 7 [8] are 1 (defined at level 1 only) | ||
// 8 9 10 [11] are 2 (defined at both levels) | ||
definition: vec![0, 0, 1, 0, 0, 3, 3, 3, 3, 3, 3, 3], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is actually wrong, the levels are actually incoherent. This can be clearly seen from the first two levels.
The definition levels are [0, 0]
implying it is null at the root for both, and yet it has repetition levels of [0, 1]
which suggests a non-null slice of two elements, these are clearly not compatible.
I suspect the old code is not correctly handling a struct array masking a non-empty list slice of a child, and is producing a level for each entry in the child slice, instead of a single level for the struct's null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I was trying to express here was the list's values even though the struct is null at that point. Hence you see d: [0, 0], r: [0, 1]
to say that the value has 2 slots, which are all null due to the struct being null at that point.
Perhaps my oversight was not considering that this might not matter at all, and it's safer/better to do d: [0], r: [0]
.
What limited me a lot was my indexing logic, because given an child array of 11 values, producing fewer than 11 repetition levels would lose the offset information from the Arrow side.
Looking at your solution, you have 10 values vs my 12 (11 values and an extra value to encode the offset [2, 2]
in the list). When slicing into the child list doesn't cause issues, your solution is fine/better.
I suppose the root problem is Arrow's logical null rules. You could have a 1GB record with all sorts of values, but if you wrap that against a null struct (e.g. struct<all the values>
), then you get different behaviour:
- IPC roundtrip would preserve the values
- Writing to and reading from Parquet would drop all those null values.
I was trying to preserve data, but maybe while looking at the minor details, I ended up missing the bigger picture.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, handling structs masking lists tripped me up the first time also, I had to completely rewrite my rewrite to support it 😆
I think the important property is what you read out is "logically-equal" to what you wrote in, even if it doesn't have the exact same data in the backing buffers. StructArrays are one example where this would occur, but a sliced array would be another example where the underlying arrow buffers are not preserved. This is in fact critical to how the ArrowWriter batches row groups by slicing arrays.
IPC roundtrip would preserve the values
TBH I view this as a potentially undesirable implementation quirk than a desirable property, see #208.
I suppose the root problem is Arrow's logical null rules
I think you could be more general and say arrow has a relaxed approach to the backing representation of an array. This allows it to avoid copies, but comes at the cost of things like the never-ending bug-source that is ArrayData::offset() 😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, that's true. We also had a lot more logical-null bugs (esp slicing) at the time that I wrote the level logic. So it was harder to avoid the undesirable quirks at the time.
If one will have to materialise data, they might as well remove data that's been made null due to masking, so the new approach is more desirable.
// 3: [[108, 109], [110, 111], [112, 113], [114, 115]] | ||
// 4: [[116, 117], [118, 119], [120, 121]] | ||
definition: vec![ | ||
0, 0, 0, 0, 1, 0, 0, 0, 0, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, these definition levels are wrong. In the structure above the only non-max definition level will come from the empty slice.
Ok I think I've ported all the tests, there are some omissions because we no longer compute the levels for non-leaf arrays (whatever that actually means), and |
🎉 arrow 15 is going to be a great release |
I was still going through it, looks great, I hadn't seen anything that stood out, so I'm happy with the merge |
Oh sorry, please let me know if you see anything and I'll be more than happy to address in a follow on PR 😅 |
I only added a comment explaining why the def/rep for |
Worth advertising @tustvold @alamb is at least 30% improvement in the writer due to this PR. At least that's on an M1 Pro (now I'm curious about x86) write_batch primitive/4096 values primitive
time: [1.1237 ms 1.1270 ms 1.1303 ms]
thrpt: [156.08 MiB/s 156.54 MiB/s 156.99 MiB/s]
change:
time: [-29.494% -28.991% -28.501%] (p = 0.00 < 0.05)
thrpt: [+39.863% +40.828% +41.831%]
Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
1 (1.00%) high mild
4 (4.00%) high severe
write_batch primitive/4096 values primitive non-null
time: [979.35 us 982.76 us 986.33 us]
thrpt: [178.86 MiB/s 179.51 MiB/s 180.13 MiB/s]
change:
time: [-30.346% -29.722% -29.002%] (p = 0.00 < 0.05)
thrpt: [+40.849% +42.291% +43.567%]
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
3 (3.00%) high mild
4 (4.00%) high severe
write_batch primitive/4096 values bool
time: [62.749 us 63.655 us 64.860 us]
thrpt: [17.527 MiB/s 17.859 MiB/s 18.116 MiB/s]
change:
time: [-33.845% -32.730% -31.281%] (p = 0.00 < 0.05)
thrpt: [+45.520% +48.655% +51.159%]
Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe
write_batch primitive/4096 values bool non-null
time: [40.799 us 40.848 us 40.901 us]
thrpt: [27.793 MiB/s 27.830 MiB/s 27.863 MiB/s]
change:
time: [-45.958% -45.620% -45.294%] (p = 0.00 < 0.05)
thrpt: [+82.796% +83.891% +85.043%]
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
4 (4.00%) high mild
6 (6.00%) high severe
write_batch primitive/4096 values string
time: [627.92 us 634.23 us 642.61 us]
thrpt: [123.85 MiB/s 125.49 MiB/s 126.75 MiB/s]
change:
time: [-23.875% -23.318% -22.771%] (p = 0.00 < 0.05)
thrpt: [+29.486% +30.408% +31.362%]
Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
8 (8.00%) high mild
3 (3.00%) high severe
write_batch primitive/4096 values string non-null
time: [617.66 us 618.60 us 619.56 us]
thrpt: [128.46 MiB/s 128.66 MiB/s 128.86 MiB/s]
change:
time: [-26.558% -25.760% -25.150%] (p = 0.00 < 0.05)
thrpt: [+33.601% +34.699% +36.162%]
Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low mild
2 (2.00%) high mild
3 (3.00%) high severe
Benchmarking write_batch nested/4096 values primitive list: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, enable flat sampling, or reduce sample count to 60.
write_batch nested/4096 values primitive list
time: [1.2680 ms 1.2695 ms 1.2709 ms]
thrpt: [128.87 MiB/s 129.01 MiB/s 129.16 MiB/s]
change:
time: [-33.645% -33.420% -33.197%] (p = 0.00 < 0.05)
thrpt: [+49.694% +50.195% +50.705%]
Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
1 (1.00%) low severe
6 (6.00%) low mild
2 (2.00%) high mild
5 (5.00%) high severe
Benchmarking write_batch nested/4096 values primitive list non-null: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.2s, enable flat sampling, or reduce sample count to 50.
write_batch nested/4096 values primitive list non-null
time: [1.6240 ms 1.6347 ms 1.6475 ms]
thrpt: [117.49 MiB/s 118.41 MiB/s 119.19 MiB/s]
change:
time: [-29.010% -28.463% -27.890%] (p = 0.00 < 0.05)
thrpt: [+38.676% +39.788% +40.866%]
Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
8 (8.00%) high mild
5 (5.00%) high severe |
amd64, 3900x on a pcie4 SSD, less dramatic but still great Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.0s, enable flat sampling, or reduce sample count to 50.
write_batch primitive/4096 values primitive
time: [1.6179 ms 1.6298 ms 1.6462 ms]
thrpt: [107.17 MiB/s 108.24 MiB/s 109.04 MiB/s]
change:
time: [-31.768% -19.351% -7.9981%] (p = 0.00 < 0.05)
thrpt: [+8.6934% +23.994% +46.559%]
Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) high mild
3 (3.00%) high severe
Benchmarking write_batch primitive/4096 values primitive non-null: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.1s, enable flat sampling, or reduce sample count to 50.
write_batch primitive/4096 values primitive non-null
time: [1.3544 ms 1.3650 ms 1.3777 ms]
thrpt: [128.05 MiB/s 129.25 MiB/s 130.25 MiB/s]
change:
time: [-20.261% -14.076% -10.278%] (p = 0.00 < 0.05)
thrpt: [+11.455% +16.382% +25.409%]
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
4 (4.00%) high mild
6 (6.00%) high severe
write_batch primitive/4096 values bool
time: [98.617 us 99.677 us 101.01 us]
thrpt: [11.254 MiB/s 11.405 MiB/s 11.527 MiB/s]
change:
time: [-42.532% -30.107% -17.429%] (p = 0.00 < 0.05)
thrpt: [+21.108% +43.076% +74.009%]
Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
2 (2.00%) high mild
8 (8.00%) high severe
write_batch primitive/4096 values bool non-null
time: [51.438 us 51.798 us 52.269 us]
thrpt: [21.749 MiB/s 21.946 MiB/s 22.100 MiB/s]
change:
time: [-36.532% -34.935% -33.116%] (p = 0.00 < 0.05)
thrpt: [+49.513% +53.693% +57.560%]
Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
3 (3.00%) high mild
5 (5.00%) high severe
write_batch primitive/4096 values string
time: [771.00 us 790.87 us 832.66 us]
thrpt: [95.586 MiB/s 100.64 MiB/s 103.23 MiB/s]
change:
time: [-8.6599% -2.6996% +8.1085%] (p = 0.73 > 0.05)
thrpt: [-7.5003% +2.7745% +9.4810%]
No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) high mild
6 (6.00%) high severe
write_batch primitive/4096 values string non-null
time: [780.70 us 785.66 us 793.50 us]
thrpt: [100.30 MiB/s 101.30 MiB/s 101.95 MiB/s]
change:
time: [-11.689% -9.2061% -6.3254%] (p = 0.00 < 0.05)
thrpt: [+6.7525% +10.140% +13.236%]
Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
8 (8.00%) high mild
3 (3.00%) high severe
Benchmarking write_batch nested/4096 values primitive list: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.0s, enable flat sampling, or reduce sample count to 50.
write_batch nested/4096 values primitive list
time: [1.5257 ms 1.5327 ms 1.5410 ms]
thrpt: [106.28 MiB/s 106.86 MiB/s 107.35 MiB/s]
change:
time: [-18.698% -10.026% +0.4958%] (p = 0.04 < 0.05)
thrpt: [-0.4934% +11.143% +22.998%]
Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
4 (4.00%) high mild
8 (8.00%) high severe
write_batch nested/4096 values primitive list non-null
time: [1.9924 ms 2.0106 ms 2.0334 ms]
thrpt: [95.194 MiB/s 96.274 MiB/s 97.150 MiB/s]
change:
time: [-11.857% -10.947% -9.8362%] (p = 0.00 < 0.05)
thrpt: [+10.909% +12.292% +13.452%]
Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) high mild
6 (6.00%) high severe |
Which issue does this PR close?
Closes #1744.
Rationale for this change
The existing logic to handle nested types was incredibly convoluted, performed a lot of unnecessary work when writing non-nested types, and it was unclear how it could be made to handle more than one layer of nested lists.
What changes are included in this PR?
This PR therefore reworks the levels computation logic to hopefully be easier to understand, and support arbitrary nesting of lists.
Are there any user-facing changes?
No