-
Notifications
You must be signed in to change notification settings - Fork 847
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Correct array memory usage calculation for dictionary arrays #505
Conversation
There appears to be some failures in the test suite https://github.com/apache/arrow-rs/pull/505/checks?check_run_id=2934625537 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks like a great change @jhorstmann -- and the tests are 👨🍳 ❤️
Fixing bugs by deleting code is the mark of a great engineer in my opinion.
While reviewing the code, the only thing that stands out to me is the calculation of array memory size may still be inaccurate (we can fix as a follow on issue / PR too)
// Calculate size of the fields that don't have [get_array_memory_size] method internally.
size += mem::size_of_val(self)
- mem::size_of_val(&self.buffers) // <--- this is a Vec so it doesn't have an `array_memory_size` so it probably should not be included
- mem::size_of_val(&self.null_bitmap)
- mem::size_of_val(&self.child_data);
|
||
// substract empty array to avoid magic numbers for the size of additional fields | ||
assert_eq!( | ||
arr.get_array_memory_size() - empty.get_array_memory_size(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a cool calculation 👍
That is what I get when not running the full test suite after a seemingly small change :) While looking into it I also noticed the problem you mentioned with fields that don't have [get_array_memory_size] themselves, the overhead of these fields (ptr, len, capacity) does not seem to be accounted for. |
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
let mut size = 0; | ||
// Calculate size of the fields that don't have [get_array_memory_size] method internally. | ||
size += mem::size_of_val(self) | ||
- mem::size_of_val(&self.buffers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since child_data
and null_bitmap
include the size of self
in the results of bitmap.get_array_memory_size()
and child.get_array_memory_size()
I think we still need to subtract them off.
Perhaps a pattern such as edited
if let Some(bitmap) = &self.null_bitmap {
size += bitmap.get_array_memory_size()
size -= mem::size_of_val(&bitmap);
}
would make the intent clearer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I think this applies only to the null_bitmap since it is embedded directly without a heap allocation in between, but have to think about this a bit more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be fixed now, I added one more assertion for this case:
// expected size is the size of the PrimitiveArray struct,
// which includes the optional validity buffer
// plus one buffer on the heap
assert_eq!(
std::mem::size_of::<PrimitiveArray<Int64Type>>() + std::mem::size_of::<Buffer>(),
empty_with_bitmap.get_array_memory_size()
);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
embedded directly without a heap allocation in between.
I double checked too and I agree it now looks good. This whole notion is confusing -- I think the tests you have added will really help in the future too
Codecov Report
@@ Coverage Diff @@
## master #505 +/- ##
==========================================
+ Coverage 82.64% 82.74% +0.09%
==========================================
Files 165 165
Lines 45703 45686 -17
==========================================
+ Hits 37773 37804 +31
+ Misses 7930 7882 -48
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thank you @jhorstmann
@jorgecarleitao / @nevi-me / @Dandandan any thoughts?
let mut size = 0; | ||
// Calculate size of the fields that don't have [get_array_memory_size] method internally. | ||
size += mem::size_of_val(self) | ||
- mem::size_of_val(&self.buffers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
embedded directly without a heap allocation in between.
I double checked too and I agree it now looks good. This whole notion is confusing -- I think the tests you have added will really help in the future too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Great simplification.
* Correct array memory usage calculation for dictionary arrays * update comments Co-authored-by: Andrew Lamb <[email protected]> * update comments Co-authored-by: Andrew Lamb <[email protected]> * Adjust memory usage calculation and move related tests to the same file * Comment about bitmap size * Clippy fixes * Adjust calculation for validity bitmap Co-authored-by: Andrew Lamb <[email protected]>
…515) * Correct array memory usage calculation for dictionary arrays * update comments Co-authored-by: Andrew Lamb <[email protected]> * update comments Co-authored-by: Andrew Lamb <[email protected]> * Adjust memory usage calculation and move related tests to the same file * Comment about bitmap size * Clippy fixes * Adjust calculation for validity bitmap Co-authored-by: Andrew Lamb <[email protected]> Co-authored-by: Jörn Horstmann <[email protected]>
Which issue does this PR close?
Closes #503.
Rationale for this change
Some Array implementations have fields for easier access to nested arrays. These point to the same data already contained in ArrayData and so should not be considered to increase the memory usage. Since all the buffers are contained in ArrayData we can instead delegate the whole calculation and remove the specialized implementations.
What changes are included in this PR?
Fixes overreporting of memory usage in DictionaryArray, FixedSizeListArray and UnionArray and simplified the code.
Are there any user-facing changes?
get_array_memory_size
returns slightly different but more correct results