-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor size estimation of Hashset into a function #8764
Comments
This would be a nice way to improve the datafusion codebase without having to know too much about it. |
I'd like to work on this as a good start. :) |
Thank you @yyy1000 ! |
@alamb this still seems to be unresolved? |
Yes, I tried to help it but failed.🥲 |
@alamb @yyy1001 I think the main issue here is the point of time when the collection (HashSet/HashTable) is created and the size estimation is done or needed. In In order to make it reusable we'd need to introduce some params; and unfortunately cannot go full generic (?): Here is an idea / outline: fn estimate_memory_size<T>(num_elements: usize, fixed_size: usize) -> usize {
let estimated_buckets =
(num_elements.checked_mul(8).unwrap_or(usize::MAX) / 7).next_power_of_two();
// fixed size part of memory allocation
// 1 byte per number of bucket and the size of the
// collection (HashSet, HashTable); if used before collection is created
// we can only estimate the fixed size part of the collection.
let fix = fixed_size + estimated_buckets;
// variable size part of memory allocation
// size of entry type multiplied with the number of estimated buckets.
let var = std::mem::size_of::<T>() * estimated_buckets;
fix + var
} Example usage in fn size(&self) -> usize {
let fixed_size =
std::mem::size_of_val(self) + std::mem::size_of_val(&self.values);
estimate_memory_size::<T::Native>(self.values.len(), fixed_size)
} Example usage in let fixed_size = std::mem::size_of::<JoinHashMap>();
let estimated = estimate_memory_size::<(u64,u64)>(num_rows, fixed_size); @alamb let me know what you think. Although, this might not be the best abstraction (since we need to provide the fixed_size) it might still have its benefits - in terms of consistency, maintainabiltiy, and testability? If you think its worth it, I'd proceed with implementing it in |
Thank you @marvinlanhenke -- I think the idea is valuable
I think a large benefit would be the documentation about what the parameters meant (given this is so triky)
Thanks! I recommend https://github.com/apache/datafusion/tree/main/datafusion/common/src/utils (perhaps https://github.com/apache/datafusion/blob/main/datafusion/common/src/utils/proxy.rs but rename |
...guess I can get to this after the weekend. |
Also plz let me know what I could help you from the original PR if you have similar issue 😊. |
...while working on this I noticed another issue in the current impl /count_distinct/native.rs Although we perform a checked_mul when estimating buckets; we unwrap with usize::MAX which leads to a high number of estimated buckets. Later we perform another multiplication with the size of the entry type which can also overflow. I think returning and error and changing the signature to I'm not sure we want to do this change for an edge-case like this? The other option would be to cap the estimation at usize::MAX or simply panic? fn size(&self) -> usize {
let estimated_buckets = (self.values.len().checked_mul(8).unwrap_or(usize::MAX)
/ 7)
.next_power_of_two();
// This can still overflow
std::mem::size_of_val(self)
+ std::mem::size_of::<T::Native>() * estimated_buckets
+ estimated_buckets
+ std::mem::size_of_val(&self.values)
} |
Originally posted by @alamb in #8721 (comment)
The text was updated successfully, but these errors were encountered: