Refactor size estimation of Hashset into a function #8764

Closed
alamb opened this issue Jan 5, 2024 · 12 comments · Fixed by #10748
Labels
good first issue Good for newcomers

Comments

@alamb
Contributor

alamb commented Jan 5, 2024

          maybe we can (as a follow-on PR) put this logic into its own function (with comments), as estimating the size of hashbrown hash tables is likely to come up again

Originally posted by @alamb in #8721 (comment)

@alamb alamb added the good first issue Good for newcomers label Jan 5, 2024
@alamb
Contributor Author

alamb commented Jan 5, 2024

This would be a nice way to improve the datafusion codebase without having to know too much about it.

@yyy1000
Contributor

yyy1000 commented Jan 7, 2024

I'd like to work on this as a good start. :)

@alamb
Contributor Author

alamb commented Jan 7, 2024

Thank you @yyy1000 !

@yyy1000
Contributor

yyy1000 commented Jan 7, 2024

Hi, @alamb
I created #8779, which should close this issue.
I'm pretty new to this project and would appreciate any review and requested changes!

@alamb
Contributor Author

alamb commented Jan 8, 2024

Thanks @yyy1000 -- I plan to review #8779 later today

@marvinlanhenke
Contributor

@alamb this still seems to be unresolved?

@yyy1000
Contributor

yyy1000 commented May 30, 2024

@alamb this still seems to be unresolved?

Yes, I tried to tackle it but didn't succeed. 🥲
Feel free to pick it up.

@marvinlanhenke
Contributor

marvinlanhenke commented May 31, 2024

@alamb @yyy1000
I've taken a look at this, and I agree with the comments @yyy1000 made on the original PR here and here.

I think the main issue is the point in time when the collection (HashSet/HashTable) is created relative to when the size estimation is done or needed. In count_distinct/native.rs we already have access to the collection when performing the estimation, whereas in hash_join.rs the estimation is done prior to creation. This also leads to the difference between using std::mem::size_of and std::mem::size_of_val().
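
For illustration, a minimal, self-contained sketch of that distinction (the types here are just placeholders): std::mem::size_of::<T>() needs only the type, so it works before the collection exists, while std::mem::size_of_val() needs an instance; both report only the shallow struct size, not the heap-allocated buckets.

use std::collections::HashSet;
use std::mem::{size_of, size_of_val};

fn main() {
    // before creation: only the type is known (the hash_join.rs situation)
    let shallow_by_type = size_of::<HashSet<u64>>();

    // after creation: an instance is available (the count_distinct situation)
    let set: HashSet<u64> = HashSet::new();
    let shallow_by_val = size_of_val(&set);

    // both report the same shallow struct size; neither includes the heap buckets
    assert_eq!(shallow_by_type, shallow_by_val);
}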

In order to make it reusable, we'd need to introduce some parameters; unfortunately, we cannot go fully generic (?):

Here is an idea / outline:

fn estimate_memory_size<T>(num_elements: usize, fixed_size: usize) -> usize {
    // hashbrown keeps its load factor below 7/8, so a table holding
    // `num_elements` entries needs the next power of two above num_elements * 8 / 7 buckets
    let estimated_buckets =
        (num_elements.checked_mul(8).unwrap_or(usize::MAX) / 7).next_power_of_two();

    // fixed-size part of the memory allocation:
    // the caller-provided size of the collection struct itself (HashSet,
    // HashTable, ...) plus 1 byte of control metadata per bucket. If this is
    // used before the collection is created, we can only estimate this fixed
    // part of the collection.
    let fixed = fixed_size + estimated_buckets;
    // variable-size part of the memory allocation:
    // the size of the entry type multiplied by the number of estimated buckets
    let variable = std::mem::size_of::<T>() * estimated_buckets;

    fixed + variable
}

Example usage in PrimitiveDistinctCountAccumulator - fn size()

fn size(&self) -> usize {
    let fixed_size =
        std::mem::size_of_val(self) + std::mem::size_of_val(&self.values);
    estimate_memory_size::<T::Native>(self.values.len(), fixed_size)
}

Example usage in hash_join.rs:

let fixed_size = std::mem::size_of::<JoinHashMap>();
let estimated = estimate_memory_size::<(u64,u64)>(num_rows, fixed_size);

@alamb let me know what you think. Although this might not be the best abstraction (since we need to provide the fixed_size), it might still have its benefits in terms of consistency, maintainability, and testability? If you think it's worth it, I'd proceed with implementing it in datafusion_common (anywhere specific here?).

@alamb
Contributor Author

alamb commented May 31, 2024

@alamb let me know what you think.

Thank you @marvinlanhenke -- I think the idea is valuable

Although this might not be the best abstraction (since we need to provide the fixed_size), it might still have its benefits in terms of consistency, maintainability, and testability?

I think a large benefit would be the documentation of what the parameters mean (given this is so tricky).

If you think it's worth it, I'd proceed with implementing it in datafusion_common (anywhere specific here?).

Thanks! I recommend https://github.com/apache/datafusion/tree/main/datafusion/common/src/utils (perhaps https://github.com/apache/datafusion/blob/main/datafusion/common/src/utils/proxy.rs but rename proxy.rs to memory or something 🤔 )

@marvinlanhenke
Contributor

...guess I can get to this after the weekend.

@yyy1000
Contributor

yyy1000 commented May 31, 2024

...guess I can get to this after the weekend.

Also, please let me know how I could help from the original PR side if you run into a similar issue 😊.

@marvinlanhenke
Contributor

marvinlanhenke commented Jun 1, 2024

...while working on this I noticed another issue in the current implementation in count_distinct/native.rs:

Although we perform a checked_mul when estimating the number of buckets, we unwrap with usize::MAX, which leads to an enormous number of estimated buckets. Later we perform another multiplication by the size of the entry type, which can also overflow.

I think returning an error and changing the signature to Result<usize> instead of unwrapping would be the cleanest solution? However, this requires a lot of changes to the Accumulator trait and all its implementors.

I'm not sure we want to make this change for an edge case like this? The other options would be to cap the estimation at usize::MAX or simply panic.

@alamb @yyy1000 WDYT?

    fn size(&self) -> usize {
        let estimated_buckets = (self.values.len().checked_mul(8).unwrap_or(usize::MAX)
            / 7)
        .next_power_of_two();

        // This can still overflow
        std::mem::size_of_val(self)
            + std::mem::size_of::<T::Native>() * estimated_buckets
            + estimated_buckets
            + std::mem::size_of_val(&self.values)
    }
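
For illustration only, here is a rough sketch (not a final API) of what a checked, Result-returning variant of the proposed estimate_memory_size could look like; a plain String error is used purely to keep the sketch self-contained, whereas the real code would presumably return a DataFusionError:

fn estimate_memory_size_checked<T>(
    num_elements: usize,
    fixed_size: usize,
) -> Result<usize, String> {
    // propagate overflow instead of saturating to usize::MAX
    let estimated_buckets = num_elements
        .checked_mul(8)
        .and_then(|v| (v / 7).checked_next_power_of_two())
        .ok_or_else(|| "overflow while estimating the number of buckets".to_string())?;

    // variable part (entry size * buckets) plus fixed part (caller-provided
    // struct size + 1 byte of control metadata per bucket), each step checked
    std::mem::size_of::<T>()
        .checked_mul(estimated_buckets)
        .and_then(|variable| variable.checked_add(fixed_size))
        .and_then(|sum| sum.checked_add(estimated_buckets))
        .ok_or_else(|| "overflow while estimating the memory size".to_string())
}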
