Refactor size estimation of Hashset into a function #8764

Closed
alamb opened this issue Jan 5, 2024 · 12 comments · Fixed by #10748
Labels
good first issue Good for newcomers

Comments

@alamb
Contributor

alamb commented Jan 5, 2024

          maybe we can (as a follow-on PR) put this logic into its own function (with comments), as estimating the size of hashbrown hash tables is likely to come up again

Originally posted by @alamb in #8721 (comment)

@alamb alamb added the good first issue Good for newcomers label Jan 5, 2024
@alamb
Contributor Author

alamb commented Jan 5, 2024

This would be a nice way to improve the datafusion codebase without having to know too much about it.

@yyy1000
Contributor

yyy1000 commented Jan 7, 2024

I'd like to work on this as a good start. :)

@alamb
Contributor Author

alamb commented Jan 7, 2024

Thank you @yyy1000 !

@yyy1000
Contributor

yyy1000 commented Jan 7, 2024

Hi, @alamb
I created #8779, which should close this issue.
I'm pretty new to this project and would appreciate any review and requested changes!

@alamb
Contributor Author

alamb commented Jan 8, 2024

Thanks @yyy1000 -- I plan to review #8779 later today

@marvinlanhenke
Contributor

@alamb this still seems to be unresolved?

@yyy1000
Contributor

yyy1000 commented May 30, 2024

@alamb this still seems to be unresolved?

Yes, I tried to tackle it but didn't succeed. 🥲
Feel free to pick it up.

@marvinlanhenke
Contributor

marvinlanhenke commented May 31, 2024

@alamb @yyy1000
I've taken a look at this, and I agree with the comments @yyy1000 made on the original PR here and here.

I think the main issue is the point in time when the collection (HashSet/HashTable) is created relative to when the size estimation is done or needed. In count_distinct/native.rs we already have access to the collection when performing the estimation, whereas in hash_join.rs the estimation is done prior to creation. This also leads to the difference between using std::mem::size_of and std::mem::size_of_val().
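
For illustration, a minimal, self-contained sketch of that distinction (the types here are just placeholders): std::mem::size_of::<T>() needs only the type, so it works before the collection exists, while std::mem::size_of_val() needs an instance; both report only the shallow struct size, not the heap-allocated buckets.

use std::collections::HashSet;
use std::mem::{size_of, size_of_val};

fn main() {
    // before creation: only the type is known (the hash_join.rs situation)
    let shallow_by_type = size_of::<HashSet<u64>>();

    // after creation: an instance is available (the count_distinct situation)
    let set: HashSet<u64> = HashSet::new();
    let shallow_by_val = size_of_val(&set);

    // both report the same shallow struct size; neither includes the heap buckets
    assert_eq!(shallow_by_type, shallow_by_val);
}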

In order to make it reusable, we'd need to introduce some parameters; unfortunately, we cannot go fully generic (?):

Here is an idea / outline:

fn estimate_memory_size<T>(num_elements: usize, fixed_size: usize) -> usize {
    // hashbrown keeps its load factor below 7/8, so a table holding
    // `num_elements` entries needs the next power of two above num_elements * 8 / 7 buckets
    let estimated_buckets =
        (num_elements.checked_mul(8).unwrap_or(usize::MAX) / 7).next_power_of_two();

    // fixed-size part of the memory allocation:
    // the caller-provided size of the collection struct itself (HashSet,
    // HashTable, ...) plus 1 byte of control metadata per bucket. If this is
    // used before the collection is created, we can only estimate this fixed
    // part of the collection.
    let fixed = fixed_size + estimated_buckets;
    // variable-size part of the memory allocation:
    // the size of the entry type multiplied by the number of estimated buckets
    let variable = std::mem::size_of::<T>() * estimated_buckets;

    fixed + variable
}

Example usage in PrimitiveDistinctCountAccumulator - fn size()

fn size(&self) -> usize {
    let fixed_size =
        std::mem::size_of_val(self) + std::mem::size_of_val(&self.values);
    estimate_memory_size::<T::Native>(self.values.len(), fixed_size)
}

Example usage in hash_join.rs:

let fixed_size = std::mem::size_of::<JoinHashMap>();
let estimated = estimate_memory_size::<(u64,u64)>(num_rows, fixed_size);

@alamb let me know what you think. Although this might not be the best abstraction (since we need to provide the fixed_size), it might still have its benefits in terms of consistency, maintainability, and testability? If you think it's worth it, I'd proceed with implementing it in datafusion_common (anywhere specific here?).

@alamb
Contributor Author

alamb commented May 31, 2024

@alamb let me know what you think.

Thank you @marvinlanhenke -- I think the idea is valuable

Although this might not be the best abstraction (since we need to provide the fixed_size), it might still have its benefits in terms of consistency, maintainability, and testability?

I think a large benefit would be the documentation of what the parameters mean (given this is so tricky).

If you think it's worth it, I'd proceed with implementing it in datafusion_common (anywhere specific here?).

Thanks! I recommend https://github.com/apache/datafusion/tree/main/datafusion/common/src/utils (perhaps https://github.com/apache/datafusion/blob/main/datafusion/common/src/utils/proxy.rs but rename proxy.rs to memory or something 🤔 )

@marvinlanhenke
Contributor

...guess I can get to this after the weekend.

@yyy1000
Contributor

yyy1000 commented May 31, 2024

...guess I can get to this after the weekend.

Also, please let me know how I could help from the original PR side if you run into a similar issue 😊.

@marvinlanhenke
Contributor

marvinlanhenke commented Jun 1, 2024

...while working on this I noticed another issue in the current implementation in count_distinct/native.rs:

Although we perform a checked_mul when estimating the number of buckets, we unwrap with usize::MAX, which leads to an enormous number of estimated buckets. Later we perform another multiplication by the size of the entry type, which can also overflow.

I think returning an error and changing the signature to Result<usize> instead of unwrapping would be the cleanest solution? However, this requires a lot of changes to the Accumulator trait and all its implementors.

I'm not sure we want to make this change for an edge case like this? The other options would be to cap the estimation at usize::MAX or simply panic.

@alamb @yyy1000 WDYT?

    fn size(&self) -> usize {
        let estimated_buckets = (self.values.len().checked_mul(8).unwrap_or(usize::MAX)
            / 7)
        .next_power_of_two();

        // This can still overflow
        std::mem::size_of_val(self)
            + std::mem::size_of::<T::Native>() * estimated_buckets
            + estimated_buckets
            + std::mem::size_of_val(&self.values)
    }
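
For illustration only, here is a rough sketch (not a final API) of what a checked, Result-returning variant of the proposed estimate_memory_size could look like; a plain String error is used purely to keep the sketch self-contained, whereas the real code would presumably return a DataFusionError:

fn estimate_memory_size_checked<T>(
    num_elements: usize,
    fixed_size: usize,
) -> Result<usize, String> {
    // propagate overflow instead of saturating to usize::MAX
    let estimated_buckets = num_elements
        .checked_mul(8)
        .and_then(|v| (v / 7).checked_next_power_of_two())
        .ok_or_else(|| "overflow while estimating the number of buckets".to_string())?;

    // variable part (entry size * buckets) plus fixed part (caller-provided
    // struct size + 1 byte of control metadata per bucket), each step checked
    std::mem::size_of::<T>()
        .checked_mul(estimated_buckets)
        .and_then(|variable| variable.checked_add(fixed_size))
        .and_then(|sum| sum.checked_add(estimated_buckets))
        .ok_or_else(|| "overflow while estimating the memory size".to_string())
}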
