Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minor: Refactor memory size estimation for HashTable #10748

Merged
merged 8 commits into from
Jun 3, 2024

Conversation

marvinlanhenke
Copy link
Contributor

Which issue does this PR close?

Closes #8764.

Rationale for this change

As stated in the issue here the goal is to consolidate estimation logic into a single function. This allows for better documentation as well as better maintainability.
Also there was an issue with the existing implementation that could still overflow if the checked_mul was capped at usize::MAX. I tried to fix this by checking for overflows and returning usize::MAX as a 'max cap'.

What changes are included in this PR?

  • extract logic into single function
  • fixed missing overflow checks
  • moved logic into datfusion common
  • used the new function in count_distinct/native.rs and hash_join.rs

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the physical-expr Physical Expressions label Jun 1, 2024
@marvinlanhenke
Copy link
Contributor Author

@alamb @yyy1000 PTAL

Copy link
Contributor

@yyy1000 yyy1000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see it's good abstraction.

My only concern is, it seems that the original goal was kind of ideal, which just pass the hashtable and get the estimate size. But now before using this function, user have to calculate the fixed_size themself, and also the calculation is not always the same.

Though the abstraction is not pretty enough, I think given the current code this PR is still an improvement. :)

@marvinlanhenke
Copy link
Contributor Author

I see it's good abstraction.

My only concern is, it seems that the original goal was kind of ideal, which just pass the hashtable and get the estimate size. But now before using this function, user have to calculate the fixed_size themself, and also the calculation is not always the same.

Though the abstraction is not pretty enough, I think given the current code this PR is still an improvement. :)

Yes, I totally agree with you on this. However, as discussed in the issue, it still might be worth it due to the improved documentation.

@yyy1000
Copy link
Contributor

yyy1000 commented Jun 1, 2024

However, as discussed in the issue, it still might be worth it due to the improved documentation.

Agree! I left a comment which may improve but it's not necessary. Thanks!

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @marvinlanhenke -- this looks like a great step forward in readability 🙏

Thank you @yyy1000 for your review

I left some suggestions / comments. Let me know what you think

datafusion/common/src/utils/memory.rs Show resolved Hide resolved
datafusion/common/src/utils/memory.rs Show resolved Hide resolved
datafusion/common/src/utils/memory.rs Outdated Show resolved Hide resolved
datafusion/common/src/utils/memory.rs Outdated Show resolved Hide resolved
datafusion/common/src/utils/memory.rs Show resolved Hide resolved
@marvinlanhenke
Copy link
Contributor Author

@alamb
I made the changes based on your comments. Thanks for the comprehensive review. 🚀

I changed fn estimate_memory_size(..) to return Result<usize> which should set us up for any possible future changes in the Accumulator trait; for now I simply unwrap in fn size() and panic.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @marvinlanhenke

@alamb alamb merged commit a92f803 into apache:main Jun 3, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented Jun 3, 2024

Thanks again @marvinlanhenke

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* refactor: extract estimate_memory_size

* refactor: cap at usize::MAX

* refactor: use estimate_memory_size

* chore: add examples

* refactor: return Result<usize>; add testcase

* fix: docs

* fix: remove unneccessary checked_div

* fix: remove additional and_then
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor size estimation of Hashset into a function
3 participants