-
Notifications
You must be signed in to change notification settings - Fork 932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove hash_character_ngrams dependency from jaccard_index #16241
Remove hash_character_ngrams dependency from jaccard_index #16241
Conversation
Benchmark shows some good improvements for long strings as well
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few small comments but generally all looks fine.
cpp/src/text/jaccard.cu
Outdated
cudf::size_type const* d_uniques2; | ||
cudf::size_type const* d_intersects; | ||
|
||
__device__ float operator()(cudf::size_type idx) const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It surprises me a bit that we return float
and not double
for this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is what is expected. I can double check though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed float
is correct for now.
/merge |
Description
Removes internal dependency of
nvtext::hash_character_ngrams
fromnvtext::jaccard_index
.Works around the size-type limit imposed by
hash_character_ngrams
which returns alist
column.This also specializes the hashing logic for the jaccard calculation specifically.
The overall algorithm has not changed. Code has moved around a bit and internal list-columns have been replaced with just offsets and values vectors.
Closes #16157
Checklist