Describe the bug
Calling jaccard_index on long strings leads to OverflowError: CUDF failure at: /opt/conda/conda-bld/work/cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh:323: Size of output exceeds the column size limit
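Steps/Code to reproduce bug
A minimal sketch that triggers the failure, assuming 11 rows of 214,748,364 characters each and width=5 (the sizes quoted in the analysis below; the exact original snippet may have differed):

```python
import cudf

# Hypothetical repro: 11 rows of 214,748,364 characters each, width=5,
# matching the sizes in the memory analysis below.
big = "a" * 214_748_364
s = cudf.Series([big] * 11)   # ~2.4GB of string data in total

# The total ngram count across all rows exceeds the int32 column size
# limit, raising the OverflowError below.
s.str.jaccard_index(s, width=5)
```

Results in: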
File /opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/core/column/string.py:5378, in StringMethods.jaccard_index(self, input, width)
   5353 def jaccard_index(self, input: cudf.Series, width: int) -> SeriesOrIndex:
   5354     """
   5355     Compute the Jaccard index between this column and the given
   5356     input strings column.
        (...)
   5374     dtype: float32
   5375     """
   5377     return self._return_or_inplace(
-> 5378         libstrings.jaccard_index(self._column, input._column, width),
   5379     )

File /opt/conda/envs/rapids/lib/python3.10/contextlib.py:79, in ContextDecorator.__call__.<locals>.inner(*args, **kwds)
     76 @wraps(func)
     77 def inner(*args, **kwds):
     78     with self._recreate_cm():
---> 79         return func(*args, **kwds)

File jaccard.pyx:26, in cudf._lib.nvtext.jaccard.jaccard_index()

OverflowError: CUDF failure at: /opt/conda/conda-bld/work/cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh:323: Size of output exceeds the column size limit
Expected behavior
Perhaps it is expected for long strings not to work with this method, since I don't see it in #13048, but it would be good to get confirmation.
Environment overview
Environment location: Bare-metal
Method of cuDF install: conda 24.08 nightly
The jaccard API uses hash_character_ngrams internally, which produces a list column of integer values. The total number of integers in that list column equals the number of ngrams generated for the strings column. Here that count exceeds the maximum size_type (int32), so the function is unable to build the output list column.
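A quick back-of-the-envelope check (plain Python, using the sizes from the analysis below) shows the ngram count crossing the int32 limit:

```python
# Ngram count per row is (length - width + 1); assuming 11 rows of
# 214,748,364 characters and width=5, per the analysis below.
num_rows = 11
row_len = 214_748_364
width = 5

ngrams_per_row = row_len - width + 1
total_ngrams = num_rows * ngrams_per_row

INT32_MAX = 2**31 - 1            # cudf size_type limit
print(ngrams_per_row)            # 214,748,360 per row
print(total_ngrams)              # 2,362,231,960 total
print(total_ngrams > INT32_MAX)  # True -> OverflowError
```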
So, as a workaround, you would need to limit the strings column size so that the total number of generated ngrams across all rows does not exceed the max size_type (int32).
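A minimal sketch of that workaround, processing the column in slices small enough that each call stays under the limit (the helper and chunk size here are assumptions for illustration, not part of the cudf API):

```python
import cudf

def chunked_jaccard(s1: cudf.Series, s2: cudf.Series, width: int,
                    rows_per_chunk: int) -> cudf.Series:
    """Hypothetical helper: call jaccard_index slice by slice so the
    per-call ngram count stays below the int32 column size limit."""
    parts = []
    for start in range(0, len(s1), rows_per_chunk):
        end = start + rows_per_chunk
        parts.append(
            s1.iloc[start:end].str.jaccard_index(
                s2.iloc[start:end].reset_index(drop=True), width
            )
        )
    return cudf.concat(parts, ignore_index=True)

# e.g. with ~214.7M-character rows, 9 rows per chunk keeps each call's
# ngram count (~1.93 billion) under the 2**31 - 1 limit.
```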
Meanwhile, I can work on modifying jaccard to avoid this limit, since it is an internal detail of that API.
Even with large-strings support, the amount of memory needed to process this example will be significant.
The original df size is 11 rows of 214,748,364 bytes each = ~2.4GB total input strings size.
Using width=5 means each row generates 214,748,360 individual substrings at 5 bytes each = ~1.1GB per row (11 rows ≈ 12GB). The internal code uses hashing, which reduces each 5-byte substring to a 4-byte hash = ~859MB per row (11 rows ≈ 9.5GB).
Since the jaccard call in this example compares the df with itself, the temporary memory doubles to ~19GB.
Internally, the intermediate substrings/hashes are sorted to help with counting the unique values. The sorted output requires a second temporary copy (of the 9.5GB), which brings the peak to 19 + 9.5 = 28.5GB.
So overall, jaccard_index would need about 6x the input memory (counting both input columns) available for processing.
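The same estimate written out as arithmetic (a plain-Python sketch of the steps above; the doubling for the self-comparison and the extra sorted copy follow the reasoning just described):

```python
GB = 1e9  # decimal GB, matching the figures above
num_rows, row_len, width = 11, 214_748_364, 5

input_bytes = num_rows * row_len                 # ~2.4GB per input column
ngrams_per_row = row_len - width + 1             # 214,748,360 substrings/row
hash_bytes = num_rows * ngrams_per_row * 4       # 4-byte hashes: ~9.4GB

self_compare = 2 * hash_bytes                    # df vs. itself: ~18.9GB
peak = self_compare + hash_bytes                 # + one sorted copy

print(round(peak / GB, 1), "GB peak")            # ~28.3GB (≈ the 28.5GB above)
print(round(peak / (2 * input_bytes), 1), "x")   # ~6.0x the combined input
```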