docs: improve docs for MinHashDedup Step #1050

Merged 2 commits on Nov 7, 2024
8 changes: 4 additions & 4 deletions src/distilabel/steps/filtering/_datasketch.py
@@ -13,8 +13,8 @@
# limitations under the License.

"""
-`dataskech` (https://github.com/ekzhu/datasketch) doesn't offer a way to store the hash tables in disk. This
-is a custom implementation that uses `shelve` to store the hash tables in disk.
+`datasketch` (https://github.com/ekzhu/datasketch) doesn't offer a way to store the hash tables in disk. This
+is a custom implementation that uses `diskcache` to store the hash tables in disk.
Note: This implementation is not optimized for performance, but could be worth
creating a PR to `datasketch`.
"""
@@ -98,15 +98,15 @@ def insert(self, key, *vals, **kwargs):


 def ordered_storage(config, name=None):
-    """Copy of `datasketch.storage.ordered_storage` with the addition of `ShelveListStorage`."""
+    """Copy of `datasketch.storage.ordered_storage` with the addition of `DiskCacheListStorage`."""
     tp = config["type"]
     if tp == "disk":
         return DiskCacheListStorage(config, name=name)
     return _ordered_storage(config, name=name)


 def unordered_storage(config, name=None):
-    """Copy of `datasketch.storage.ordered_storage` with the addition of `ShelveSetStorage`."""
+    """Copy of `datasketch.storage.ordered_storage` with the addition of `DiskCacheSetStorage`."""
     tp = config["type"]
     if tp == "disk":
         return DiskCacheSetStorage(config, name=name)
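
The hunk above dispatches on `config["type"]`: a `"disk"` config gets the custom disk-backed storage, and anything else falls through to the stock `datasketch` factory. A minimal, standard-library sketch of that dispatch pattern follows. It is not the code from this PR: `SqliteListStorage`, `DictListStorage`, `make_ordered_storage`, and the `"path"` config key are all hypothetical stand-ins, and `sqlite3` is swapped in for `diskcache` only so the example runs without extra dependencies.

```python
import json
import sqlite3
from pathlib import Path


class SqliteListStorage:
    """Hypothetical disk-backed mapping from a key to an ordered list of values."""

    def __init__(self, config, name=None):
        # "path" is an assumed config key, not one taken from the PR.
        db_path = Path(config["path"]) / f"{name or 'storage'}.sqlite"
        self._db = sqlite3.connect(db_path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, vals TEXT)"
        )

    def insert(self, key, *vals):
        # Read the current list, append, and write it back as JSON.
        current = self.get(key)
        current.extend(vals)
        self._db.execute(
            "INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, json.dumps(current))
        )
        self._db.commit()

    def get(self, key):
        row = self._db.execute(
            "SELECT vals FROM kv WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else []

    def close(self):
        self._db.close()


class DictListStorage:
    """Hypothetical in-memory stand-in for the default storage."""

    def __init__(self, config, name=None):
        self._data = {}

    def insert(self, key, *vals):
        self._data.setdefault(key, []).extend(vals)

    def get(self, key):
        return list(self._data.get(key, []))


def make_ordered_storage(config, name=None):
    """Return the disk-backed storage for type "disk", else the in-memory one."""
    if config["type"] == "disk":
        return SqliteListStorage(config, name=name)
    return DictListStorage(config, name=name)
```

Both classes expose the same `insert`/`get` surface, so the caller never needs to know which backend it received; that is the property the factory functions in the diff rely on.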
9 changes: 3 additions & 6 deletions src/distilabel/steps/filtering/minhash.py
@@ -92,12 +92,11 @@ class MinHashDedup(Step):

Attributes:
num_perm: the number of permutations to use. Defaults to `128`.
-seed: the seed to use for the MinHash. This seed must be the same
-used for `MinHash`, keep in mind when both steps are created. Defaults to `1`.
+seed: the seed to use for the MinHash. Defaults to `1`.
Review comment (Contributor Author):

I think this detail is not needed after #937

tokenizer: the tokenizer to use. Available ones are `words` or `ngrams`.
-If `words` is selected, it tokenize the text into words using nltk's
+If `words` is selected, it tokenizes the text into words using nltk's
word tokenizer. `ngram` estimates the ngrams (together with the size
-`n`) using. Defaults to `words`.
+`n`). Defaults to `words`.
n: the size of the ngrams to use. Only relevant if `tokenizer="ngrams"`. Defaults to `5`.
threshold: the threshold to consider two MinHashes as duplicates.
Values closer to 0 detect more duplicates. Defaults to `0.9`.
@@ -106,8 +105,6 @@ class MinHashDedup(Step):
not defined in `datasketch`, that is based on DiskCache's `Index` class.
It should work as a `dict`, but backed by disk, but depending on the system
it can be slower. Defaults to `dict`.
-which uses a custom `shelve` backend. Note the `disk`
-is an experimetal feature that may cause issues. Defaults to `dict`.

Input columns:
- text (`str`): the texts to be filtered.
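
To make the attributes above concrete, here is a small pure-Python sketch of MinHash-based deduplication. It is an illustration under stated assumptions, not the `MinHashDedup` implementation: the helper names (`char_ngrams`, `minhash_signature`, `estimated_jaccard`, `dedup_flags`) are hypothetical, the n-grams are assumed to be character n-grams, and signatures are compared pairwise instead of through an LSH index. It shows how `num_perm`, `seed`, `n`, and `threshold` interact.

```python
import hashlib
import random

_PRIME = (1 << 61) - 1      # modulus for the affine hash family
_MAX_HASH = (1 << 32) - 1   # keep 32-bit hash values


def char_ngrams(text, n=5):
    """Return the set of character n-grams (the whole text if it is shorter than n)."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def _base_hash(token):
    """Deterministic 32-bit hash of a token."""
    return int.from_bytes(hashlib.sha1(token.encode("utf-8")).digest()[:4], "little")


def minhash_signature(tokens, num_perm=128, seed=1):
    """One minimum per hash function; the same seed yields the same hash functions."""
    rng = random.Random(seed)
    params = [
        (rng.randrange(1, _PRIME), rng.randrange(0, _PRIME)) for _ in range(num_perm)
    ]
    hashes = [_base_hash(t) for t in tokens]
    return [min(((a * h + b) % _PRIME) & _MAX_HASH for h in hashes) for a, b in params]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def dedup_flags(texts, threshold=0.9, n=5, num_perm=128, seed=1):
    """Return one boolean per text: True to keep it, False if it duplicates a kept text."""
    kept_signatures = []
    flags = []
    for text in texts:
        sig = minhash_signature(char_ngrams(text, n), num_perm=num_perm, seed=seed)
        is_dup = any(estimated_jaccard(sig, k) >= threshold for k in kept_signatures)
        flags.append(not is_dup)
        if not is_dup:
            kept_signatures.append(sig)
    return flags
```

Signatures are only comparable when they were built with the same `seed` and `num_perm`. Lowering `threshold` flags more pairs as duplicates, because a smaller estimated similarity is then enough to drop a row.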