diff --git a/README.md b/README.md
index 05f8fa4..67ee0ec 100644
--- a/README.md
+++ b/README.md
@@ -262,12 +262,43 @@ deduplicated_texts = semhash.self_deduplicate()
+<details>
+<summary>  Using Pandas DataFrames</summary>
+<br>
+
+You can easily use Pandas DataFrames with SemHash. The following code snippet shows how to deduplicate a Pandas DataFrame:
+
+```python
+import pandas as pd
+from datasets import load_dataset
+from semhash import SemHash
+
+# Load a dataset as a pandas dataframe
+dataframe = load_dataset("ag_news", split="train").to_pandas()
+
+# Convert the dataframe to a list of dictionaries
+records = dataframe.to_dict(orient="records")
+
+# Initialize a SemHash instance with the columns to deduplicate
+semhash = SemHash.from_records(records=records, columns=["text"])
+
+# Deduplicate the texts
+deduplicated_records = semhash.self_deduplicate().deduplicated
+
+# Convert the deduplicated records back to a pandas dataframe
+deduplicated_dataframe = pd.DataFrame(deduplicated_records)
+```
+
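+Continuing the snippet above, you can also inspect the result of `self_deduplicate()` before converting back to a dataframe. Below is a minimal sketch, assuming the result object additionally exposes `duplicate_ratio` and `duplicates` attributes (check these against your installed version):
+
+```python
+# Keep the full result object instead of only .deduplicated
+result = semhash.self_deduplicate()
+
+# Fraction of records flagged as duplicates (assumed attribute)
+print(f"Duplicate ratio: {result.duplicate_ratio:.2%}")
+
+# Peek at a few flagged records before discarding them (assumed attribute)
+for duplicate in result.duplicates[:3]:
+    print(duplicate)
+
+# Keep only the deduplicated records as a pandas dataframe
+deduplicated_dataframe = pd.DataFrame(result.deduplicated)
+```
+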
+</details>
+
+NOTE: By default, we use the ANN (approximate nearest neighbors) backend for deduplication. We recommend keeping it enabled: recall on smaller datasets is ~100%, and ANN is needed for larger datasets (>1M samples), which would take too long to deduplicate with the exact backend. If you want to use the flat/exact-matching backend instead, set `use_ann=False` in the SemHash constructor:
+
+```python
+semhash = SemHash.from_records(records=texts, use_ann=False)
+```
+
+
 ## Benchmarks
 
 We've benchmarked SemHash on a variety of datasets to measure the deduplication performance and speed. The benchmarks were run with the following setup: