add semhash
baniasbaabe committed Jan 23, 2025
1 parent 7f2c5d7 commit 7166a9b
Showing 1 changed file with 53 additions and 0 deletions.
53 changes: 53 additions & 0 deletions book/cooltools/Chapter.ipynb
@@ -2502,6 +2502,59 @@
"REDIS='{\"host\": \"localhost\", \"port\": 6379}'\n",
"'''"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deduplicate Huge Datasets with `semhash`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Deduplicate your data at lightning speed in Python! 🔥\n",
"\n",
"Having duplicates in your dataset is annoying and needs to be removed as they do not contribute positively to model training.\n",
"\n",
"But they can be difficult to detect, especially semantic duplicates.\n",
"\n",
"Fortunately, 𝐬𝐞𝐦𝐡𝐚𝐬𝐡 has you covered!\n",
"\n",
"𝐬𝐞𝐦𝐡𝐚𝐬𝐡 deduplicates your dataset at lightning speed.\n",
"\n",
"It uses fast embedding generation with Model2Vec and optional ANN-based similarity search with Vicinity.\n",
"\n",
"For a dataset of 1.8M rows, 𝐬𝐞𝐦𝐡𝐚𝐬𝐡 takes 83 seconds to deduplicate. 🔥\n",
"\n",
"You can, of course, use any model supported by sentence-transformers, or bring your own model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install semhash"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datasets import load_dataset\n",
"from semhash import SemHash\n",
"\n",
"texts = load_dataset(\"ag_news\", split=\"train\")[\"text\"]\n",
"\n",
"semhash = SemHash.from_records(records=texts)\n",
"\n",
"deduplicated_texts = semhash.self_deduplicate().deduplicated"
]
},
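{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want a different embedding model, you can pass one in yourself. The cell below is a minimal sketch based on the `semhash` README: it assumes `SemHash.from_records` accepts a `model` argument and reuses `texts` from the cell above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from model2vec import StaticModel\n",
"from semhash import SemHash\n",
"\n",
"# Assumption: from_records takes a custom encoder via its model argument\n",
"model = StaticModel.from_pretrained(\"minishlab/potion-base-8M\")\n",
"\n",
"semhash = SemHash.from_records(records=texts, model=model)\n",
"\n",
"deduplicated_texts = semhash.self_deduplicate().deduplicated"
]
}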
],
"metadata": {