add semhash
baniasbaabe committed Jan 23, 2025
1 parent 7f2c5d7 commit 7166a9b
Showing 1 changed file with 53 additions and 0 deletions.
53 changes: 53 additions & 0 deletions book/cooltools/Chapter.ipynb
@@ -2502,6 +2502,59 @@
"REDIS='{\"host\": \"localhost\", \"port\": 6379}'\n",
"'''"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Deduplicate Huge Datasets with `semhash`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Deduplicate your data at lightning speed in Python! 🔥\n",
"\n",
"Having duplicates in your dataset is annoying and needs to be removed as they do not contribute positively to model training.\n",
"\n",
"But they can be difficult to detect, especially semantic duplicates.\n",
"\n",
"Fortunately, 𝐬𝐞𝐦𝐡𝐚𝐬𝐡 has you covered!\n",
"\n",
"𝐬𝐞𝐦𝐡𝐚𝐬𝐡 deduplicates your dataset at lightning speed.\n",
"\n",
"It uses fast embedding generation with Model2Vec and optional ANN-based similarity search with Vicinity.\n",
"\n",
"For a dataset of 1.8M rows, 𝐬𝐞𝐦𝐡𝐚𝐬𝐡 takes 83 seconds to deduplicate. 🔥\n",
"\n",
"You can, of course, use any model supported by sentence-transformers, or bring your own model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!pip install semhash"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from datasets import load_dataset\n",
"from semhash import SemHash\n",
"\n",
"texts = load_dataset(\"ag_news\", split=\"train\")[\"text\"]\n",
"\n",
"semhash = SemHash.from_records(records=texts)\n",
"\n",
"deduplicated_texts = semhash.self_deduplicate().deduplicated"
]
},
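{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want a different embedding model, you can pass one in yourself. The cell below is a minimal sketch based on the `semhash` README: it assumes `SemHash.from_records` accepts a `model` argument and reuses `texts` from the cell above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from model2vec import StaticModel\n",
"from semhash import SemHash\n",
"\n",
"# Assumption: from_records takes a custom encoder via its model argument\n",
"model = StaticModel.from_pretrained(\"minishlab/potion-base-8M\")\n",
"\n",
"semhash = SemHash.from_records(records=texts, model=model)\n",
"\n",
"deduplicated_texts = semhash.self_deduplicate().deduplicated"
]
}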
],
"metadata": {