From 2d9f7777f81d5d3b312e17891fd7ce189744f449 Mon Sep 17 00:00:00 2001
From: Thomas McMurphy
Date: Mon, 22 May 2017 18:54:29 +0000
Subject: [PATCH] Add paragraph describing dictionary.dfs and dictionary.compactify()

In code snippet 13 there are two new concepts introduced that have not
been explained yet. In addition the workflow to create the dictionary
here is completely different from the workflow described in code
snippets 4 and 5. I've added a paragraph that tries to explain the new
workflow and concepts.
---
 docs/notebooks/Corpora_and_Vector_Spaces.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/notebooks/Corpora_and_Vector_Spaces.ipynb b/docs/notebooks/Corpora_and_Vector_Spaces.ipynb
index c6f7b6b189..2258f46fda 100644
--- a/docs/notebooks/Corpora_and_Vector_Spaces.ipynb
+++ b/docs/notebooks/Corpora_and_Vector_Spaces.ipynb
@@ -340,7 +340,7 @@
  "source": [
  "Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want.\n",
  "\n",
- "Similarly, to construct the dictionary without loading all texts into memory:"
+ "We are going to create the dictionary from the mycorpus.txt file without loading the entire file into memory. Then, we will generate the list of token ids to remove from this dictionary by querying the dictionary for the token ids of the stop words, and by querying the document frequencies dictionary (dictionary.dfs) for token ids that only appear once. Finally, we will filter these token ids out of our dictionary and call dictionary.compactify() to remove the gaps in the token id series."
  ]
  },
  {