Allow use of truncated Dictionary for coherence measures #1342

Closed
macks22 opened this issue May 22, 2017 · 5 comments

macks22 commented May 22, 2017

Description

I used the make_wikicorpus.py script to parse the latest English Wikipedia. This script filters the extremes in the Dictionary using wiki.dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=100000). I then trained an LDA model on the corpus (see code below). Afterwards, I tried to compute the c_v coherence on this corpus using the Dictionary produced by the make_wikicorpus.py script. This failed because the Dictionary does not contain all terms in the Wikipedia corpus. It should not fail: the missing terms are not in the top_ids set built from the actual topics, so they are not relevant to the coherence computation.

Coherence calculation should be possible with the same dictionary used to train the model, even if it does not contain all tokens in the texts.
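
For context, the relevant filtering step in make_wikicorpus.py looks roughly like the sketch below (paraphrased from the script); the no_above=0.1 setting is what prunes stopword-like tokens such as 'of' from the Dictionary:

# Paraphrased sketch of the make_wikicorpus.py filtering step: build the
# Dictionary from the dump, then drop tokens in fewer than 20 documents or
# in more than 10% of documents, keeping the 100,000 most frequent survivors.
wiki = gensim.corpora.WikiCorpus('enwiki-latest-pages-articles.xml.bz2')
wiki.dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=100000)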

Steps/Code/Corpus to Reproduce

import gensim

# Dictionary and TF-IDF corpus produced by the make_wikicorpus.py script
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt.bz2')
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
lda = gensim.models.ldamulticore.LdaMulticore(corpus=mm, id2word=id2word, num_topics=100, chunksize=10000, passes=1)
# Top-20 term ids for each of the 100 topics
topics = [[t[0] for t in lda.get_topic_terms(i, 20)] for i in range(100)]
# Token streams for the raw Wikipedia articles
texts = gensim.corpora.WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary=id2word).get_texts()
cm = gensim.models.coherencemodel.CoherenceModel(topics=topics, texts=texts, dictionary=id2word, coherence='c_v', topn=20, window_size=110)

Expected Results

The coherence measure is computed from the Wikipedia corpus using the Dictionary built by the make_wikicorpus.py script.

Actual Results

cm.get_coherence()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-78-a170469ad280> in <module>()
----> 1 cm.get_coherence()

/Users/vru959/anaconda2/lib/python2.7/site-packages/gensim/models/coherencemodel.pyc in get_coherence(self)
    207                 self.window_size = sliding_windows_dict[self.coherence]
    208             per_topic_postings, num_windows = measure.prob(texts=self.texts, segmented_topics=segmented_topics,
--> 209                                                            dictionary=self.dictionary, window_size=self.window_size)
    210             if self.coherence == 'c_v':
    211                 confirmed_measures = measure.conf(self.topics, segmented_topics, per_topic_postings, 'nlr', 1, num_windows)

/Users/vru959/anaconda2/lib/python2.7/site-packages/gensim/topic_coherence/probability_estimation.pyc in p_boolean_sliding_window(texts, segmented_topics, dictionary, window_size)
     96         it = iter(document)
     97         window = tuple(islice(it, window_size))
---> 98         window_id, per_topic_postings = add_topic_posting(top_ids, window, per_topic_postings, window_id, token2id_dict)
     99         for elem in it:
    100             window = window[1:] + (elem,)

/Users/vru959/anaconda2/lib/python2.7/site-packages/gensim/topic_coherence/probability_estimation.pyc in add_topic_posting(top_ids, window, per_topic_postings, window_id, token2id_dict)
     84     def add_topic_posting(top_ids, window, per_topic_postings, window_id, token2id_dict):
     85         for word in window:
---> 86             word_id = token2id_dict[word]
     87             if word_id in top_ids:
     88                 if word_id in per_topic_postings:

KeyError: 'of'

Versions

Darwin-16.5.0-x86_64-i386-64bit
('Python', '2.7.13 |Anaconda custom (x86_64)| (default, Dec 20 2016, 23:05:08) \n[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]')
('NumPy', '1.12.1')
('SciPy', '0.19.0')
('gensim', '2.1.0')
('FAST_VERSION', 1)
menshikh-iv commented May 22, 2017

Hello @macks22, thank you for your report. I need some time to reproduce the bug.

It seems to me that the problem is in the texts variable: I think it contains tokens that do not exist in the dictionary, because the dictionary parameter of gensim.corpora.WikiCorpus is used only when you call __iter__ (not get_texts()) to return the doc2bow representation.

Could you please filter texts manually (remove tokens that the dictionary does not contain) and try again?
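
A minimal sketch of that manual filtering, reusing id2word and texts from the reproduction code above:

# Keep only tokens the truncated Dictionary still contains; token2id maps
# surface tokens to ids, so its keys are exactly the retained vocabulary.
vocab = set(id2word.token2id)
filtered_texts = ([token for token in doc if token in vocab] for doc in texts)
cm = gensim.models.coherencemodel.CoherenceModel(topics=topics, texts=filtered_texts, dictionary=id2word, coherence='c_v', topn=20, window_size=110)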


macks22 commented May 22, 2017

I'm fairly certain that is the issue; I've modified the topic_coherence.probability_estimation.p_boolean_sliding_window method to only try to convert words to their ids if they are actually in the set accumulated from the topics. So far, this appears to have fixed the issue, though I can't be sure until the get_coherence method finishes; it is slow on the full Wikipedia corpus of texts.
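
A sketch of that guard, paraphrasing the add_topic_posting helper shown in the traceback above (the actual patch may differ in details):

def add_topic_posting(top_ids, window, per_topic_postings, window_id, token2id_dict):
    for word in window:
        # Look tokens up safely: words pruned by filter_extremes are skipped
        # instead of raising KeyError, and only topic terms are counted.
        word_id = token2id_dict.get(word)
        if word_id is not None and word_id in top_ids:
            per_topic_postings.setdefault(word_id, set()).add(window_id)
    window_id += 1
    return window_id, per_topic_postings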

menshikh-iv commented May 22, 2017

@macks22 coherence="c_v" is slow; you can try coherence="u_mass" instead (it works significantly faster).
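
For reference, u_mass estimates probabilities from the bag-of-words corpus rather than the raw texts, so the call would look roughly like this (reusing mm and topics from the reproduction code):

cm_umass = gensim.models.coherencemodel.CoherenceModel(topics=topics, corpus=mm, dictionary=id2word, coherence='u_mass')
cm_umass.get_coherence()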

macks22 pushed a commit to macks22/gensim that referenced this issue May 22, 2017
…culation by avoiding lookup of tokens not in the topic token lists.
macks22 pushed a commit to macks22/gensim that referenced this issue May 22, 2017
…vant words, and ensure each relevant word has a set in the `per_topic_postings` dict.

macks22 commented May 22, 2017

@menshikh-iv thank you for the suggestion, but unfortunately the code path for coherence="u_mass" does not use p_boolean_sliding_window, which I suspect is why it is so much faster.


tmylk commented May 22, 2017

c_v is the best/recommended coherence, so we should make it as fast as possible.

macks22 pushed a commit to macks22/gensim that referenced this issue May 30, 2017
…f accumulator in CoherenceModel, and various test fixes.
macks22 pushed a commit to macks22/gensim that referenced this issue May 30, 2017
…rpus.get_texts`; instead, log warning and do not set `length`.
macks22 pushed a commit to macks22/gensim that referenced this issue May 30, 2017
…and non-empty blank lines in `text_analysis`.
macks22 pushed a commit to macks22/gensim that referenced this issue May 31, 2017
…ith repeated counting of tokens that occur more than once in a window.
macks22 pushed a commit to macks22/gensim that referenced this issue May 31, 2017
… module; cleaned up spacing in coherencemodel.
macks22 pushed a commit to macks22/gensim that referenced this issue Jun 1, 2017
…acking and avoid undue network traffic by moving relevancy filtering and token conversion to the master process.
macks22 pushed a commit to macks22/gensim that referenced this issue Jun 1, 2017
…sing a `collections.Counter` instance for accumulation within a batch.
macks22 pushed a commit to macks22/gensim that referenced this issue Jun 1, 2017
…he Dictionary mapping to different ids, so fixed the `probability_estimation` tests to be agnostic of this. Also fixed an issue with the interpretation of strings as iterables when getting occurrences of strings in the `text_analysis.BaseAnalyzer.__getitem__` method.
macks22 pushed a commit to macks22/gensim that referenced this issue Jun 2, 2017
…rencemodel accumulator caching; made model a property with a setter that also sets the topics and uncaches the accumulator if the model's topics have ids not tracked by the accumulator.
macks22 pushed a commit to macks22/gensim that referenced this issue Jun 6, 2017
… to return individual topic coherence values, then average those. Make the `ParallelWordOccurrenceAccumulator` return a `WordOccurrenceAccumulator` after accumulation, so it can be trained further afterwards if desired.
macks22 pushed a commit to macks22/gensim that referenced this issue Jun 7, 2017
… individual topic coherence values, then average those.
macks22 pushed a commit to macks22/gensim that referenced this issue Jun 8, 2017
…for unique ids from topic segments.