Allow use of truncated Dictionary for coherence measures #1342

Closed
macks22 opened this issue May 22, 2017 · 5 comments

macks22 commented May 22, 2017

Description

I used the make_wikicorpus.py script to parse the latest English Wikipedia. This script filters the extremes in the Dictionary using wiki.dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=100000). I then trained an LDA model on the corpus (see code below). Afterwards, I tried to compute the c_v coherence on this corpus using the Dictionary produced by the make_wikicorpus.py script. This failed because the Dictionary does not contain all terms in the Wikipedia corpus. It should not fail: the missing terms are not in the top_ids set built from the actual topics, so they are not relevant to the coherence computation.

Coherence calculation should be possible with the same dictionary used to train the model, even if it does not contain all tokens in the texts.
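
For context, the relevant filtering step in make_wikicorpus.py looks roughly like the sketch below (paraphrased from the script); the no_above=0.1 setting is what prunes stopword-like tokens such as 'of' from the Dictionary:

# Paraphrased sketch of the make_wikicorpus.py filtering step: build the
# Dictionary from the dump, then drop tokens in fewer than 20 documents or
# in more than 10% of documents, keeping the 100,000 most frequent survivors.
wiki = gensim.corpora.WikiCorpus('enwiki-latest-pages-articles.xml.bz2')
wiki.dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=100000)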

Steps/Code/Corpus to Reproduce

import gensim

# Dictionary and TF-IDF corpus produced by the make_wikicorpus.py script
id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt.bz2')
mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
lda = gensim.models.ldamulticore.LdaMulticore(corpus=mm, id2word=id2word, num_topics=100, chunksize=10000, passes=1)
# Top-20 term ids for each of the 100 topics
topics = [[t[0] for t in lda.get_topic_terms(i, 20)] for i in range(100)]
# Token streams for the raw Wikipedia articles
texts = gensim.corpora.WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary=id2word).get_texts()
cm = gensim.models.coherencemodel.CoherenceModel(topics=topics, texts=texts, dictionary=id2word, coherence='c_v', topn=20, window_size=110)

Expected Results

The coherence measure is computed from the Wikipedia corpus using the Dictionary built by the make_wikicorpus.py script.

Actual Results

cm.get_coherence()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-78-a170469ad280> in <module>()
----> 1 cm.get_coherence()

/Users/vru959/anaconda2/lib/python2.7/site-packages/gensim/models/coherencemodel.pyc in get_coherence(self)
    207                 self.window_size = sliding_windows_dict[self.coherence]
    208             per_topic_postings, num_windows = measure.prob(texts=self.texts, segmented_topics=segmented_topics,
--> 209                                                            dictionary=self.dictionary, window_size=self.window_size)
    210             if self.coherence == 'c_v':
    211                 confirmed_measures = measure.conf(self.topics, segmented_topics, per_topic_postings, 'nlr', 1, num_windows)

/Users/vru959/anaconda2/lib/python2.7/site-packages/gensim/topic_coherence/probability_estimation.pyc in p_boolean_sliding_window(texts, segmented_topics, dictionary, window_size)
     96         it = iter(document)
     97         window = tuple(islice(it, window_size))
---> 98         window_id, per_topic_postings = add_topic_posting(top_ids, window, per_topic_postings, window_id, token2id_dict)
     99         for elem in it:
    100             window = window[1:] + (elem,)

/Users/vru959/anaconda2/lib/python2.7/site-packages/gensim/topic_coherence/probability_estimation.pyc in add_topic_posting(top_ids, window, per_topic_postings, window_id, token2id_dict)
     84     def add_topic_posting(top_ids, window, per_topic_postings, window_id, token2id_dict):
     85         for word in window:
---> 86             word_id = token2id_dict[word]
     87             if word_id in top_ids:
     88                 if word_id in per_topic_postings:

KeyError: 'of'

Versions

Darwin-16.5.0-x86_64-i386-64bit
('Python', '2.7.13 |Anaconda custom (x86_64)| (default, Dec 20 2016, 23:05:08) \n[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]')
('NumPy', '1.12.1')
('SciPy', '0.19.0')
('gensim', '2.1.0')
('FAST_VERSION', 1)
menshikh-iv commented May 22, 2017

Hello @macks22, thank you for your report. I need some time to reproduce the bug.

It seems to me that the problem is in the texts variable: I think it contains tokens that do not exist in the dictionary, because the dictionary parameter of gensim.corpora.WikiCorpus is used only when you call __iter__ (not get_texts()) to return the doc2bow representation.

Could you please filter texts manually (remove tokens that the dictionary does not contain) and try again?
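
A minimal sketch of that manual filtering, reusing id2word and texts from the reproduction code above:

# Keep only tokens the truncated Dictionary still contains; token2id maps
# surface tokens to ids, so its keys are exactly the retained vocabulary.
vocab = set(id2word.token2id)
filtered_texts = ([token for token in doc if token in vocab] for doc in texts)
cm = gensim.models.coherencemodel.CoherenceModel(topics=topics, texts=filtered_texts, dictionary=id2word, coherence='c_v', topn=20, window_size=110)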


macks22 commented May 22, 2017

I'm fairly certain that is the issue; I've modified the topic_coherence.probability_estimation.p_boolean_sliding_window method to only try to convert words to their ids if they are actually in the set accumulated from the topics. So far, this appears to have fixed the issue, though I can't be sure until the get_coherence method finishes; it is slow on the full Wikipedia corpus of texts.
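
A sketch of that guard, paraphrasing the add_topic_posting helper shown in the traceback above (the actual patch may differ in details):

def add_topic_posting(top_ids, window, per_topic_postings, window_id, token2id_dict):
    for word in window:
        # Look tokens up safely: words pruned by filter_extremes are skipped
        # instead of raising KeyError, and only topic terms are counted.
        word_id = token2id_dict.get(word)
        if word_id is not None and word_id in top_ids:
            per_topic_postings.setdefault(word_id, set()).add(window_id)
    window_id += 1
    return window_id, per_topic_postings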

menshikh-iv commented May 22, 2017

@macks22 coherence="c_v" is slow; you can try coherence="u_mass" instead (it works significantly faster).
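
For reference, u_mass estimates probabilities from the bag-of-words corpus rather than the raw texts, so the call would look roughly like this (reusing mm and topics from the reproduction code):

cm_umass = gensim.models.coherencemodel.CoherenceModel(topics=topics, corpus=mm, dictionary=id2word, coherence='u_mass')
cm_umass.get_coherence()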

macks22 pushed a commit to macks22/gensim that referenced this issue May 22, 2017
…culation by avoiding lookup of tokens not in the topic token lists.
macks22 pushed a commit to macks22/gensim that referenced this issue May 22, 2017
…vant words, and ensure each relevant word has a set in the `per_topic_postings` dict.

macks22 commented May 22, 2017

@menshikh-iv thank you for the suggestion, but unfortunately the code path for coherence="u_mass" does not use p_boolean_sliding_window, which I suspect is why it is so much faster.


tmylk commented May 22, 2017

c_v is the best/recommended coherence, so we should make it as fast as possible.

macks22 pushed a commit to macks22/gensim that referenced this issue May 30, 2017
…f accumulator in CoherenceModel, and various test fixes.
macks22 pushed a commit to macks22/gensim that referenced this issue May 30, 2017
…rpus.get_texts`; instead, log warning and do not set `length`.
macks22 pushed a commit to macks22/gensim that referenced this issue May 30, 2017
…and non-empty blank lines in `text_analysis`.
macks22 pushed a commit to macks22/gensim that referenced this issue May 31, 2017
…ith repeated counting of tokens that occur more than once in a window.
macks22 pushed a commit to macks22/gensim that referenced this issue May 31, 2017
… module; cleaned up spacing in coherencemodel.
macks22 pushed a commit to macks22/gensim that referenced this issue Jun 1, 2017
…acking and avoid undue network traffic by moving relevancy filtering and token conversion to the master process.
macks22 pushed a commit to macks22/gensim that referenced this issue Jun 1, 2017
…sing a `collections.Counter` instance for accumulation within a batch.
macks22 pushed a commit to macks22/gensim that referenced this issue Jun 1, 2017
…he Dictionary mapping to different ids, so fixed the `probability_estimation` tests to be agnostic of this. Also fixed an issue with the interpretation of strings as iterables when getting occurrences of strings in the `text_analysis.BaseAnalyzer.__getitem__` method.
macks22 pushed a commit to macks22/gensim that referenced this issue Jun 2, 2017
…rencemodel accumulator caching; made model a property with a setter that also sets the topics and uncaches the accumulator if the model's topics have ids not tracked by the accumulator.
macks22 pushed a commit to macks22/gensim that referenced this issue Jun 6, 2017
… to return individual topic coherence values, then average those. Make the `ParallelWordOccurrenceAccumulator` return a `WordOccurrenceAccumulator` after accumulation, so it can be trained further afterwards if desired.
macks22 pushed a commit to macks22/gensim that referenced this issue Jun 7, 2017
… individual topic coherence values, then average those.
macks22 pushed a commit to macks22/gensim that referenced this issue Jun 8, 2017
…for unique ids from topic segments.