Add a new indirect confirmation measure based on word2vec similarity #1380

Closed
macks22 opened this issue Jun 1, 2017 · 1 comment · Fixed by #1530

macks22 (Contributor) commented Jun 1, 2017

Description

As introduced in [1], topic coherence can be computed using word2vec similarity of terms. This is roughly compatible with the coherence evaluation framework posited by [2] and currently implemented in gensim. The probability estimation phase will involve training a word2vec model on the given corpus, and the confirmation measure will use the average cosine similarity between context vectors from the word2vec model. [1] did not analyze which segmentation strategy is optimal, but a reasonable guess is to use the same method as "c_v" coherence, since that method also uses context vectors in a semantic space.

The implementation should also support passing in a pre-trained word2vec model, or a path to one.
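
For illustration, here is a minimal sketch of the confirmation idea (the helper name and arguments are hypothetical, not gensim API, and the c_v-style segmentation into context vectors is omitted for simplicity): train word2vec on the corpus, then score a topic's top words by the average pairwise cosine similarity of their vectors.

from itertools import combinations

import numpy as np
from gensim.models import Word2Vec

def topic_w2v_coherence(topic_words, texts, **w2v_kwargs):
    """Average pairwise cosine similarity of a topic's top words, using
    vectors from a word2vec model trained on the tokenized `texts`."""
    kv = Word2Vec(sentences=texts, **w2v_kwargs).wv
    sims = [
        kv.similarity(w1, w2)
        for w1, w2 in combinations(topic_words, 2)
        if w1 in kv and w2 in kv  # skip out-of-vocabulary terms
    ]
    return float(np.mean(sims)) if sims else float('nan')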

Steps/Code/Corpus to Reproduce

I'm envisioning an API something like the following:

import gensim
from gensim.models import CoherenceModel

# Option 1: let the CoherenceModel load the embeddings from a path
cm = CoherenceModel(corpus=corpus, coherence="c_w2v", embeddings="./path-to-model.bin.gz", embeddings_kwargs=dict(binary=True))
# Option 2: pass a pre-loaded KeyedVectors instance
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin.gz', binary=True)
cm = CoherenceModel(corpus=corpus, coherence="c_w2v", embeddings=model)
cm.get_coherence()

Versions

This should be supported under all versions.

References

[1] D. O’Callaghan, D. Greene, J. Carthy, and P. Cunningham, “An analysis of the coherence of descriptors in topic modeling,” Expert Systems with Applications, vol. 42, no. 13, pp. 5645–5657, Aug. 2015.
[2] M. Röder, A. Both, and A. Hinneburg, “Exploring the Space of Topic Coherence Measures,” in Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM '15), 2015, pp. 399–408.

macks22 (Contributor, Author) commented Jun 6, 2017

I'm working on an implementation of this, but it will build on #1349.

macks22 pushed a series of commits to macks22/gensim referencing this issue between Jun 14 and Aug 13, 2017; their full messages are reproduced in the squash-merge commit below.

menshikh-iv pushed a commit that referenced this issue Sep 18, 2017
* #1380: Initial implementation of coherence using word2vec similarity.

* #1380: Add the `keyed_vectors` kwarg to the `CoherenceModel` to allow passing in pre-trained, pre-loaded word embeddings, and adjust the similarity measure to handle missing terms in the vocabulary. Add a `with_std` option to all confirmation measures that allows the caller to get the standard deviation between the topic segment sets as well as the means.

* #1380: Add tests for `with_std` option for confirmation measures, and add test case to sanity check `word2vec_similarity`.

* #1380: Add a `get_topics` method to all topic models, add test coverage for this, and update the `CoherenceModel` to use this for getting topics from models.

* #1380: Require topics returned from `get_topics` to be probability distributions for the probabilistic topic models.

* #1380: Clean up flake8 warnings.

* #1380: Make `topn` a property so setting it to higher values will uncache the accumulator and the topics will be shrunk/expanded accordingly.

* #1380: Pass through `with_std` argument for all coherence measures.

* Update `test_coherencemodel` to skip Mallet and Vowpal Wabbit tests if the executables are not installed, instead of passing them inappropriately.

* Fix trailing whitespace.

* Add `get_topics` method to `BaseTopicModel` and update notebook for new Word2Vec-based coherence metric "c_w2v".

* Add several helper methods to the `CoherenceModel` for comparing a set of models or top-N lists efficiently. Update the notebook to use the helper methods. Add `TextDirectoryCorpus` import in `corpora.__init__` so it can be imported from package level. Update notebook to use `corpora.TextDirectoryCorpus` instead of redefining it.

* fix flake8 whitespace issues

* fix order of imports in `corpora.__init__`

* fix corpora.__init__ import order

* push fix for setting `topn` in `CoherenceModel.for_topics`

* Use `dict.pop` in place of checking and optionally getting and deleting topn in `CoherenceModel.for_topics`.

* fix non-deterministic test failure in `test_coherencemodel`

* Update coherence model selection notebook to use sklearn dataset loader to get 20 newsgroups corpus. Add `with_support` option to the confirmation measures to determine how many words were ignored during calculation. Add `flatten` function to `utils` that recursively flattens an iterable into a list. Improve the robustness of coherence model comparison by using nanmean and mean value imputation when looping over the grid of top-N values to compute coherence for a model. Fix too-long logging statement lines in `text_analysis`.
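
Taken together, the merged changes described above should allow usage along these lines (a sketch based on the commit messages; exact argument names and behavior may differ between gensim versions):

from gensim.corpora import Dictionary
from gensim.models import KeyedVectors, LdaModel
from gensim.models.coherencemodel import CoherenceModel

# texts: a list of tokenized documents
dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=10)

# `get_topics` (added by this PR) returns the topic-word probability matrix.
topic_term = lda.get_topics()

# Pre-loaded embeddings go in via the `keyed_vectors` kwarg adopted in the
# merged PR (rather than the `embeddings` kwarg proposed in this issue).
kv = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin.gz', binary=True)

cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                    coherence='c_w2v', keyed_vectors=kv, topn=10)
print(cm.get_coherence())
# `with_std` reports the standard deviation across a topic's segment
# sets alongside the mean, as described in the commits above.
print(cm.get_coherence_per_topic(with_std=True))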