Add a new indirect confirmation measure based on word2vec similarity #1380

Closed
macks22 opened this issue Jun 1, 2017 · 1 comment · Fixed by #1530

macks22 (Contributor) commented Jun 1, 2017

Description

As introduced in [1], topic coherence can be computed using word2vec similarity of terms. This is roughly compatible with the coherence evaluation framework posited by [2] and currently implemented in gensim. The probability estimation phase will involve training a word2vec model on the given corpus, and the confirmation measure will use the average cosine similarity between context vectors from the word2vec model. [1] did not analyze which segmentation strategy is optimal, but a reasonable guess is to use the same method as "c_v" coherence, since that method also uses context vectors in a semantic space.

The implementation should also support passing in a pre-trained word2vec model, or a path to one.
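
For illustration, here is a minimal sketch of the confirmation idea (the helper name and arguments are hypothetical, not gensim API, and the c_v-style segmentation into context vectors is omitted for simplicity): train word2vec on the corpus, then score a topic's top words by the average pairwise cosine similarity of their vectors.

from itertools import combinations

import numpy as np
from gensim.models import Word2Vec

def topic_w2v_coherence(topic_words, texts, **w2v_kwargs):
    """Average pairwise cosine similarity of a topic's top words, using
    vectors from a word2vec model trained on the tokenized `texts`."""
    kv = Word2Vec(sentences=texts, **w2v_kwargs).wv
    sims = [
        kv.similarity(w1, w2)
        for w1, w2 in combinations(topic_words, 2)
        if w1 in kv and w2 in kv  # skip out-of-vocabulary terms
    ]
    return float(np.mean(sims)) if sims else float('nan')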

Steps/Code/Corpus to Reproduce

I'm envisioning an API something like the following:

import gensim
from gensim.models import CoherenceModel

# Option 1: let the CoherenceModel load the embeddings from a path
cm = CoherenceModel(corpus=corpus, coherence="c_w2v", embeddings="./path-to-model.bin.gz", embeddings_kwargs=dict(binary=True))
# Option 2: pass a pre-loaded KeyedVectors instance
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin.gz', binary=True)
cm = CoherenceModel(corpus=corpus, coherence="c_w2v", embeddings=model)
cm.get_coherence()

Versions

This should be supported under all versions.

References

[1] D. O’Callaghan, D. Greene, J. Carthy, and P. Cunningham, “An analysis of the coherence of descriptors in topic modeling,” Expert Systems with Applications, vol. 42, no. 13, pp. 5645–5657, Aug. 2015.
[2] M. Röder, A. Both, and A. Hinneburg, “Exploring the Space of Topic Coherence Measures,” in Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM '15), 2015, pp. 399–408.

macks22 (Contributor, Author) commented Jun 6, 2017

I'm working on an implementation of this, but it will build on #1349.

macks22 pushed a series of commits to macks22/gensim referencing this issue between Jun 14 and Aug 13, 2017; their full messages are reproduced in the squash-merge commit below.

menshikh-iv pushed a commit that referenced this issue Sep 18, 2017
* #1380: Initial implementation of coherence using word2vec similarity.

* #1380: Add the `keyed_vectors` kwarg to the `CoherenceModel` to allow passing in pre-trained, pre-loaded word embeddings, and adjust the similarity measure to handle missing terms in the vocabulary. Add a `with_std` option to all confirmation measures that allows the caller to get the standard deviation between the topic segment sets as well as the means.

* #1380: Add tests for `with_std` option for confirmation measures, and add test case to sanity check `word2vec_similarity`.

* #1380: Add a `get_topics` method to all topic models, add test coverage for this, and update the `CoherenceModel` to use this for getting topics from models.

* #1380: Require topics returned from `get_topics` to be probability distributions for the probabilistic topic models.

* #1380: Clean up flake8 warnings.

* #1380: Make `topn` a property so setting it to higher values will uncache the accumulator and the topics will be shrunk/expanded accordingly.

* #1380: Pass through `with_std` argument for all coherence measures.

* Update `test_coherencemodel` to skip Mallet and Vowpal Wabbit tests if the executables are not installed, instead of passing them inappropriately.

* Fix trailing whitespace.

* Add `get_topics` method to `BaseTopicModel` and update notebook for new Word2Vec-based coherence metric "c_w2v".

* Add several helper methods to the `CoherenceModel` for comparing a set of models or top-N lists efficiently. Update the notebook to use the helper methods. Add `TextDirectoryCorpus` import in `corpora.__init__` so it can be imported from package level. Update notebook to use `corpora.TextDirectoryCorpus` instead of redefining it.

* fix flake8 whitespace issues

* fix order of imports in `corpora.__init__`

* fix corpora.__init__ import order

* push fix for setting `topn` in `CoherenceModel.for_topics`

* Use `dict.pop` in place of checking and optionally getting and deleting topn in `CoherenceModel.for_topics`.

* fix non-deterministic test failure in `test_coherencemodel`

* Update coherence model selection notebook to use sklearn dataset loader to get 20 newsgroups corpus. Add `with_support` option to the confirmation measures to determine how many words were ignored during calculation. Add `flatten` function to `utils` that recursively flattens an iterable into a list. Improve the robustness of coherence model comparison by using nanmean and mean value imputation when looping over the grid of top-N values to compute coherence for a model. Fix too-long logging statement lines in `text_analysis`.
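
Taken together, the merged changes described above should allow usage along these lines (a sketch based on the commit messages; exact argument names and behavior may differ between gensim versions):

from gensim.corpora import Dictionary
from gensim.models import KeyedVectors, LdaModel
from gensim.models.coherencemodel import CoherenceModel

# texts: a list of tokenized documents
dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=10)

# `get_topics` (added by this PR) returns the topic-word probability matrix.
topic_term = lda.get_topics()

# Pre-loaded embeddings go in via the `keyed_vectors` kwarg adopted in the
# merged PR (rather than the `embeddings` kwarg proposed in this issue).
kv = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin.gz', binary=True)

cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                    coherence='c_w2v', keyed_vectors=kv, topn=10)
print(cm.get_coherence())
# `with_std` reports the standard deviation across a topic's segment
# sets alongside the mean, as described in the commits above.
print(cm.get_coherence_per_topic(with_std=True))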