Add a new indirect confirmation measure based on word2vec similarity #1380
Comments
I'm working on an implementation of this, but it will build on #1349.
Implementation commits referencing this issue were pushed to macks22/gensim between June and August 2017. The work was merged via a commit pushed by menshikh-iv on Sep 18, 2017, with the following summary:
* #1380: Initial implementation of coherence using word2vec similarity.
* #1380: Add the `keyed_vectors` kwarg to the `CoherenceModel` to allow passing in pre-trained, pre-loaded word embeddings, and adjust the similarity measure to handle missing terms in the vocabulary. Add a `with_std` option to all confirmation measures that allows the caller to get the standard deviation between the topic segment sets as well as the means.
* #1380: Add tests for the `with_std` option for confirmation measures, and add a test case to sanity check `word2vec_similarity`.
* #1380: Add a `get_topics` method to all topic models, add test coverage for this, and update the `CoherenceModel` to use this for getting topics from models.
* #1380: Require topics returned from `get_topics` to be probability distributions for the probabilistic topic models.
* #1380: Clean up flake8 warnings.
* #1380: Make `topn` a property so setting it to higher values will uncache the accumulator and the topics will be shrunk/expanded accordingly.
* #1380: Pass through the `with_std` argument for all coherence measures.
* Update `test_coherencemodel` to skip the Mallet and Vowpal Wabbit tests if the executables are not installed, instead of passing them inappropriately.
* Fix trailing whitespace.
* Add the `get_topics` method to `BaseTopicModel` and update the notebook for the new Word2Vec-based coherence metric "c_w2v".
* Add several helper methods to the `CoherenceModel` for comparing a set of models or top-N lists efficiently, and update the notebook to use them. Add a `TextDirectoryCorpus` import in `corpora.__init__` so it can be imported from the package level, and update the notebook to use `corpora.TextDirectoryCorpus` instead of redefining it.
* Fix flake8 whitespace issues.
* Fix the order of imports in `corpora.__init__`.
* Fix setting of `topn` in `CoherenceModel.for_topics`, using `dict.pop` in place of checking and optionally getting and deleting `topn`.
* Fix a non-deterministic test failure in `test_coherencemodel`.
* Update the coherence model selection notebook to use the sklearn dataset loader to get the 20 newsgroups corpus. Add a `with_support` option to the confirmation measures to determine how many words were ignored during calculation. Add a `flatten` function to `utils` that recursively flattens an iterable into a list.
* Improve the robustness of coherence model comparison by using nanmean and mean value imputation when looping over the grid of top-N values to compute coherence for a model. Fix too-long logging statement lines in `text_analysis`.
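Taken together, the merged changes suggest usage roughly along these lines. This is a hedged sketch based on the commit summary above (the "c_w2v" coherence name, the `keyed_vectors` keyword, and the `topn` property), not a verified, version-pinned example; the embedding loaded via `gensim.downloader` is just a convenient stand-in for any pre-trained vectors.

```python
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

texts = [["human", "computer", "interface", "user"],
         ["graph", "trees", "minors", "system"],
         ["response", "time", "user", "survey"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2)

# Pre-trained, pre-loaded word embeddings are passed via `keyed_vectors`,
# so no word2vec model has to be trained on `texts`.
kv = api.load("glove-wiki-gigaword-50")

cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                    coherence="c_w2v", keyed_vectors=kv, topn=4)
print(cm.get_coherence())

# `topn` is a property: lowering it just shrinks the per-topic word lists,
# while raising it beyond what was accumulated uncaches the accumulator.
cm.topn = 2
print(cm.get_coherence())
```

The `for_models` / `for_topics` helpers and the `with_std` / `with_support` options mentioned in the summary extend the same pipeline to scoring many models or top-N lists efficiently and to reporting dispersion and support alongside the mean confirmation values.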
Description
As introduced in [1], topic coherence can be computed using word2vec similarity of terms. This is roughly compatible with the coherence evaluation framework posited by [2] and currently implemented in gensim. The probability estimation phase will involve training a word2vec model on the given corpus, and the confirmation measure will use the average cosine similarity between context vectors from the word2vec model. [1] did not analyze which segmentation strategy is optimal, but a reasonable guess is to use the same method as "c_v" coherence, since that method also uses context vectors in a semantic space.
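To make the confirmation step concrete, a minimal sketch of the idea follows: average the cosine similarities between the embedding vectors of the words on the two sides of a segmentation, skipping terms missing from the embedding vocabulary. The function name and signature here are illustrative, not the actual gensim implementation.

```python
import numpy as np

def word2vec_similarity(segment_w_prime, segment_w_star, keyed_vectors):
    """Average cosine similarity between the words of two topic segments.

    `keyed_vectors` is assumed to behave like gensim's KeyedVectors
    (supports `word in kv` and `kv.similarity(w1, w2)`).
    """
    sims = []
    for w1 in segment_w_prime:
        for w2 in segment_w_star:
            if w1 in keyed_vectors and w2 in keyed_vectors:
                sims.append(keyed_vectors.similarity(w1, w2))
    # Terms missing from the embedding vocabulary are skipped; if nothing
    # remains, the similarity is undefined for this segment pair.
    return float(np.mean(sims)) if sims else np.nan
```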
The implementation should include the ability to pass in a pre-trained word2vec model or path to the same.
Steps/Code/Corpus to Reproduce
I'm envisioning an API something like the following:
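A hedged sketch of what such a call might look like (the "c_w2v" name and `keyed_vectors` keyword follow the implementation that was eventually merged; accepting a filesystem path, as proposed above, is shown only in a comment):

```python
from gensim.corpora import Dictionary
from gensim.models import Word2Vec
from gensim.models.coherencemodel import CoherenceModel

texts = [["human", "computer", "interface", "user"],
         ["graph", "minors", "trees", "system"]] * 50  # repeated so word2vec has data
dictionary = Dictionary(texts)
topics = [["human", "computer", "interface"],
          ["graph", "trees", "minors"]]

# Either let the coherence pipeline train word2vec on `texts` itself ...
cm = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary,
                    coherence="c_w2v")

# ... or hand it an already-trained embedding, e.g. keyed_vectors=w2v.wv below,
# or (as envisioned in this issue) a path to a saved word2vec model.
w2v = Word2Vec(sentences=texts, min_count=1, seed=1)
cm = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary,
                    coherence="c_w2v", keyed_vectors=w2v.wv)
print(cm.get_coherence())
```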
Expected Results
Actual Results
Versions
This should be supported under all versions.
References
[1] D. O’Callaghan, D. Greene, J. Carthy, and P. Cunningham, “An analysis of the coherence of descriptors in topic modeling,” Expert Systems with Applications, vol. 42, no. 13, pp. 5645–5657, Aug. 2015.
[2] M. Röder, A. Both, and A. Hinneburg, “Exploring the Space of Topic Coherence Measures,” in Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (WSDM '15), 2015, pp. 399–408.