Allow use of truncated Dictionary for coherence measures #1342
Comments
Hello @macks22, thank you for your report. I need some time to reproduce the bug. It seems to me that the problem is in the filtering. Could you please …
I'm fairly certain that is the issue; I've modified the …
@macks22 …
- …culation by avoiding lookup of tokens not in the topic token lists.
- …vant words, and ensure each relevant word has a set in the `per_topic_postings` dict.
@menshikh-iv thank you for the suggestion, but unfortunately the code path for …
c_v is the best/recommended coherence, so we should make it as fast as possible …
- …f accumulator in CoherenceModel, and various test fixes.
- …rpus.get_texts`; instead, log a warning and do not set `length`.
- …and non-empty blank lines in `text_analysis`.
- …ctor for variable interpretability.
- …ith repeated counting of tokens that occur more than once in a window.
- … module; cleaned up spacing in coherencemodel.
- …acking and avoid undue network traffic by moving relevancy filtering and token conversion to the master process.
- …sing a `collections.Counter` instance for accumulation within a batch.
- … empty line at end of `util` module.
- …he Dictionary mapping to different ids, so fixed the `probability_estimation` tests to be agnostic of this. Also fixed an issue with the interpretation of strings as iterables when getting occurrences of strings in the `text_analysis.BaseAnalyzer.__getitem__` method.
- …rencemodel accumulator caching; made model a property with a setter that also sets the topics and uncaches the accumulator if the model's topics have ids not tracked by the accumulator.
- … to return individual topic coherence values, then average those. Make the `ParallelWordOccurrenceAccumulator` return a `WordOccurrenceAccumulator` after accumulation, so it can be trained further afterwards if desired.
- … individual topic coherence values, then average those.
- …for unique ids from topic segments.
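Taken together, the commits above describe two accumulation fixes: skip tokens that are not among the topics' relevant ids (so a truncated `Dictionary` never causes a failed lookup), and count each token at most once per window. A minimal sketch of that strategy, using a `collections.Counter` for accumulation as the commits mention; the `accumulate` function, `token2id` mapping, and `relevant_ids` argument are illustrative names, not gensim's actual API:

```python
from collections import Counter
from itertools import combinations

def accumulate(windows, token2id, relevant_ids):
    """Count occurrences and co-occurrences of relevant tokens per window.

    Hypothetical helper illustrating the fix: tokens absent from the
    (possibly truncated) dictionary or from the topics' relevant ids are
    skipped, and using a set means each token counts at most once per window.
    """
    occurrences = Counter()
    co_occurrences = Counter()
    for window in windows:
        ids = {token2id[t] for t in window if t in token2id} & relevant_ids
        occurrences.update(ids)
        co_occurrences.update(combinations(sorted(ids), 2))
    return occurrences, co_occurrences

windows = [["graph", "trees", "graph"], ["graph", "minors"]]
token2id = {"graph": 0, "trees": 1, "minors": 2}
occ, co = accumulate(windows, token2id, relevant_ids={0, 1, 2})
# occ == Counter({0: 2, 1: 1, 2: 1}): "graph" is counted once in the first
# window despite appearing twice, and out-of-dictionary tokens never raise.
```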
Description
I used the `make_wikicorpus.py` script to parse the latest English Wikipedia. This script filters the extremes in the `Dictionary` using `wiki.dictionary.filter_extremes(no_below=20, no_above=0.1, keep_n=100000)`. I then trained an LDA model on the corpus (see code below). Afterwards, I tried to compute the `c_v` coherence on this corpus using the `Dictionary` produced by the `make_wikicorpus.py` script. This failed because the `Dictionary` does not contain all terms in the Wikipedia corpus. It should not fail: the missing terms are not in the `top_ids` set drawn from the actual topics, and are therefore not relevant to the coherence computation. Coherence calculation should be possible with the same dictionary used to train the model, even if that dictionary does not contain all tokens in the texts.
Steps/Code/Corpus to Reproduce
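A minimal sketch of the workflow described above, assuming a toy tokenized corpus in place of the original Wikipedia texts; the original run used `filter_extremes(no_below=20, no_above=0.1, keep_n=100000)`, while the thresholds below are loosened so the toy dictionary stays non-empty:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Toy stand-in for the tokenized Wikipedia articles.
texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "system"],
    ["graph", "minors", "trees", "survey"],
]

dictionary = Dictionary(texts)
# Truncate the dictionary as make_wikicorpus.py does; tokens dropped here
# still occur in `texts`, which is what triggers the failure.
dictionary.filter_extremes(no_below=2, no_above=0.9, keep_n=100)

corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, random_state=0)

# Before the fix, this raised because `texts` contains tokens missing from
# the truncated `dictionary`, even though those tokens cannot appear in any
# topic's top_ids.
cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
print(cm.get_coherence())
```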
Expected Results
Coherence measure computed from the Wikipedia corpus using the `Dictionary` built by the `make_wikicorpus.py` script.
Actual Results
Versions