Allow use of truncated Dictionary for coherence measures #1349
Conversation
…culation by avoiding lookup of tokens not in the topic token lists.
…vant words, and ensure each relevant word has a set in the `per_topic_postings` dict.
…s representation of corpus)
…e in strided_windows and iter_windows utiltity functions
… and add initial impl of accumulator that directly tracks term occurrence and co-occurrence counts
… package for all confirmation measures in the CoherenceModel pipeline
…ord are interpreted as the occurrence; update tests to cover this case; change the p_boolean_sliding_window to use the WordOccurrenceAccumulator; minor cleanup in test_coherencemodel
…oolean_sliding_window; add parameter for CoherenceModel to adjust number of processes used, with default equal to max(1, cpu_count - 1)
…ine_similarity function
Thanks! It is a useful feature. Could you please add a couple of paragraphs describing the patch - what approaches you tried and why. May I ask what is the advantage of having the
…f accumulator in CoherenceModel, and various test fixes.
@tmylk thank you for your comments on the PR. I'm glad it is viewed as a useful feature.

The accumulator design is meant to abstract over the underlying mechanism by which word occurrences and co-occurrences are retrieved. The existing inverted-index-based technique is not very memory friendly. I have not been able to use it on the Wikipedia corpus, which is a common reference corpus in the literature. I have pushed code to optimize this, including a new accumulator which uses a

The accumulator design also leaves the door open for using Lucene or some other on-disk inverted index as a backend for retrieving occurrence counts. I have opened another ticket for something like this (#1351).

I've also updated the PR to include several optimizations for the probability estimation phase of coherence calculation. This includes a new strided approach for document windowing (see

I also broke up the various stages of the pipeline into different methods, so that a user can get coherence measures for individual topics and view the results of topic segmentation.

I realize these changes are somewhat beyond the scope of the original issue. They were motivated by your comment in that issue thread, "c_v is the best/recommended coherence so we should make it as fast as possible." I agree with this and was finding "c_v" unusably slow. I hope these contributions can make it more useful for more people.
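The strided windowing idea mentioned above can be sketched roughly as follows. This is a minimal illustration of the concept, not necessarily gensim's actual `strided_windows` implementation (which may use NumPy stride tricks): each document of length `n` yields `n - window_size + 1` overlapping "virtual" documents.

```python
def strided_windows(tokens, window_size):
    """Yield overlapping windows of `window_size` tokens, advancing one
    token at a time (illustrative sketch, not gensim's exact code)."""
    if window_size > len(tokens):
        return
    for start in range(len(tokens) - window_size + 1):
        yield tokens[start:start + window_size]

windows = list(strided_windows(['human', 'interface', 'computer', 'system'], 2))
```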
yield (tokens, (pageid, title))
else:
    yield tokens
except KeyboardInterrupt:
What is this for? Masking user interrupts is an anti-pattern; deserves a detailed comment, at the very least.
I've updated the code to handle the interrupt more appropriately. As for why I chose to handle the interrupt at all (at the risk of repeating the above comment): this pool may be active during many other phases of gensim execution if the underlying corpus/texts being iterated come from the wikicorpus. It was confusing for me to see stdout activity when issuing an interrupt during execution of an entirely different code path.
Sorry, not sure I understand.
When iterating over a WikiCorpus, gensim uses a multiprocessing.Pool (forks), yes.
Now when you do CTRL+C, what happened (old behaviour)? And what happens now (after your changes here)? Why is that preferable?
Here is the code I am running:
import os
import logging
import gensim
from gensim.models.coherencemodel import CoherenceModel
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
home = '/Users/user/workshop/nlp'
id2word = gensim.corpora.Dictionary.load_from_text(os.path.join(home, 'data', 'wikipedia-may-17_wordids.txt.bz2'))
texts = gensim.corpora.WikiCorpus(os.path.join(home, 'data', 'enwiki-latest-pages-articles.xml.bz2'), dictionary=id2word).get_texts()
lda = gensim.models.LdaModel.load(os.path.join(home, 'models', 'lda-k100_en-wiki.model'))
topics = [[t[0] for t in lda.get_topic_terms(i, 20)] for i in range(100)]
topic_words = [[id2word[token_id] for token_id in t] for t in topics]
cm = gensim.models.coherencemodel.CoherenceModel(topics=topic_words, texts=texts, dictionary=id2word, coherence='c_v', topn=20, window_size=110)
topic_coherences = cm.get_coherence_per_topic()
print(topic_coherences)
Before:
2017-05-30 07:34:29,027 : INFO : loading LdaModel object from /Users/user/workshop/nlp/models/lda-k100_en-wiki.model
2017-05-30 07:34:29,028 : INFO : loading expElogbeta from /Users/user/workshop/nlp/models/lda-k100_en-wiki.model.expElogbeta.npy with mmap=None
2017-05-30 07:34:29,080 : INFO : setting ignored attribute id2word to None
2017-05-30 07:34:29,080 : INFO : setting ignored attribute state to None
2017-05-30 07:34:29,080 : INFO : setting ignored attribute dispatcher to None
2017-05-30 07:34:29,080 : INFO : loaded /Users/user/workshop/nlp/models/lda-k100_en-wiki.model
2017-05-30 07:34:29,081 : INFO : loading LdaModel object from /Users/user/workshop/nlp/models/lda-k100_en-wiki.model.state
2017-05-30 07:34:29,249 : INFO : loaded /Users/user/workshop/nlp/models/lda-k100_en-wiki.model.state
2017-05-30 07:34:32,603 : INFO : using ParallelWordOccurrenceAccumulator(processes=7, batch_size=16) to estimate probabilities from sliding windows
2017-05-30 07:34:33,101 : INFO : submitted 0 batches to accumulate stats from 0 documents (90752 virtual)
2017-05-30 07:34:33,298 : INFO : submitted 1 batches to accumulate stats from 16 documents (154816 virtual)
2017-05-30 07:34:33,462 : INFO : submitted 2 batches to accumulate stats from 32 documents (223497 virtual)
2017-05-30 07:34:33,709 : INFO : submitted 3 batches to accumulate stats from 48 documents (285340 virtual)
2017-05-30 07:34:33,790 : INFO : submitted 4 batches to accumulate stats from 64 documents (342337 virtual)
2017-05-30 07:34:34,297 : INFO : submitted 5 batches to accumulate stats from 80 documents (415139 virtual)
2017-05-30 07:34:34,872 : INFO : submitted 6 batches to accumulate stats from 96 documents (484709 virtual)
2017-05-30 07:34:35,093 : INFO : submitted 7 batches to accumulate stats from 112 documents (542834 virtual)
2017-05-30 07:34:35,381 : INFO : submitted 8 batches to accumulate stats from 128 documents (628469 virtual)
2017-05-30 07:34:35,443 : INFO : submitted 9 batches to accumulate stats from 144 documents (691420 virtual)
2017-05-30 07:34:35,764 : INFO : submitted 10 batches to accumulate stats from 160 documents (741122 virtual)
2017-05-30 07:34:35,983 : INFO : submitted 11 batches to accumulate stats from 176 documents (774924 virtual)
2017-05-30 07:34:36,234 : INFO : submitted 12 batches to accumulate stats from 192 documents (829056 virtual)
2017-05-30 07:34:36,682 : INFO : submitted 13 batches to accumulate stats from 208 documents (887935 virtual)
^C2017-05-30 07:34:48,466 : WARNING : stats accumulation interrupted; <= 887935 documents processed
Process PoolWorker-12:
Process PoolWorker-10:
Process PoolWorker-11:
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Process PoolWorker-9:
Traceback (most recent call last):
Process PoolWorker-8:
Process PoolWorker-13:
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Process PoolWorker-14:
Traceback (most recent call last):
Traceback (most recent call last):
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
Traceback (most recent call last):
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
self.run()
self.run()
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
self.run()
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
self._target(*self._args, **self._kwargs)
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
self.run()
self.run()
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
self.run()
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
self._target(*self._args, **self._kwargs)
self._target(*self._args, **self._kwargs)
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
self.run()
self._target(*self._args, **self._kwargs)
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/process.py", line 114, in run
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
self._target(*self._args, **self._kwargs)
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/pool.py", line 102, in worker
2017-05-30 07:34:48,467 : INFO : AccumulatingWorker interrupted after processing 8598 documents
2017-05-30 07:34:48,468 : INFO : AccumulatingWorker interrupted after processing 3630 documents
2017-05-30 07:34:48,467 : INFO : AccumulatingWorker interrupted after processing 7670 documents
2017-05-30 07:34:48,468 : INFO : AccumulatingWorker interrupted after processing 5859 documents
2017-05-30 07:34:48,468 : INFO : AccumulatingWorker interrupted after processing 9993 documents
2017-05-30 07:34:48,468 : INFO : AccumulatingWorker interrupted after processing 8987 documents
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
2017-05-30 07:34:48,468 : INFO : AccumulatingWorker interrupted after processing 5062 documents
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
task = get()
task = get()
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 376, in get
task = get()
task = get()
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 376, in get
task = get()
task = get()
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 376, in get
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 376, in get
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 378, in get
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 376, in get
task = get()
File "/Users/user/anaconda2/lib/python2.7/multiprocessing/queues.py", line 376, in get
racquire()
racquire()
racquire()
racquire()
racquire()
racquire()
return recv()
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
KeyboardInterrupt
2017-05-30 07:34:48,469 : INFO : serializing accumulator to return to master...
2017-05-30 07:34:48,471 : INFO : accumulator serialized
2017-05-30 07:34:48,473 : INFO : accumulator serialized
2017-05-30 07:34:48,471 : INFO : accumulator serialized
2017-05-30 07:34:48,473 : INFO : accumulator serialized
2017-05-30 07:34:48,471 : INFO : accumulator serialized
2017-05-30 07:34:48,474 : INFO : accumulator serialized
2017-05-30 07:34:48,474 : INFO : accumulator serialized
2017-05-30 07:34:50,727 : INFO : 7 accumulators retrieved from output queue
After:
2017-05-30 07:29:17,246 : WARNING : Slow version of gensim.models.doc2vec is being used
2017-05-30 07:29:17,253 : INFO : 'pattern' package not found; tag filters are not available for English
2017-05-30 07:29:18,188 : INFO : loading LdaModel object from /Users/user/workshop/nlp/models/lda-k100_en-wiki.model
2017-05-30 07:29:18,198 : INFO : loading expElogbeta from /Users/user/workshop/nlp/models/lda-k100_en-wiki.model.expElogbeta.npy with mmap=None
2017-05-30 07:29:18,267 : INFO : setting ignored attribute id2word to None
2017-05-30 07:29:18,267 : INFO : setting ignored attribute state to None
2017-05-30 07:29:18,267 : INFO : setting ignored attribute dispatcher to None
2017-05-30 07:29:18,267 : INFO : loaded /Users/user/workshop/nlp/models/lda-k100_en-wiki.model
2017-05-30 07:29:18,267 : INFO : loading LdaModel object from /Users/user/workshop/nlp/models/lda-k100_en-wiki.model.state
2017-05-30 07:29:18,422 : INFO : loaded /Users/user/workshop/nlp/models/lda-k100_en-wiki.model.state
2017-05-30 07:29:20,667 : INFO : using ParallelWordOccurrenceAccumulator(processes=7, batch_size=16) to estimate probabilities from sliding windows
2017-05-30 07:29:21,117 : INFO : submitted 0 batches to accumulate stats from 0 documents (90752 virtual)
2017-05-30 07:29:21,312 : INFO : submitted 1 batches to accumulate stats from 16 documents (154816 virtual)
2017-05-30 07:29:21,441 : INFO : submitted 2 batches to accumulate stats from 32 documents (223497 virtual)
2017-05-30 07:29:21,715 : INFO : submitted 3 batches to accumulate stats from 48 documents (285340 virtual)
2017-05-30 07:29:21,783 : INFO : submitted 4 batches to accumulate stats from 64 documents (342337 virtual)
2017-05-30 07:29:22,270 : INFO : submitted 5 batches to accumulate stats from 80 documents (415139 virtual)
2017-05-30 07:29:22,634 : INFO : submitted 6 batches to accumulate stats from 96 documents (484709 virtual)
2017-05-30 07:29:22,701 : INFO : submitted 7 batches to accumulate stats from 112 documents (542834 virtual)
2017-05-30 07:29:22,995 : INFO : submitted 8 batches to accumulate stats from 128 documents (628469 virtual)
2017-05-30 07:29:23,118 : INFO : submitted 9 batches to accumulate stats from 144 documents (691420 virtual)
2017-05-30 07:29:23,238 : INFO : submitted 10 batches to accumulate stats from 160 documents (741122 virtual)
2017-05-30 07:29:23,336 : INFO : submitted 11 batches to accumulate stats from 176 documents (774924 virtual)
2017-05-30 07:29:23,471 : INFO : submitted 12 batches to accumulate stats from 192 documents (829056 virtual)
2017-05-30 07:29:23,665 : INFO : submitted 13 batches to accumulate stats from 208 documents (887935 virtual)
^C2017-05-30 07:29:58,875 : WARNING : stats accumulation interrupted; <= 887935 documents processed
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 17775 documents
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 27228 documents
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 20910 documents
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 26774 documents
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 31392 documents
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 32224 documents
2017-05-30 07:29:58,876 : INFO : AccumulatingWorker interrupted after processing 22497 documents
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,877 : INFO : serializing accumulator to return to master...
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:29:58,878 : INFO : accumulator serialized
2017-05-30 07:30:01,106 : INFO : 7 accumulators retrieved from output queue
So the advantage is a cleaner log?
What are the disadvantages? (I am not familiar with this type of functionality, but I assume there must be some disadvantages, otherwise it would be the default behaviour).
gensim/corpora/wikicorpus.py
Outdated
@@ -300,22 +305,26 @@ def get_texts(self):
     articles, articles_all = 0, 0
     positions, positions_all = 0, 0
     texts = ((text, self.lemmatize, title, pageid) for title, text, pageid in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
-    pool = multiprocessing.Pool(self.processes)
+    pool = multiprocessing.Pool(self.processes, init_worker)
What is this change about? Needs a comment.
@piskvorky I was running get_coherence using coherence="c_v" and trying to program it to gracefully handle KeyboardInterrupt. You can see the mechanisms I put in place for this in the text_analysis module in the PR. While doing this, I faced some confusion because some background process was still raising KeyboardInterrupt. After some digging, I noticed that the wikicorpus pool workers were the culprit.

This pool may be active during many other phases of gensim execution if the underlying corpus/texts being iterated come from the wikicorpus. I think it makes things slightly cleaner in this case to handle the KeyboardInterrupt in some manner that does not propagate. Perhaps some sort of logging would improve this?
gensim/models/coherencemodel.py
Outdated
return not current_set.issuperset(new_set)

def _topics_differ(self, new_topics):
    return (new_topics is not None and
No vertical indent -- please use hanging indent.
done
gensim/models/coherencemodel.py
Outdated
return (new_topics is not None and
        self._topics is not None and
        self._accumulator is not None and
        not np.equal(new_topics, self._topics).all())
Not sure what type the arguments are, but is np.allclose applicable?
list of np.ndarray of ints, so there should be no need for np.allclose.
Wow, impressive refactor! Thanks a lot.
…rpus.get_texts`; instead, log warning and do not set `length`.
…and non-empty blank lines in `text_analysis`.
…ctor for variable interpretability.
@piskvorky thank you for your review and consideration of this PR. I've responded to your comments and pushed changes. Please let me know if you would like me to make any additional changes and I will gladly do so.
cc @dsquareindia, the original contributor of coherence to the package
…ith repeated counting of tokens that occur more than once in a window.
… module; cleaned up spacing in coherencemodel.
…acking and avoid undue network traffic by moving relevancy filtering and token conversion to the master process.
…sing a `collections.Counter` instance for accumulation within a batch.
I've pushed a few further optimizations to the
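The Counter-based batch accumulation and the once-per-window counting fix from the commits above can be sketched as follows. The names here are illustrative, not gensim's exact internals: each relevant word is counted at most once per window, and co-occurrences are tracked as sorted pairs.

```python
from collections import Counter
from itertools import combinations


def accumulate_batch(windows, relevant_words):
    """Count occurrences and pairwise co-occurrences of relevant words,
    counting each word at most once per window (illustrative sketch)."""
    occurrences = Counter()
    co_occurrences = Counter()
    for window in windows:
        # set() collapses repeats within a window; sorting gives
        # canonical pair ordering for the co-occurrence keys.
        present = sorted(set(window) & relevant_words)
        occurrences.update(present)
        co_occurrences.update(combinations(present, 2))
    return occurrences, co_occurrences


occ, co_occ = accumulate_batch(
    [['user', 'graph', 'user'], ['graph', 'minors']],
    {'user', 'graph', 'minors'})
```

Note that `'user'` appears twice in the first window but is counted once, matching the repeated-counting fix mentioned in the commit messages.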
except Exception as e:
    logger.error("worker encountered unexpected exception: %s" % e)
    logger.error(traceback.format_exc())
logger.error("worker encountered unexpected exception: %s\n%s",
logger.exception simpler?
done
q.put(batch, block=True)
before = self._num_docs / self.log_every
self._num_docs += sum(len(doc) - window_size + 1 for doc in batch)
if before < (self._num_docs / self.log_every):
    logger.info("submitted %d batches to accumulate stats from %d documents (%d virtual)" % (
        batch_num, (batch_num + 1) * self.batch_size, self._num_docs))
    logger.info("%d batches submitted to accumulate stats from %d documents (%d "
Hanging indent please (here and elsewhere).
I'm not sure I understand the desired format here. The hanging indent makes sense when using the % operator to format a string. However, my desire here is to use the lazy formatting of the logging module instead. In this case, the string and all following format arguments are equivalent arguments of the call, so PEP 8 dictates alignment with the opening delimiter.
… empty line at end of `util` module.
…he Dictionary mapping to different ids, so fixed the `probability_estimation` tests to be agnostic of this. Also fixed an issue with the interpretation of strings as iterables when getting occurrences of strings in the `text_analysis.BaseAnalyzer.__getitem__` method.
…rencemodel accumulator caching; made model a property with a setter that also sets the topics and uncaches the accumulator if the model's topics have ids not tracked by the accumulator.
@piskvorky I have replied to your requests for changes and made the PR compatible with python3. Thanks again for the feedback! Are there any additional changes you would like?
gensim/test/test_text_analysis.py
Outdated
def setUp(self):
    self.dictionary = BaseTestCases.TextAnalyzerTestBase.dictionary
    self.top_ids = BaseTestCases.TextAnalyzerTestBase.top_ids
    self.corpus = [self.dictionary.doc2bow(doc)
Hanging indent please (here and elsewhere).
this
self.corpus = \
[self.dictionary.doc2bow(doc) for doc in BaseTestCases.TextAnalyzerTestBase.texts]
or this?
self.corpus = [
self.dictionary.doc2bow(doc) for doc in BaseTestCases.TextAnalyzerTestBase.texts]
gensim/test/test_text_analysis.py
Outdated

def test_index_accumulation(self):
    accumulator = CorpusAccumulator(self.top_ids)\
        .accumulate(self.corpus)
Why this line break? Better keep on a single line (more readable).
It was for consistency with the other places, where the call did not fit well on a single line (too long). I've changed to a single line.
gensim/corpora/wikicorpus.py
Outdated
" (total %i articles, %i positions before pruning articles shorter than %i words)",
articles, positions, articles_all, positions_all, ARTICLE_MIN_WORDS)
self.length = articles  # cache corpus length
texts = ((text, self.lemmatize, title, pageid)
Hanging indent.
When you ask for hanging indents for list comprehensions, can you give me an example of the style you prefer? Below are a few alternatives that come to mind. The first is the one I would assume you are asking for, but I've seen all of these before, so just want to clarify.
texts = (
(text, self.lemmatize, title, pageid)
for title, text, pageid
in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
or
texts = (
(text, self.lemmatize, title, pageid)
for title, text, pageid
in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces)
)
or
texts = \
((text, self.lemmatize, title, pageid)
for title, text, pageid
in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
gensim/corpora/wikicorpus.py
Outdated
texts = ((text, self.lemmatize, title, pageid)
         for title, text, pageid
         in extract_pages(bz2.BZ2File(self.fname), self.filter_namespaces))
pool = multiprocessing.Pool(self.processes, init_to_ignore_interrupt)
Shouldn't the original interrupts be restored once the pool is over (exception/terminated)?
The interrupts are only disabled in the children, which are terminated with the Pool. I'm not sure why, but there is no place to reply to your comment about the disadvantages. The disadvantage, to my understanding, is that termination of the children must now be performed by the parent, instead of occurring exactly when the KeyboardInterrupt is sent to them. In this case, that is being done explicitly by calling pool.terminate(). However, according to the docs, even if a second KeyboardInterrupt is issued just after the first and the explicit call is not made, the GC will take care of calling it. The advantage here is a cleaner log, yes, but also not leaking the details of the wikicorpus iteration (multiprocessing or not), which can confuse developers and users when they interrupt some other part of the code.
I see, thanks.
gensim/models/coherencemodel.py
Outdated
'c_npmi': make_pipeline(segmentation.s_one_one,
COHERENCE_MEASURES = {
    'u_mass': _make_pipeline(segmentation.s_one_pre,
                             probability_estimation.p_boolean_document,
Hanging indent (here and other places below).
done
gensim/models/coherencemodel.py
Outdated

def _topics_differ(self, new_topics):
    return (new_topics is not None and
            self._topics is not None and
The indentation looks weird... a single line better?
This is, IMO, more confusing if limited to one line. I changed this to the standard hanging indent (8 spaces).
gensim/models/coherencemodel.py
Outdated
def _topics_differ(self, new_topics):
    return (new_topics is not None and
            self._topics is not None and
            not np.array_equal(new_topics, self._topics))
Exact equality or "numerically close" (np.allclose) desired?
exact, since these are integers
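Since the topics are lists of integer token-id arrays, exact comparison suffices; a minimal sketch (topic ids here are made up for illustration):

```python
import numpy as np

# np.allclose is for floating-point tolerance; integer token ids
# compare exactly, so np.array_equal is the right tool.
topics_a = [np.array([4, 15, 2]), np.array([7, 1, 9])]
topics_b = [np.array([4, 15, 2]), np.array([7, 1, 9])]
same = np.array_equal(topics_a, topics_b)
```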
gensim/test/test_text_analysis.py
Outdated
top_ids = set(token2id.values())

texts2 = [['human', 'interface', 'computer'],
          ['survey', 'user', 'computer', 'system', 'response', 'time'],
Hanging indent.
done
    accumulator = WordOccurrenceAccumulator(top_ids, dictionary)
else:
    accumulator = ParallelWordOccurrenceAccumulator(processes, top_ids, dictionary)
logger.info("using %s to estimate probabilities from sliding windows" % accumulator)
Pass the argument as a function argument, instead of formatting the string directly with %.
done
… to return individual topic coherence values, then average those. Make the `ParallelWordOccurrenceAccumulator` return a `WordOccurrenceAccumulator` after accumulation, so it can be trained further afterwards if desired.
… individual topic coherence values, then average those.
@piskvorky I believe I've addressed all the change requests from your last review. Thank you for the review. Is there anything else you'd like changed?
Minor code style comments.
gensim/test/test_coherencemodel.py
vw_path = os.environ.get('VOWPAL_WABBIT_PATH', None)
if not vw_path:
    msg = "Environment variable 'VOWPAL_WABBIT_PATH' not specified, skipping sanity checks for LDA Model"
    logging.info(msg)
    logging.info("Environment variable 'VOWPAL_WABBIT_PATH' not specified,"
Hanging indent please.
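For illustration, a sketch of the hanging-indent style being requested: break after the opening delimiter and indent the continuation, letting implicit string concatenation join the pieces, rather than splitting the call at an arbitrary column (the path and message follow the snippet under review):

```python
import logging
import os

vw_path = os.environ.get('VOWPAL_WABBIT_PATH', None)

# Hanging indent: the literal is broken after the opening paren and the
# continuation is indented; adjacent literals concatenate into one message.
msg = (
    "Environment variable 'VOWPAL_WABBIT_PATH' not specified, "
    "skipping sanity checks for LDA Model"
)

if not vw_path:
    logging.info(msg)
```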
gensim/test/test_coherencemodel.py
@@ -63,162 +55,201 @@ def setUp(self):
              ['graph', 'minors', 'trees', 'eps']]
    self.topics2 = [['user', 'graph', 'minors', 'system'],
                    ['time', 'graph', 'survey', 'minors']]
    self.ldamodel = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2, passes=0, iterations=0)
    self.ldamodel = LdaModel(corpus=self.corpus, id2word=self.dictionary, num_topics=2,
Hanging indent please.
gensim/test/test_coherencemodel.py
class TestCoherenceModel(unittest.TestCase):

    # set up vars used in testing ("Deerwester" from the web tutorial)
    texts = [['human', 'interface', 'computer'],
Hanging indent please.
gensim/test/test_coherencemodel.py
def testErrors(self):
    """Test if errors are raised on bad input"""
    # not providing dictionary
    self.assertRaises(ValueError, CoherenceModel, topics=self.topics1, corpus=corpus, coherence='u_mass')
    self.assertRaises(ValueError, CoherenceModel, topics=self.topics1, corpus=self.corpus,
Hanging indent (here and below).
for s_i in segmented_topics:
    for word_id in itertools.chain.from_iterable(s_i):
        if hasattr(word_id, '__iter__'):
            top_ids = top_ids.union(word_id)
update() simpler?
yes, good call
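A sketch of the suggested simplification (the segment data below is a toy shape, not gensim's exact structure): set.update() mutates the set in place, so no intermediate set is allocated and rebound per segment, unlike union():

```python
import itertools

# Toy segmented topics: each segment pairs ids, where a member may
# itself be a tuple of ids.
segmented_topics = [[(1, 2), (3, (4, 5))], [((1, 3), 2)]]

top_ids = set()
for s_i in segmented_topics:
    for word_id in itertools.chain.from_iterable(s_i):
        if hasattr(word_id, '__iter__'):
            top_ids.update(word_id)  # in place; union() would copy the set
        else:
            top_ids.add(word_id)

print(sorted(top_ids))  # [1, 2, 3, 4, 5]
```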
…for unique ids from topic segments.
@piskvorky the most recent review has been addressed and all requested changes have been made.
@@ -70,7 +70,7 @@ def unique_ids_from_segments(segmented_topics):
    for s_i in segmented_topics:
        for word_id in itertools.chain.from_iterable(s_i):
            if hasattr(word_id, '__iter__'):
Are the ids guaranteed to be integers? Because if they're strings, the __iter__ could get nasty (you can iterate over strings, yielding individual characters).
Yes, they will always be integers; I have updated the docstrings to better reflect this. The segmented topics will always come from the segmentation module, and the setter in the CoherenceModel ensures the segmentation module receives lists of integer arrays to be segmented.
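The pitfall raised above can be demonstrated in isolation (the helper below is a simplified, hypothetical stand-in for unique_ids_from_segments):

```python
def collect_ids(flat_ids):
    # Simplified stand-in for the id-collection loop: iterables are
    # merged in, scalars are added individually.
    out = set()
    for word_id in flat_ids:
        if hasattr(word_id, '__iter__'):
            out.update(word_id)
        else:
            out.add(word_id)
    return out

# Integer ids behave as intended: ints are not iterable.
print(collect_ids([1, (2, 3)]))  # {1, 2, 3}

# String ids would be silently exploded into single characters,
# which is why the integer-id guarantee matters.
print(sorted(collect_ids(['graph', 'user'])))  # ['a', 'e', 'g', 'h', 'p', 'r', 's', 'u']
```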
@piskvorky @tmylk Anything else to change for this? Thanks! |
Hi @macks22! Thanks for this amazing overhaul! I guess it would be fun to see how much of a performance gain are we getting now compared to the previous code. It would also make for an interesting blog post maybe? |
+1 on @dsquareindia 's idea. How will the users find out about this new cool functionality? Some sort of Notebook tutorial (motivation, quick start) with a benchmark would be awesome. |
Sorry for the late response, your PR is great 👍 Thank you for the wonderful work @macks22 🥇 |
@dsquareindia @piskvorky @menshikh-iv
Hi @macks22,
Can you fix this?
@menshikh-iv that's strange, I can't think of what might be causing that. I can try to replicate on a VM but likely won't be able to find time to get a Windows dev environment set up until this weekend.
Resolves #1342.