-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
EnsembleLda #2282
EnsembleLda #2282
Conversation
CC @piskvorky, wdyt about this feature (not PR itself), is it a good idea to add it to gensim? |
ping @piskvorky |
Definitely! Getting more robust LDA topics sounds like an awesome feature (and the accompanying descriptions/motivations are exemplary – we can point to this in future PRs). I didn't have time to review the PR code, but yes, we prefer to avoid pandas (complex dependency, unclear memory semantics, encourages bad engineering patterns). Can you use normal list/dict comprehensions instead? Cheers. |
…into EnsembleLda
Minor fix: when running |
I'm working on it and will let you know when I'm done. |
…tion to simply pickling the whole thing.
…into EnsembleLda
EDIT: avoided the problem in the tests. Taking the distance matrix from the reference and continuing to work with that, then checking if the results are as expected. Checking the distance matrix only roughly to avoid failuring on (rounding?) problems with the cosine distance (maybe because it's due to the vectors being very long?). The pandas dependency is completely removed btw. The tests I have supplied work locally in the following environments: 4.19.32-1-MANJARO
Docker container with Ubuntu 18.04
Docker container with travisci/ci-garnet:packer-1515445631-7dfb2e1
|
… differences across architectures
@aloosley and me think this is ready to merge (if the tests pass) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking very good.
Left you some comments. Please have a look and let me know when you're ready for another review.
In general:
- Tighter scope for context managers and try-catch
- PEP8
- gensim-specific formatting nitpicks (hanging indents, avoid .format, etc)
- Remove code comments that add little value.
See the comments for details.
I made it around halfway through ensemblelda.py before I ran out of time - I'll check the rest next time.
…singletons to list
…into EnsembleLda
…model. * Added more :meth: and `` `` styling for RST, fixed a few typos
Thanks for your efforts in reviewing our stuff! haven't checked if opinosis still works after the changes, don't merge yet. Gonna check the following weekend or something |
looks good. @mpenkov it's ready for another review |
@piskvorky @mpenkov any updates? |
Sorry, the beginning of the year was super busy. We have a bunch of PRs lined up, we'll get to this with @mpenkov shortly! Thanks. |
Ah I see, don't worry at all! |
I'm going to reopen this PR from my own fork, because I don't get access to the repo anymore, possibly due to organizational changes within that company. |
Tests
We have some tests that we can also commit when this PR is of interest, I will polish them and embed them into the unittest module in that case.They are commited.
Materials
As requested by @piskvorky, here are the materials that describe our module:
eLDA_algo_overview.pdf
eLDA_motivation.pdf
eLDA_when_to_use.pdf
Code ReviewThere are two places in "gensim/models/ensemblelda.py" for which I especially would like to have review:lines 611 and 784: The error handling of multiprocessingline 1124 and following: The way I provide functions crucial to gensims ldamodel api. Those functions just forward the calls to an internal object. So are they too redundant or do you think they make things easier? Is there a better way to do so?And another one for the "tox -e flake8" test, which does not pass here:would you prefer the implemention not to use pandas, since it has not been a dependency of gensim before?