EnsembleLda #2282

Closed
wants to merge 133 commits into from
Changes from 129 commits
7b73db9
added EnsembleLda
Dec 1, 2018
241c17e
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
Dec 1, 2018
51945e4
Merge branch 'master' of https://github.com/rare-technologies/gensim …
Mar 12, 2019
a67d5db
improvements to add_model, various small changes to comments and code
Apr 5, 2019
e27be0a
pandas -> numpy: group by label and mean
Apr 5, 2019
83de2dd
pandas -> numpy: generate_stable_topics
Apr 6, 2019
2af1658
pandas -> numpy: distance matrix creation
Apr 7, 2019
100bbf0
pandas -> numpy: CBDBSCAN
Apr 7, 2019
aff3287
fixes for automated checks
Apr 7, 2019
a545ddf
improvements on logs, comments and variable naming. Changed save func…
Apr 8, 2019
d1a6854
minor fix in log message format
Apr 8, 2019
3650895
added tests
Apr 9, 2019
00a06e9
fixed test
Apr 9, 2019
f5f1c9c
removed some dead leftover pandas code from test
Apr 9, 2019
c32ddad
removed pathlib from test
Apr 9, 2019
dab067f
tests work in python2 locally now
Apr 12, 2019
dcc77ef
Merge branch 'master' of https://github.com/rare-technologies/gensim …
Apr 12, 2019
eb9ea27
updated ensemble test reference model
Apr 12, 2019
6b0dc77
passing tox8
Apr 12, 2019
6dc6001
improved determinism of methods
Apr 13, 2019
3ec31e7
improved order of assertions
Apr 13, 2019
7afd192
trying to achieve higher precision with float64 to avoid some sorting…
Apr 13, 2019
16d0357
better approach for comparing with pretrained model
Apr 14, 2019
01b68e4
potentially fixing the tests on windows
Apr 14, 2019
9314cb4
potentially fixing the tests on windows
Apr 14, 2019
b282393
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
Apr 20, 2019
0b7febc
changed citation of opinosis
Apr 21, 2019
60a717d
tox8 test passing after small change on opinosis comments/citation
Apr 22, 2019
2ff60ca
Moving max_random_state inside the model as a private variable.
aloosley Jun 25, 2019
d36fe43
removed whitespace
aloosley Jun 25, 2019
1507adf
docstring width
Jun 25, 2019
7577aca
sphinx udpate
aloosley Jun 25, 2019
301feac
fixed urls to sphinx notation
Jun 25, 2019
64b157e
Merge branch 'EnsembleLda_ReviewJune2019' of https://github.com/DataR…
Jun 25, 2019
0f4a6b8
changed doc strings, number --> int + some sphinx
aloosley Jun 25, 2019
b85fd95
Merge branch 'EnsembleLda_ReviewJune2019' of https://github.com/DataR…
Jun 25, 2019
f915f50
Removed hanging indents.
aloosley Jun 25, 2019
a3161cd
improved topic_model_kind type checking
Jun 25, 2019
c63c889
merge
Jun 25, 2019
52c239b
Sphinx and docstring updates.
aloosley Jun 25, 2019
cb362b5
Merge branch 'EnsembleLda_ReviewJune2019' of github.com:DataReply/gen…
aloosley Jun 25, 2019
ffc8e10
review stuff
Jun 25, 2019
e54e78c
Merge branch 'EnsembleLda_ReviewJune2019' of https://github.com/DataR…
Jun 25, 2019
2d269a5
removed unneccessary comments
Jun 25, 2019
0612c4a
Update gensim/models/ensemblelda.py
Jun 25, 2019
24b34b1
removed paranthesis
Jun 25, 2019
d96e1a1
review
Jun 25, 2019
a1e3d95
refactor private, hanging indent
Jun 25, 2019
5d48c8d
typo
Jun 25, 2019
e556e8c
Clarifications to ttda in docstrings and in method docstrings.
aloosley Jun 25, 2019
4ac43d0
solved merge conflict
aloosley Jun 25, 2019
7603045
merge conflict fixed
aloosley Jul 29, 2019
dc566f3
docstrings, masks explained and mask warning removed
Jul 29, 2019
9d57533
created internal variable for cosine distance calculations
Jul 29, 2019
d27cd59
cbdbscan docstring
Aug 28, 2019
42aa7ad
moved validate_core outside
Aug 28, 2019
eaf62d6
added citation note
aloosley Aug 28, 2019
2e2eb16
moved more stuff outside of _generate_stable_topics
Aug 28, 2019
5d23f1c
Merge branch 'EnsembleLda' of https://github.com/DataReply/gensim int…
Aug 28, 2019
0a6c1f6
typos
Aug 28, 2019
0002982
explained CBDBSCAN
aloosley Aug 28, 2019
954659a
merged to remote --> CBDBSCAN explanation
aloosley Aug 28, 2019
9675016
added extra explanation:
aloosley Aug 28, 2019
b53704a
using none instead of nan for unchecked core
Aug 28, 2019
06c5659
Merge branch 'EnsembleLda' of https://github.com/DataReply/gensim int…
Aug 28, 2019
4f3de96
updated docs
aloosley Aug 28, 2019
71c083c
Merge branch 'EnsembleLda' of github.com:DataReply/gensim into Ensemb…
aloosley Aug 28, 2019
2ed2cfe
refactored kind to class, fixed check how to proceed with topic_model…
Aug 28, 2019
f39d09c
Merge branch 'EnsembleLda' of https://github.com/DataReply/gensim int…
Aug 28, 2019
2abce75
reverted change that accidentally broke things
Aug 28, 2019
511eaa5
fixed tests locally
Sep 8, 2019
96d6fbd
fix code style
Sep 8, 2019
819a05f
added _is_easy_valid_cluster
Sep 11, 2019
e3025f6
updated thesis reference
Sep 11, 2019
95e1b79
updated notebook example, typo
Sep 12, 2019
cd7934d
Merge branch 'master' of https://github.com/RaRe-Technologies/gensim …
Sep 14, 2019
3d11649
docstring styles, renaming, cleanup, stuff I need to discuss first
Sep 14, 2019
71b3825
tox
Sep 14, 2019
2d184a4
fixed stuff in CBDBSCAN
Sep 15, 2019
b0c5155
removed unused results column and only CB-Distance to other cores
Sep 15, 2019
70a0660
tox whitespace
Sep 15, 2019
c8b23ab
cleaned obsolete stuff from cbdbscan
Sep 15, 2019
6f98624
idk
Oct 27, 2019
799c112
updated doc-strings to be clearer and better reflect the truth
aloosley Oct 28, 2019
9866ef6
make flake8 happy
mpenkov Nov 9, 2019
ad97ead
fix trailing whitespace
mpenkov Nov 9, 2019
765b912
reverted some changes
Nov 11, 2019
32ff401
comma, newline, comment
Nov 11, 2019
b00e010
whitespace
Nov 11, 2019
c58c30e
citation, reference, authors
Nov 11, 2019
637640c
potential fix for utils saveload when a class is in __dict__
Nov 11, 2019
d7efe3a
commented out eLDA tests, tox8
Nov 11, 2019
644f2e4
saving the topic_model_class using a string instead
Nov 23, 2019
a7bbfc0
reverted utils
Nov 23, 2019
2abdbf7
saving the topic_model_class using a string instead fixes
Nov 23, 2019
c559de6
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
Nov 23, 2019
254ca60
tox
Nov 23, 2019
aec58c9
quotes in logger.error
Nov 23, 2019
f7de190
multiline string
Nov 23, 2019
dfc97a0
python 3.5 format strings
Nov 23, 2019
88b338b
ModuleNotFoundError: No module named 'numpy.random._pickle'
Nov 23, 2019
b96c041
ModuleNotFoundError: No module named 'numpy.random._pickle' x2
Nov 23, 2019
584cd70
fixed inference
Dec 14, 2019
c1cd036
Merge branch 'EnsembleLda' of https://github.com/DataReply/gensim int…
Dec 14, 2019
766f562
removed print asdf
Dec 14, 2019
46b7cf6
added spec for inference
Dec 14, 2019
6689eae
tox
Dec 14, 2019
b1f596d
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
Jan 1, 2020
80f4f04
lazy loading topic_model_class
Jan 23, 2020
b05bf11
tox
Jan 23, 2020
39a9b62
removed debug thing
Jan 23, 2020
09ded13
Documents now compile
aloosley Jan 23, 2020
9685b00
Merge branch 'EnsembleLda' of github.com:DataReply/gensim into Ensemb…
aloosley Jan 23, 2020
4bd7cf7
escape sequence thing indent fix
Jan 23, 2020
49253df
Better document rendering and added opinosiscorpus to apirefs
aloosley Jan 23, 2020
50fa88b
docstring styling on opinosiscorpus.py
Jan 23, 2020
5669f58
citation opinosis
Jan 23, 2020
f63eb03
Merge remote-tracking branch 'remotes/original/develop' into EnsembleLda
Jan 23, 2020
2aabe8f
missing opinosiscorpus.rst file committed
aloosley Jan 24, 2020
f0600dd
Merge branch 'EnsembleLda' of github.com:DataReply/gensim into Ensemb…
aloosley Jan 24, 2020
9c25cf5
p names refactored to be descriptive, now using append for appending …
aloosley Feb 6, 2020
4079c27
Changing to hanging indents where they were not used before
aloosley Feb 6, 2020
f5379ff
Adding :meth: and `` `` styling for RST
aloosley Feb 6, 2020
591cf77
a bunch of reviews
Feb 6, 2020
1fbafcb
merge
Feb 6, 2020
7366495
Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim…
Feb 6, 2020
f1aba3e
* Changed ensemblelda default to use ldamulticore instead of old lda …
aloosley Feb 6, 2020
9833e44
Merge branch 'EnsembleLda' of github.com:DataReply/gensim into Ensemb…
aloosley Feb 6, 2020
a9428de
More docstring polish
aloosley Feb 6, 2020
3bdeaf2
removing some camel-case vars for pep8 compliance.
aloosley Feb 6, 2020
806952e
a bunch of reviews
Feb 6, 2020
e1344bd
merge
Feb 6, 2020
3d24f62
fixed linter
Feb 9, 2020
183 changes: 183 additions & 0 deletions docs/notebooks/ensemble_lda_with_opinosis.ipynb
@@ -0,0 +1,183 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"####\n"
]
}
],
"source": [
"import logging\n",
"from gensim.models import EnsembleLda, LdaMulticore\n",
"from gensim.corpora import OpinosisCorpus\n",
"import os"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Enable the ensemble logger so that it reports what it is currently doing."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"elda_logger = logging.getLogger(EnsembleLda.__module__)\n",
"elda_logger.setLevel(logging.INFO)\n",
"elda_logger.addHandler(logging.StreamHandler())"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def pretty_print_topics():\n",
"    # note that the words are stemmed so they appear chopped off\n",
"    for t in elda.print_topics(num_words=7):\n",
"        print('-', t[1].replace('*', ' ').replace('\"', '').replace(' +', ','), '\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Experiments on the Opinosis Dataset\n",
"\n",
"Opinosis [1] is a small (but redundant) corpus that contains 289 product reviews for 51 products. Since it's so small, the results are rather unstable.\n",
"\n",
"[1] Kavita Ganesan, ChengXiang Zhai, and Jiawei Han, _Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions [online],_ Proceedings of the 23rd International Conference on Computational Linguistics, Association for Computational Linguistics, 2010, pp. 340–348. Available from: https://kavita-ganesan.com/opinosis/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preparing the corpus\n",
"\n",
"First, download the Opinosis dataset. On Linux, for example, this can be done as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!mkdir -p ~/opinosis\n",
"!wget -P ~/opinosis https://github.com/kavgan/opinosis/raw/master/OpinosisDataset1.0_0.zip\n",
"!unzip ~/opinosis/OpinosisDataset1.0_0.zip -d ~/opinosis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"path = os.path.expanduser('~/opinosis/')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The corpus and id2word mapping can be created with the OpinosisCorpus class provided in the package.\n",
"It preprocesses the data using the PorterStemmer and stopwords from the nltk package.\n",
"\n",
"The constructor takes the path to the folder into which the zip file was extracted. That folder contains a 'summaries-gold' subfolder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"opinosis = OpinosisCorpus(path)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**parameters**\n",
"\n",
"**topic_model_class**: 'ldamulticore' is highly recommended for EnsembleLda. **ensemble_workers** and **distance_workers** reduce the time needed to train the models, as does the **masking_method** 'rank'. ldamulticore cannot fully utilize all cores on this small corpus, so **ensemble_workers** can be set to 3 to reach 95-100% CPU usage on my i5 3470.\n",
"\n",
"Since the corpus is so small, a large **num_models** is needed to extract stable topics. The Opinosis corpus contains 51 categories; however, some of them are quite similar. For example, there are 3 categories about the batteries of portable products, and there are multiple categories about cars. So I chose 20 for **num_topics**, which is smaller than the number of categories."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"elda = EnsembleLda(corpus=opinosis.corpus, id2word=opinosis.id2word, num_models=128, num_topics=20,\n",
"                   passes=20, iterations=100, ensemble_workers=3, distance_workers=4,\n",
"                   topic_model_class='ldamulticore', masking_method='rank')\n",
"pretty_print_topics()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The default for **min_samples** would be 64, half of the number of models, and the default for **eps** would be 0.1. Adjust both values until you find a sweet spot that fits your needs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"elda.recluster(min_samples=55, eps=0.14)\n",
"pretty_print_topics()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
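The last notebook cell picks `min_samples=55` and `eps=0.14` by hand. One way to make that search less ad hoc might be a small sweep over candidate values, sketched below. This is only an illustration under stated assumptions: `sweep_recluster` is a hypothetical helper that is not part of this PR, and it assumes the model exposes `recluster(min_samples=..., eps=...)` as used in the notebook plus a `get_topics()` method that returns the current stable topics.

```python
from itertools import product


def sweep_recluster(model, min_samples_values, eps_values):
    """Recluster once per (min_samples, eps) pair and count the stable topics.

    Hypothetical helper: `model` is assumed to offer `recluster(min_samples, eps)`
    and `get_topics()`, the latter returning one entry per stable topic.
    """
    results = []
    for min_samples, eps in product(min_samples_values, eps_values):
        # Reclustering reuses the already trained ensemble, so no retraining happens here.
        model.recluster(min_samples=min_samples, eps=eps)
        results.append((min_samples, eps, len(model.get_topics())))
    return results
```

With the ensemble trained above, a call such as `sweep_recluster(elda, [45, 55, 64], [0.1, 0.12, 0.14])` would return one `(min_samples, eps, number_of_stable_topics)` tuple per combination, which can then be inspected before settling on a final `recluster` call. Note that the model is left in the state of the last combination tried.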
2 changes: 2 additions & 0 deletions docs/src/apiref.rst
@@ -22,13 +22,15 @@ Modules:
corpora/malletcorpus
corpora/mmcorpus
corpora/_mmreader
corpora/opinosiscorpus
corpora/sharded_corpus
corpora/svmlightcorpus
corpora/textcorpus
corpora/ucicorpus
corpora/wikicorpus
models/ldamodel
models/ldamulticore
models/ensemblelda
models/nmf
models/lsimodel
models/ldaseqmodel