Refactor API reference gensim.topic_coherence. Fix #1669 #1714

Merged
merged 42 commits on Jan 10, 2018
Commits (42) — showing changes from 1 commit
29a8a37
Refactored aggregation
CLearERR Nov 13, 2017
56eda23
Micro-Fix for aggregation.py, partially refactored direct_confirmatio…
CLearERR Nov 14, 2017
edd53d4
Partially refactored indirect_confirmation_measure
CLearERR Nov 15, 2017
cfd6050
Some additions
CLearERR Nov 16, 2017
390b01e
Math attempts
CLearERR Nov 19, 2017
8b1a5ca
add math extension for sphinx
menshikh-iv Nov 20, 2017
8d2c584
Minor refactoring
CLearERR Nov 21, 2017
6eb8335
Some refactoring for probability_estimation
CLearERR Nov 22, 2017
7a47f05
Beta-strings
CLearERR Nov 23, 2017
667cad2
Different additions
CLearERR Nov 25, 2017
d41c5a3
Minor changes
CLearERR Nov 26, 2017
180c1c1
text_analysis left
CLearERR Nov 27, 2017
e3c1e29
Added example for ContextVectorComputer class
CLearERR Nov 28, 2017
da9ca29
probability_estimation 0.9
CLearERR Nov 29, 2017
f54fb0c
beta_version
CLearERR Nov 30, 2017
47ee63e
Added some examples for text_analysis
CLearERR Dec 3, 2017
65211f0
text_analysis: corrected example for class UsesDictionary
CLearERR Dec 4, 2017
c484962
Final additions for text_analysis.py
CLearERR Dec 7, 2017
71bb2bf
Merge branch 'develop' into fix-1669
menshikh-iv Dec 11, 2017
d9237ea
fix cross-reference problem
menshikh-iv Dec 11, 2017
275edd0
fix pep8
menshikh-iv Dec 11, 2017
94bde33
fix aggregation
menshikh-iv Dec 11, 2017
782d5cf
fix direct_confirmation_measure
menshikh-iv Dec 11, 2017
81732ef
fix types in direct_confirmation_measure
menshikh-iv Dec 11, 2017
3c7b401
partial fix indirect_confirmation_measure
menshikh-iv Dec 11, 2017
206784d
HotFix for probability_estimation and segmentation
CLearERR Dec 12, 2017
406ab5c
Merge branch 'fix-1669' of https://github.com/CLearERR/gensim into fi…
CLearERR Dec 12, 2017
67962be
Refactoring for probability_estimation
CLearERR Dec 12, 2017
74c5c86
Changes for indirect_confirmation_measure
CLearERR Dec 14, 2017
ef058df
Fixed segmentation, partly fixed text_analysis
CLearERR Dec 18, 2017
0b06468
Add Notes for text_analysis
CLearERR Dec 18, 2017
e3779d4
fix di/ind
menshikh-iv Dec 19, 2017
482377b
fix doc examples in probability_estimation
menshikh-iv Dec 19, 2017
acdebb1
fix probability_estimation
menshikh-iv Dec 20, 2017
8a07dee
fix segmentation
menshikh-iv Dec 20, 2017
63c35c2
fix docstring in probability_estimation
menshikh-iv Dec 20, 2017
4b63f6c
partial fix test_analysis
menshikh-iv Dec 20, 2017
540021c
add latex stuff for docs build
menshikh-iv Dec 20, 2017
790e07d
merge upstream
menshikh-iv Jan 10, 2018
965587b
doc fix[1]
menshikh-iv Jan 10, 2018
f8f25cb
doc fix[2]
menshikh-iv Jan 10, 2018
f42ad8f
remove apt install from travis (now doc build in circle)
menshikh-iv Jan 10, 2018
Micro-Fix for aggregation.py, partially refactored direct_confirmation.py
CLearERR committed Nov 14, 2017

commit 56eda2314678d83b336812cfb5e37b30d0be7d52
2 changes: 1 addition & 1 deletion gensim/topic_coherence/aggregation.py
@@ -25,7 +25,7 @@ def arithmetic_mean(confirmed_measures):

Returns
-------
float
numpy.float
Arithmetic mean of all the values contained in confirmation measures.

Examples
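The one-line change to `arithmetic_mean` above tightens the documented return type: `np.mean` returns a NumPy scalar, not a built-in `float`. A minimal standalone sketch (plain NumPy, no gensim needed; the input values are made up) of what the docstring now describes:

```python
import numpy as np

# arithmetic_mean() in aggregation.py is essentially np.mean over the
# confirmation values; note the result is a NumPy scalar (np.float64).
confirmed_measures = [1.1, 2.2, 3.3, 4.4]
result = np.mean(confirmed_measures)

print(result)                    # 2.75
print(type(result).__name__)     # float64 -- a NumPy scalar, which subclasses Python float
```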
115 changes: 82 additions & 33 deletions gensim/topic_coherence/direct_confirmation_measure.py
@@ -19,22 +19,43 @@

def log_conditional_probability(segmented_topics, accumulator, with_std=False, with_support=False):
"""
This function calculates the log-conditional-probability measure
Calculate the log-conditional-probability measure
which is used by coherence measures such as U_mass.
This is defined as: m_lc(S_i) = log[(P(W', W*) + e) / P(W*)]

Args:
segmented_topics (list): Output from the segmentation module of the segmented
topics. Is a list of list of tuples.
accumulator: word occurrence accumulator from probability_estimation.
with_std (bool): True to also include standard deviation across topic segment
sets in addition to the mean coherence for each topic; default is False.
with_support (bool): True to also include support across topic segments. The
support is defined as the number of pairwise similarity comparisons were
used to compute the overall topic coherence.

Returns:
Parameters
----------
segmented_topics : list
Output from the segmentation module of the segmented topics: a list of lists of tuples.
accumulator : list
Word occurrence accumulator from probability_estimation.
with_std : bool
True to also include standard deviation across topic segment
sets in addition to the mean coherence for each topic; default is False.
with_support : bool
True to also include support across topic segments. The
support is defined as the number of pairwise similarity comparisons
used to compute the overall topic coherence.

Returns
-------
list
    Log conditional probability measure for each topic.

Examples
--------
>>> from gensim.topic_coherence import direct_confirmation_measure, text_analysis
>>> from collections import namedtuple
>>> id2token = {1: 'test', 2: 'doc'}
>>> token2id = {v: k for k, v in id2token.items()}
>>> dictionary = namedtuple('Dictionary', 'token2id, id2token')(token2id, id2token)
>>> segmentation = [[(1, 2)]]
>>> num_docs = 5
>>> accumulator = text_analysis.InvertedIndexAccumulator({1, 2}, dictionary)
>>> accumulator._inverted_index = {0: {2, 3, 4}, 1: {3, 5}}
>>> accumulator._num_docs = num_docs
>>> direct_confirmation_measure.log_conditional_probability(segmentation, accumulator)[0]
Answer should be ~ ln(1 / 2) = -0.693147181

"""
topic_coherences = []
num_docs = float(accumulator.num_docs)
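The doctest above builds an accumulator by hand; the arithmetic it exercises can be reproduced without gensim. Below is a minimal sketch (names and the epsilon value are assumptions, not gensim's exact code) of the m_lc(S_i) = log[(P(W', W*) + e) / P(W*)] computation on the same counts:

```python
import math

EPSILON = 1e-12  # small constant guarding against log(0); exact value is an assumption

# Hypothetical inverted index matching the docstring example:
# token -> set of ids of documents containing it, out of 5 documents total.
inverted_index = {'test': {2, 3, 4}, 'doc': {3, 5}}
num_docs = 5.0

def log_conditional_probability(w_prime, w_star):
    """m_lc(S_i) = log[(P(W', W*) + e) / P(W*)]."""
    joint = len(inverted_index[w_prime] & inverted_index[w_star]) / num_docs  # P(W', W*)
    marginal = len(inverted_index[w_star]) / num_docs                         # P(W*)
    return math.log((joint + EPSILON) / marginal)

# One co-occurring document out of the two containing 'doc':
print(log_conditional_probability('test', 'doc'))  # ~ ln(1/2) = -0.693147181
```

This mirrors the expected value noted in the doctest: (1/5) / (2/5) = 1/2, whose log is about -0.693.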
@@ -59,14 +80,20 @@ def aggregate_segment_sims(segment_sims, with_std, with_support):
"""Compute various statistics from the segment similarities generated via
set pairwise comparisons of top-N word lists for a single topic.

Args:
segment_sims (iterable): floating point similarity values to aggregate.
with_std (bool): Set to True to include standard deviation.
with_support (bool): Set to True to include number of elements in `segment_sims`
as a statistic in the results returned.
Parameters
----------
segment_sims : iterable
floating point similarity values to aggregate.
with_std : bool
Set to True to include standard deviation.
with_support : bool
Set to True to include number of elements in `segment_sims` as a statistic in the results returned.

Returns
-------
tuple
tuple with (mean[, std[, support]])

Returns:
tuple: with (mean[, std[, support]])
"""
mean = np.mean(segment_sims)
stats = [mean]
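A runnable sketch of the aggregation the docstring above describes — mean, optionally followed by standard deviation and support — standalone and mirroring the documented (mean[, std[, support]]) shape, not necessarily the library's exact implementation:

```python
import numpy as np

def aggregate_segment_sims(segment_sims, with_std=False, with_support=False):
    # Mean similarity, optionally followed by the standard deviation and
    # the support (number of aggregated values).
    sims = np.asarray(list(segment_sims), dtype=float)
    stats = [sims.mean()]
    if with_std:
        stats.append(sims.std())
    if with_support:
        stats.append(len(sims))
    # A bare mean comes back as a scalar; anything more comes back as a tuple.
    return stats[0] if len(stats) == 1 else tuple(stats)

print(aggregate_segment_sims([0.2, 0.4, 0.6]))                                  # scalar mean
print(aggregate_segment_sims([0.2, 0.4, 0.6], with_std=True, with_support=True))  # (mean, std, support)
```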
@@ -83,27 +110,49 @@ def log_ratio_measure(
"""
If normalize=False:
Popularly known as PMI.
This function calculates the log-ratio-measure which is used by
Calculate the log-ratio-measure which is used by
coherence measures such as c_v.
This is defined as: m_lr(S_i) = log[(P(W', W*) + e) / (P(W') * P(W*))]

If normalize=True:
This function calculates the normalized-log-ratio-measure, popularly known as
Calculate the normalized-log-ratio-measure, popularly known as
NPMI which is used by coherence measures such as c_v.
This is defined as: m_nlr(S_i) = m_lr(S_i) / -log[P(W', W*) + e]

Args:
segmented_topics (list): Output from the segmentation module of the segmented
topics. Is a list of list of tuples.
accumulator: word occurrence accumulator from probability_estimation.
with_std (bool): True to also include standard deviation across topic segment
sets in addition to the mean coherence for each topic; default is False.
with_support (bool): True to also include support across topic segments. The
support is defined as the number of pairwise similarity comparisons were
used to compute the overall topic coherence.

Returns:
list : of log ratio measure for each topic.
Parameters
----------
segmented_topics : list
Output from the segmentation module of the segmented topics: a list of lists of tuples.
accumulator: list
Contributor review comment: list of ?
word occurrence accumulator from probability_estimation.
with_std : bool
True to also include standard deviation across topic segment
sets in addition to the mean coherence for each topic; default is False.
with_support : bool
True to also include support across topic segments. The
support is defined as the number of pairwise similarity comparisons
used to compute the overall topic coherence.

Returns
-------
list
List of log ratio measures, one for each topic.

Examples
--------
>>> from gensim.topic_coherence import direct_confirmation_measure, text_analysis
>>> from collections import namedtuple
>>> id2token = {1: 'test', 2: 'doc'}
>>> token2id = {v: k for k, v in id2token.items()}
>>> dictionary = namedtuple('Dictionary', 'token2id, id2token')(token2id, id2token)
>>> segmentation = [[(1, 2)]]
>>> num_docs = 5
>>> accumulator = text_analysis.InvertedIndexAccumulator({1, 2}, dictionary)
>>> accumulator._inverted_index = {0: {2, 3, 4}, 1: {3, 5}}
>>> accumulator._num_docs = num_docs
>>> direct_confirmation_measure.log_ratio_measure(segmentation, accumulator)[0]
Answer should be ~ ln{(1 / 5) / [(3 / 5) * (2 / 5)]} = -0.182321557

"""
topic_coherences = []
num_docs = float(accumulator.num_docs)
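For comparison with the doctest above, here is a standalone sketch (hypothetical names, assumed epsilon; not gensim's exact code) of the log-ratio measure on the same counts, in both its PMI and normalized (NPMI) variants:

```python
import math

EPSILON = 1e-12  # guard against log(0); exact value is an assumption

# Hypothetical inverted index from the docstring example: token -> doc ids, 5 docs total.
inverted_index = {'test': {2, 3, 4}, 'doc': {3, 5}}
num_docs = 5.0

def log_ratio_measure(w_prime, w_star, normalize=False):
    """PMI:  m_lr(S_i)  = log[(P(W', W*) + e) / (P(W') * P(W*))]
    NPMI: m_nlr(S_i) = m_lr(S_i) / -log[P(W', W*) + e]"""
    joint = len(inverted_index[w_prime] & inverted_index[w_star]) / num_docs
    p_prime = len(inverted_index[w_prime]) / num_docs
    p_star = len(inverted_index[w_star]) / num_docs
    pmi = math.log((joint + EPSILON) / (p_prime * p_star))
    if normalize:
        return pmi / -math.log(joint + EPSILON)
    return pmi

print(log_ratio_measure('test', 'doc'))                  # ~ ln[(1/5)/((3/5)*(2/5))] = -0.182321557
print(log_ratio_measure('test', 'doc', normalize=True))  # NPMI, bounded in [-1, 1]
```

Normalization divides the PMI by -log of the joint probability, which is what bounds NPMI to [-1, 1].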