Refactor API reference gensim.topic_coherence. Fix #1669 #1714
Merged
+748
−261
Changes from 3 commits
42 commits
29a8a37
Refactored aggregation
CLearERR 56eda23
Micro-Fix for aggregation.py, partially refactored direct_confirmatio…
CLearERR edd53d4
Partially refactored indirect_confirmation_measure
CLearERR cfd6050
Some additions
CLearERR 390b01e
Math attempts
CLearERR 8b1a5ca
add math extension for sphinx
menshikh-iv 8d2c584
Minor refactoring
CLearERR 6eb8335
Some refactoring for probability_estimation
CLearERR 7a47f05
Beta-strings
CLearERR 667cad2
Different additions
CLearERR d41c5a3
Minor changes
CLearERR 180c1c1
text_analysis left
CLearERR e3c1e29
Added example for ContextVectorComputer class
CLearERR da9ca29
probability_estimation 0.9
CLearERR f54fb0c
beta_version
CLearERR 47ee63e
Added some examples for text_analysis
CLearERR 65211f0
text_analysis: corrected example for class UsesDictionary
CLearERR c484962
Final additions for text_analysis.py
CLearERR 71bb2bf
Merge branch 'develop' into fix-1669
menshikh-iv d9237ea
fix cross-reference problem
menshikh-iv 275edd0
fix pep8
menshikh-iv 94bde33
fix aggregation
menshikh-iv 782d5cf
fix direct_confirmation_measure
menshikh-iv 81732ef
fix types in direct_confirmation_measure
menshikh-iv 3c7b401
partial fix indirect_confirmation_measure
menshikh-iv 206784d
HotFix for probability_estimation and segmentation
CLearERR 406ab5c
Merge branch 'fix-1669' of https://github.com/CLearERR/gensim into fi…
CLearERR 67962be
Refactoring for probability_estimation
CLearERR 74c5c86
Changes for indirect_confirmation_measure
CLearERR ef058df
Fixed segmentation, partly fixed text_analysis
CLearERR 0b06468
Add Notes for text_analysis
CLearERR e3779d4
fix di/ind
menshikh-iv 482377b
fix doc examples in probability_estimation
menshikh-iv acdebb1
fix probability_estimation
menshikh-iv 8a07dee
fix segmentation
menshikh-iv 63c35c2
fix docstring in probability_estimation
menshikh-iv 4b63f6c
partial fix test_analysis
menshikh-iv 540021c
add latex stuff for docs build
menshikh-iv 790e07d
merge upstream
menshikh-iv 965587b
doc fix[1]
menshikh-iv f8f25cb
doc fix[2]
menshikh-iv f42ad8f
remove apt install from travis (now doc build in circle)
menshikh-iv
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -4,10 +4,8 @@ | |
# Copyright (C) 2013 Radim Rehurek <[email protected]> | ||
# Licensed under the GNU LGPL v2.1 - http://www.gnu.org/licenses/lgpl.html | ||
|
||
""" | ||
This module contains functions to perform aggregation on a list of values | ||
obtained from the confirmation measure. | ||
""" | ||
"""This module contains functions to perform aggregation on a list of values | ||
obtained from the confirmation measure.""" | ||
|
||
import logging | ||
import numpy as np | ||
|
@@ -17,13 +15,24 @@ | |
|
||
def arithmetic_mean(confirmed_measures): | ||
""" | ||
This functoin performs the arithmetic mean aggregation on the output obtained from | ||
Perform the arithmetic mean aggregation on the output obtained from | ||
the confirmation measure module. | ||
|
||
Args: | ||
confirmed_measures : list of calculated confirmation measure on each set in the segmented topics. | ||
Parameters | ||
---------- | ||
confirmed_measures : list | ||
List of calculated confirmation measure on each set in the segmented topics. | ||
|
||
Returns | ||
------- | ||
numpy.float | ||
Arithmetic mean of all the values contained in confirmation measures. | ||
|
||
Examples | ||
-------- | ||
>>> from gensim.topic_coherence.aggregation import arithmetic_mean | ||
>>> arithmetic_mean([1.1, 2.2, 3.3, 4.4]) | ||
2.75 | ||
|
||
Returns: | ||
mean : Arithmetic mean of all the values contained in confirmation measures. | ||
""" | ||
return np.mean(confirmed_measures) |
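The docstring example in this hunk can be checked by hand with a minimal sketch that uses only the standard library. This is a stand-in, not the code in the PR: the function in the diff delegates to `np.mean`, while the name `arithmetic_mean_sketch` and the explicit empty-input check are assumptions made for illustration.

```python
import math


def arithmetic_mean_sketch(confirmed_measures):
    """Average the confirmation-measure values (hypothetical stand-in for np.mean)."""
    values = list(confirmed_measures)
    if not values:
        # Assumption for this sketch: reject empty input instead of returning NaN.
        raise ValueError("confirmed_measures must not be empty")
    # math.fsum keeps the floating-point summation error small.
    return math.fsum(values) / len(values)


# Same input as the docstring example in the diff.
mean_value = arithmetic_mean_sketch([1.1, 2.2, 3.3, 4.4])
print(round(mean_value, 2))
```

The input values sum to 11.0 over four elements, which agrees with the `2.75` shown in the docstring.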
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,22 +19,43 @@ | |
|
||
def log_conditional_probability(segmented_topics, accumulator, with_std=False, with_support=False): | ||
""" | ||
This function calculates the log-conditional-probability measure | ||
Calculate the log-conditional-probability measure | ||
which is used by coherence measures such as U_mass. | ||
This is defined as: m_lc(S_i) = log[(P(W', W*) + e) / P(W*)] | ||
|
||
Args: | ||
segmented_topics (list): Output from the segmentation module of the segmented | ||
topics. Is a list of list of tuples. | ||
accumulator: word occurrence accumulator from probability_estimation. | ||
with_std (bool): True to also include standard deviation across topic segment | ||
sets in addition to the mean coherence for each topic; default is False. | ||
with_support (bool): True to also include support across topic segments. The | ||
support is defined as the number of pairwise similarity comparisons were | ||
used to compute the overall topic coherence. | ||
|
||
Returns: | ||
Parameters | ||
---------- | ||
segmented_topics : list | ||
Output from the segmentation module of the segmented topics. Is a list of list of tuples. | ||
accumulator : list | ||
Word occurrence accumulator from probability_estimation. | ||
with_std : bool | ||
True to also include standard deviation across topic segment | ||
sets in addition to the mean coherence for each topic; default is False. | ||
with_support : bool | ||
True to also include support across topic segments. The | ||
support is defined as the number of pairwise similarity comparisons that were | ||
used to compute the overall topic coherence. | ||
|
||
Returns | ||
------- | ||
list : of log conditional probability measure for each topic. | ||
|
||
Examples | ||
-------- | ||
>>> from gensim.topic_coherence import direct_confirmation_measure,text_analysis | ||
>>> from collections import namedtuple | ||
>>> id2token = {1: 'test', 2: 'doc'} | ||
>>> token2id = {v: k for k, v in id2token.items()} | ||
>>> dictionary = namedtuple('Dictionary', 'token2id, id2token')(token2id, id2token) | ||
>>> segmentation = [[(1, 2)]] | ||
>>> num_docs = 5 | ||
>>> accumulator = text_analysis.InvertedIndexAccumulator({1, 2}, dictionary) | ||
>>> accumulator._inverted_index = {0: {2, 3, 4}, 1: {3, 5}} | ||
>>> accumulator._num_docs = num_docs | ||
>>> direct_confirmation_measure.log_conditional_probability(segmentation, accumulator)[0] | ||
Answer should be ~ ln(1 / 2) = -0.693147181 | ||
|
||
""" | ||
topic_coherences = [] | ||
num_docs = float(accumulator.num_docs) | ||
|
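The expected value in the docstring example, ln(1 / 2), can be reproduced with a short sketch of the formula m_lc(S_i) = log[(P(W', W*) + e) / P(W*)]. It assumes that word ids 1 and 2 correspond to the two entries of the inverted index in the example, and that `e` is a tiny smoothing constant (the value `1e-12` is an assumption). The function name and the plain dict of document-id sets are hypothetical; the PR itself works with an accumulator object.

```python
import math

EPSILON = 1e-12  # assumed value for the small constant "e" in the formula


def log_conditional_probability_sketch(doc_ids_by_word, w_prime, w_star, num_docs):
    """Compute log[(P(W', W*) + e) / P(W*)] for one (W', W*) pair."""
    # Documents that contain both words.
    co_occur_count = len(doc_ids_by_word[w_prime] & doc_ids_by_word[w_star])
    # Documents that contain W*; this sketch does not guard against a zero count.
    w_star_count = len(doc_ids_by_word[w_star])
    p_joint = co_occur_count / num_docs
    p_w_star = w_star_count / num_docs
    return math.log((p_joint + EPSILON) / p_w_star)


# Word 1 occurs in documents {2, 3, 4}, word 2 in documents {3, 5}; 5 documents in total.
doc_ids_by_word = {1: {2, 3, 4}, 2: {3, 5}}
m_lc = log_conditional_probability_sketch(doc_ids_by_word, w_prime=1, w_star=2, num_docs=5)
print(m_lc)
```

Only document 3 contains both words, so P(W', W*) = 1 / 5 and P(W*) = 2 / 5, and the ratio is 1 / 2 as the docstring states.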
@@ -59,14 +80,20 @@ def aggregate_segment_sims(segment_sims, with_std, with_support): | |
"""Compute various statistics from the segment similarities generated via | ||
set pairwise comparisons of top-N word lists for a single topic. | ||
|
||
Args: | ||
segment_sims (iterable): floating point similarity values to aggregate. | ||
with_std (bool): Set to True to include standard deviation. | ||
with_support (bool): Set to True to include number of elements in `segment_sims` | ||
as a statistic in the results returned. | ||
Parameters | ||
---------- | ||
segment_sims : iterable | ||
floating point similarity values to aggregate. | ||
with_std : bool | ||
Set to True to include standard deviation. | ||
with_support : bool | ||
Set to True to include number of elements in `segment_sims` as a statistic in the results returned. | ||
|
||
Returns | ||
------- | ||
tuple | ||
tuple with (mean[, std[, support]]) | ||
|
||
Returns: | ||
tuple: with (mean[, std[, support]]) | ||
""" | ||
mean = np.mean(segment_sims) | ||
stats = [mean] | ||
|
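The `(mean[, std[, support]])` return shape described in this docstring can be illustrated with a small standard-library sketch. It is not the PR's implementation: the hunk only shows that the mean comes from `np.mean`, so the use of the population standard deviation and the choice to always return a tuple are assumptions, and `aggregate_segment_sims_sketch` is a hypothetical name.

```python
import statistics


def aggregate_segment_sims_sketch(segment_sims, with_std=False, with_support=False):
    """Return (mean[, std[, support]]) for the given similarity values."""
    sims = list(segment_sims)
    stats = [sum(sims) / len(sims)]
    if with_std:
        # Assumption: population standard deviation (no Bessel correction).
        stats.append(statistics.pstdev(sims))
    if with_support:
        # Support is the number of similarity values that were aggregated.
        stats.append(len(sims))
    return tuple(stats)


sims = [0.2, 0.4, 0.6]
mean_only = aggregate_segment_sims_sketch(sims)
full = aggregate_segment_sims_sketch(sims, with_std=True, with_support=True)
print(mean_only, full)
```

Each flag appends one more statistic, so the tuple grows from one to three elements in the order the docstring gives.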
@@ -83,27 +110,49 @@ def log_ratio_measure( | |
""" | ||
If normalize=False: | ||
Popularly known as PMI. | ||
This function calculates the log-ratio-measure which is used by | ||
Calculate the log-ratio-measure which is used by | ||
coherence measures such as c_v. | ||
This is defined as: m_lr(S_i) = log[(P(W', W*) + e) / (P(W') * P(W*))] | ||
|
||
If normalize=True: | ||
This function calculates the normalized-log-ratio-measure, popularly knowns as | ||
Calculate the normalized-log-ratio-measure, popularly known as | ||
NPMI which is used by coherence measures such as c_v. | ||
This is defined as: m_nlr(S_i) = m_lr(S_i) / -log[P(W', W*) + e] | ||
|
||
Args: | ||
segmented_topics (list): Output from the segmentation module of the segmented | ||
topics. Is a list of list of tuples. | ||
accumulator: word occurrence accumulator from probability_estimation. | ||
with_std (bool): True to also include standard deviation across topic segment | ||
sets in addition to the mean coherence for each topic; default is False. | ||
with_support (bool): True to also include support across topic segments. The | ||
support is defined as the number of pairwise similarity comparisons were | ||
used to compute the overall topic coherence. | ||
|
||
Returns: | ||
list : of log ratio measure for each topic. | ||
Parameters | ||
---------- | ||
segmented_topics : list of (list of tuples) | ||
Output from the segmentation module of the segmented topics. | ||
accumulator: list | ||
Review comment: list of
||
word occurrence accumulator from probability_estimation. | ||
with_std : bool | ||
True to also include standard deviation across topic segment | ||
sets in addition to the mean coherence for each topic; default is False. | ||
with_support : bool | ||
True to also include support across topic segments. The | ||
support is defined as the number of pairwise similarity comparisons that were | ||
used to compute the overall topic coherence. | ||
|
||
Returns | ||
------- | ||
list | ||
List of log ratio measure for each topic. | ||
|
||
Examples | ||
-------- | ||
>>> from gensim.topic_coherence import direct_confirmation_measure,text_analysis | ||
>>> from collections import namedtuple | ||
>>> id2token = {1: 'test', 2: 'doc'} | ||
>>> token2id = {v: k for k, v in id2token.items()} | ||
>>> dictionary = namedtuple('Dictionary', 'token2id, id2token')(token2id, id2token) | ||
>>> segmentation = [[(1, 2)]] | ||
>>> num_docs = 5 | ||
>>> accumulator = text_analysis.InvertedIndexAccumulator({1, 2}, dictionary) | ||
>>> accumulator._inverted_index = {0: {2, 3, 4}, 1: {3, 5}} | ||
>>> accumulator._num_docs = num_docs | ||
>>> direct_confirmation_measure.log_ratio_measure(segmentation, accumulator)[0] | ||
Answer should be ~ ln{(1 / 5) / [(3 / 5) * (2 / 5)]} = -0.182321557 | ||
|
||
""" | ||
topic_coherences = [] | ||
num_docs = float(accumulator.num_docs) | ||
|
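Both branches of the `log_ratio_measure` docstring can be sketched from the two formulas it gives, m_lr(S_i) = log[(P(W', W*) + e) / (P(W') * P(W*))] and m_nlr(S_i) = m_lr(S_i) / -log[P(W', W*) + e]. The sketch reuses the data from the docstring example and carries the same assumptions as before: word ids 1 and 2 map to the two inverted-index entries, `e` is taken to be `1e-12`, and the function name and dict-of-sets input are hypothetical rather than the PR's accumulator API.

```python
import math

EPSILON = 1e-12  # assumed value for the small constant "e" in the formulas


def log_ratio_measure_sketch(doc_ids_by_word, w_prime, w_star, num_docs, normalize=False):
    """Compute PMI (normalize=False) or NPMI (normalize=True) for one (W', W*) pair."""
    p_w_prime = len(doc_ids_by_word[w_prime]) / num_docs
    p_w_star = len(doc_ids_by_word[w_star]) / num_docs
    p_joint = len(doc_ids_by_word[w_prime] & doc_ids_by_word[w_star]) / num_docs
    m_lr = math.log((p_joint + EPSILON) / (p_w_prime * p_w_star))
    if normalize:
        # Divide by -log of the smoothed joint probability; the case where
        # that logarithm is zero is not handled in this sketch.
        return m_lr / -math.log(p_joint + EPSILON)
    return m_lr


doc_ids_by_word = {1: {2, 3, 4}, 2: {3, 5}}
pmi = log_ratio_measure_sketch(doc_ids_by_word, 1, 2, num_docs=5)
npmi = log_ratio_measure_sketch(doc_ids_by_word, 1, 2, num_docs=5, normalize=True)
print(pmi, npmi)
```

With P(W') = 3 / 5, P(W*) = 2 / 5 and P(W', W*) = 1 / 5, the unnormalized value is ln(5 / 6), matching the roughly -0.182 given in the docstring.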
Use math here http://www.sphinx-doc.org/en/stable/ext/math.html
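One way this suggestion could look, as a hedged sketch only: the three formulas quoted in the docstrings of this diff, written with the Sphinx `math` directive. Rendering `e` as `\epsilon` and the subscript names `lc`, `lr` and `nlr` are notational choices made here, not something the PR fixes.

```rst
.. math::

    m_{lc}(S_i) = \log \frac{P(W', W^{*}) + \epsilon}{P(W^{*})}

.. math::

    m_{lr}(S_i) = \log \frac{P(W', W^{*}) + \epsilon}{P(W') \cdot P(W^{*})}

.. math::

    m_{nlr}(S_i) = \frac{m_{lr}(S_i)}{-\log\left(P(W', W^{*}) + \epsilon\right)}
```

Inside a Python docstring the backslashes would need a raw string (`r"""..."""`) or escaping, since sequences such as `\f` are otherwise interpreted as escape characters.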