1533 fix and 1464 1423 comments #1573

Merged
Changes from 7 commits
124 changes: 78 additions & 46 deletions gensim/models/phrases.py
@@ -64,8 +64,8 @@
import warnings
from collections import defaultdict
import itertools as it
from functools import partial
from math import log
from inspect import getargspec
Owner:

Import not used?

Contributor Author:

Used in line 188 (in the commit your comments are on) to check for the proper parameters in the pluggable scoring function.

Owner (@piskvorky, Sep 8, 2017):

Thanks, I see it now. What is that check for though? Python is duck-typed by convention, so "type checks" are best postponed until truly needed (something breaks).

What is the rationale for this pre-emptive type check?

Contributor Author:

Mostly to spare users the frustration that would result from improperly specifying a scoring function when initializing the Phrases object. I know Python will do the type checking when the scoring function is called, but that won't happen until export_phrases or __getitem__ is called. The "normal" workflow for the Phrases object is to just specify sentences on load, or to use add_vocab; only after that does the scoring function get called.

I could easily see a user specifying a bad scoring method and then building the vocab dictionary from their large corpus, only discovering after significant time spent extracting vocab that something is wrong with how they specified scoring. At that point they could manually specify a correct scoring function, but that requires setting it directly. Users also wouldn't have an easy bailout in the form of one of the scorer string settings, since those are only checked when the Phrases object is created; they would have to figure out how to specify the built-in scorers themselves, which would mean opening up the code. That seems user-unfriendly; I feel it is friendlier to just do the type checking on initialization, even if it is less Pythonic.

This could be fixed with a set_scorer method that takes the string or function input, but that seems a bit more awkward than just doing this type check.

There's also an issue with wanting to raise an informative exception when the scoring function is called in __getitem__ or export_phrases and the types don't match, but that means adding a try/except into the main scoring loop, which seems awkward as well. I think it's better to do that try/except once, when the object is initialized.

But I defer to your judgement on this--what do you think is best?
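The early-validation approach under discussion can be sketched standalone (hypothetical helper name `check_scorer`; the PR itself inlines this check in `Phrases.__init__`):

```python
import inspect

# Parameter names every pluggable scorer must accept, per this PR.
SCORING_PARAMETERS = [
    'worda_count', 'wordb_count', 'bigram_count',
    'len_vocab', 'min_count', 'corpus_word_count',
]


def check_scorer(scoring):
    """Fail fast if `scoring` is not a callable with the expected signature."""
    if not callable(scoring):
        raise ValueError('scoring must be callable or a recognized string')
    # inspect.getargspec was removed in Python 3.11; prefer getfullargspec.
    try:
        arg_names = inspect.getfullargspec(scoring).args
    except AttributeError:  # Python 2 fallback
        arg_names = inspect.getargspec(scoring)[0]
    missing = [p for p in SCORING_PARAMETERS if p not in arg_names]
    if missing:
        raise ValueError('scoring function missing expected parameters: %s' % missing)


def dumb_scorer(worda_count, wordb_count, bigram_count,
                len_vocab, min_count, corpus_word_count):
    return 1


check_scorer(dumb_scorer)  # accepted at "init" time


def bad_scorer(worda_count, wordb_count):
    return 0


try:
    check_scorer(bad_scorer)
except ValueError as err:
    print('rejected early: %s' % err)
```

The point of the check is exactly this last case: the bad scorer is rejected before any vocabulary has been built, not hours later when scoring first runs.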

Owner:

Thanks, I see your argument (that checking early is a little more convenient).

I'm not sure if it's worth it, but don't care much either way. I'll defer to @menshikh-iv :)


from six import iteritems, string_types, next

@@ -137,18 +137,31 @@ def __init__(self, sentences=None, min_count=5, threshold=10.0,
should be a byte string (e.g. b'_').

`scoring` specifies how potential phrases are scored for comparison to the `threshold`
setting. two settings are available:
'default': from "Efficient Estimaton of Word Representations in Vector Space" by
Mikolov, et. al.:
(count(worda followed by wordb) - min_count) * N /
(count(worda) * count(wordb)) > threshold`, where `N` is the total vocabulary size.
'npmi': normalized pointwise mutual information, from "Normalized (Pointwise) Mutual
Information in Colocation Extraction" by Gerlof Bouma:
ln(prop(worda followed by wordb) / (prop(worda)*prop(wordb))) /
- ln(prop(worda followed by wordb)
where prop(n) is the count of n / the count of everything in the entire corpus
'npmi' is more robust when dealing with common words that form part of common bigrams, and
setting. `scoring` can be set either with a string that refers to a built-in scoring function,
or with a custom function with the expected parameter names. Two built-in scoring functions are available
by setting `scoring` to a string:
'default': from "Efficient Estimation of Word Representations in Vector Space" by
Mikolov et al.:
(count(worda followed by wordb) - min_count) * N /
(count(worda) * count(wordb)) > `threshold`, where `N` is the total vocabulary size.
'npmi': normalized pointwise mutual information, from "Normalized (Pointwise) Mutual
Information in Collocation Extraction" by Gerlof Bouma:
ln(prop(worda followed by wordb) / (prop(worda) * prop(wordb))) /
-ln(prop(worda followed by wordb)),
where prop(n) is the count of n divided by the count of everything in the entire corpus.
'npmi' is more robust when dealing with common words that form part of common bigrams, and
ranges from -1 to 1, but is slower to calculate than the default.
To use a custom scoring function, create a function with the following parameters and set the `scoring`
parameter to that function. Your function must accept all of these parameters, even if it does not
use them all.
worda_count: number of occurrences in `sentences` of the first token in the phrase being scored
wordb_count: number of occurrences in `sentences` of the second token in the phrase being scored
bigram_count: number of occurrences in `sentences` of the phrase being scored
len_vocab: the number of unique tokens in `sentences`
min_count: the `min_count` setting of the Phrases class
corpus_word_count: the total number of (non-unique) tokens in `sentences`
A scoring function missing any of these parameters (even if the parameters are not used) will
raise a ValueError on initialization of the Phrases class.

"""
if min_count <= 0:
@@ -159,8 +172,23 @@ def __init__(self, sentences=None, min_count=5, threshold=10.0,
if scoring == 'npmi' and (threshold < -1 or threshold > 1):
raise ValueError("threshold should be between -1 and 1 for npmi scoring")

if not (scoring == 'default' or scoring == 'npmi'):
raise ValueError('unknown scoring function "' + scoring + '" specified')
# set scoring based on string
# intentionally override the value of the scoring parameter rather than set self.scoring here,
# to still run the check of scoring function parameters in the next code block
if type(scoring) is str:
Owner:

`if isinstance(scoring, basestring)` is more standard and future-proof.
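A version-agnostic sketch of the suggested check, using only the stdlib (the codebase would use `six.string_types`, which is `(str,)` on Python 3 and `(basestring,)` on Python 2):

```python
import sys

# six.string_types equivalent without the dependency; `basestring` only
# exists on Python 2, hence the version guard.
if sys.version_info[0] >= 3:
    string_types = (str,)
else:
    string_types = (basestring,)  # noqa: F821


def is_string(obj):
    # isinstance also accepts str subclasses and, on Python 2, unicode
    # literals -- both of which `type(obj) is str` would wrongly reject.
    return isinstance(obj, string_types)


print(is_string('npmi'))   # True
print(is_string(None))     # False
```

This is why the review prefers `isinstance` over `type(scoring) is str`: on Python 2, a user passing `u'npmi'` would silently fall through the `type` check into the callable branch.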

if scoring == 'default':
scoring = original_scorer
elif scoring == 'npmi':
scoring = npmi_scorer
else:
raise ValueError('unknown scoring method string %s specified' % (scoring))

scoring_parameters = ['worda_count', 'wordb_count', 'bigram_count', 'len_vocab', 'min_count', 'corpus_word_count']
if callable(scoring):
if all(parameter in getargspec(scoring)[0] for parameter in scoring_parameters):
self.scoring = scoring
else:
raise ValueError('scoring function missing expected parameters')

self.min_count = min_count
self.threshold = threshold
@@ -169,7 +197,6 @@ def __init__(self, sentences=None, min_count=5, threshold=10.0,
self.min_reduce = 1 # ignore any tokens with count smaller than this
self.delimiter = delimiter
self.progress_per = progress_per
self.scoring = scoring
self.corpus_word_count = 0

if sentences is not None:
@@ -222,8 +249,7 @@ def add_vocab(self, sentences):
# directly, but gives the new sentences a fighting chance to collect
# sufficient counts, before being pruned out by the (large) accummulated
# counts collected in previous learn_vocab runs.
min_reduce, vocab, total_words = \
self.learn_vocab(sentences, self.max_vocab_size, self.delimiter, self.progress_per)
min_reduce, vocab, total_words = self.learn_vocab(sentences, self.max_vocab_size, self.delimiter, self.progress_per)

self.corpus_word_count += total_words
if len(self.vocab) > 0:
@@ -258,16 +284,13 @@ def export_phrases(self, sentences, out_delimiter=b' ', as_tuples=False):
threshold = self.threshold
delimiter = self.delimiter # delimiter used for lookup
min_count = self.min_count
scoring = self.scoring
corpus_word_count = self.corpus_word_count
scorer = self.scoring
# made floats for scoring function
len_vocab = float(len(vocab))
scorer_min_count = float(min_count)
corpus_word_count = float(self.corpus_word_count)


if scoring == 'default':
scoring_function = \
partial(self.original_scorer, len_vocab=float(len(vocab)), min_count=float(min_count))
elif scoring == 'npmi':
scoring_function = \
partial(self.npmi_scorer, corpus_word_count=corpus_word_count)
# no else here to catch unknown scoring function, check is done in Phrases.__init__

for sentence in sentences:
s = [utils.any2utf8(w) for w in sentence]
@@ -281,11 +304,10 @@ def export_phrases(self, sentences, out_delimiter=b' ', as_tuples=False):
count_a = float(vocab[word_a])
count_b = float(vocab[word_b])
count_ab = float(vocab[bigram_word])
score = scoring_function(count_a, count_b, count_ab)
# scoring MUST have all these parameters, even if they are not used
score = scorer(worda_count=count_a, wordb_count=count_b, bigram_count=count_ab, len_vocab=len_vocab, min_count=scorer_min_count, corpus_word_count=corpus_word_count)
# logger.debug("score for %s: (pab=%s - min_count=%s) / pa=%s / pb=%s * vocab_size=%s = %s",
# bigram_word, pab, self.min_count, pa, pb, len(self.vocab), score)
# added mincount check because if the scorer doesn't contain min_count
# it would not be enforced otherwise
# bigram_word, count_ab, scorer_min_count, count_a, count_ab, len_vocab, score)
if score > threshold and count_ab >= min_count:
if as_tuples:
yield ((word_a, word_b), score)
@@ -316,6 +338,16 @@ def __getitem__(self, sentence):
"""
warnings.warn("For a faster implementation, use the gensim.models.phrases.Phraser class")

vocab = self.vocab
threshold = self.threshold
delimiter = self.delimiter # delimiter used for lookup
min_count = self.min_count
scorer = self.scoring
# made floats for scoring function
len_vocab = float(len(vocab))
scorer_min_count = float(min_count)
corpus_word_count = float(self.corpus_word_count)

is_single, sentence = _is_single(sentence)
if not is_single:
# if the input is an entire corpus (rather than a single sentence),
@@ -325,20 +357,20 @@ def __getitem__(self, sentence):
s, new_s = [utils.any2utf8(w) for w in sentence], []
last_bigram = False
vocab = self.vocab
threshold = self.threshold
delimiter = self.delimiter
min_count = self.min_count

for word_a, word_b in zip(s, s[1:]):
if word_a in vocab and word_b in vocab:
# last bigram check was moved here to save a few CPU cycles
if word_a in vocab and word_b in vocab and not last_bigram:
bigram_word = delimiter.join((word_a, word_b))
if bigram_word in vocab and not last_bigram:
pa = float(vocab[word_a])
pb = float(vocab[word_b])
pab = float(vocab[bigram_word])
score = (pab - min_count) / pa / pb * len(vocab)
if bigram_word in vocab:
count_a = float(vocab[word_a])
count_b = float(vocab[word_b])
count_ab = float(vocab[bigram_word])
# scoring MUST have all these parameters, even if they are not used
score = scorer(worda_count=count_a, wordb_count=count_b, bigram_count=count_ab, len_vocab=len_vocab, min_count=scorer_min_count, corpus_word_count=corpus_word_count)
# logger.debug("score for %s: (pab=%s - min_count=%s) / pa=%s / pb=%s * vocab_size=%s = %s",
# bigram_word, pab, self.min_count, pa, pb, len(self.vocab), score)
if score > threshold:
# bigram_word, count_ab, scorer_min_count, count_a, count_ab, len_vocab, score)
if score > threshold and count_ab >= min_count:
new_s.append(bigram_word)
last_bigram = True
continue
@@ -354,15 +386,15 @@

return [utils.to_unicode(w) for w in new_s]

# these two built-in scoring methods don't cast everything to float because the casting is done in the call
# to the scoring method in __getitem__ and export_phrases.

# calculation of score based on original mikolov word2vec paper
# len_vocab and min_count set so functools.partial works
@staticmethod
def original_scorer(worda_count, wordb_count, bigram_count, len_vocab=0.0, min_count=0.0):
def original_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
return (bigram_count - min_count) / worda_count / wordb_count * len_vocab
Owner:

Bad indent.


# normalized PMI, requires corpus size
@staticmethod
def npmi_scorer(worda_count, wordb_count, bigram_count, corpus_word_count=0.0):
def npmi_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
pa = worda_count / corpus_word_count
Owner:

Bad indent.

Contributor Author:

Sorry about these, very sloppy on my part.

pb = wordb_count / corpus_word_count
pab = bigram_count / corpus_word_count
12 changes: 6 additions & 6 deletions gensim/models/word2vec.py
@@ -1563,7 +1563,7 @@ def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
"""
`source` should be a path to a directory (as a string) where all files can be opened by the
LineSentence class. Each file will be read up to
`limit` lines (or no clipped if limit is None, the default).
`limit` lines (or in full, with no clipping, if `limit` is None, the default).
Owner:

The docs are not clear -- does the "will process all files in a directory" work recursively?

Contributor Author:

It does not. Maybe wishlist? I've clarified the docs.
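A recursive variant, if it ever makes the wishlist, could be sketched with `os.walk` (hypothetical helper; the merged code uses a flat `os.listdir` and does not descend into subdirectories):

```python
import os


def list_files_recursively(source):
    """Collect every file under `source`, descending into subdirectories."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(source):
        for filename in filenames:
            paths.append(os.path.join(dirpath, filename))
    # sorted for deterministic iteration order, as the flat version sorts too
    return sorted(paths)
```

The result could replace `self.input_files` directly, since the rest of the class only iterates over that list of full paths.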


Example::

@@ -1577,23 +1577,23 @@ def __init__(self, source, max_sentence_length=MAX_WORDS_IN_BATCH, limit=None):
self.limit = limit

if os.path.isfile(self.source):
logging.warning('single file read, better to use models.word2vec.LineSentence')
logger.warning('single file read, better to use models.word2vec.LineSentence')
Owner:

If the class API contract supports it, this is no warning (maybe debug, at most).

If it's outside the API contract, this is an error and we should raise an exception, not log a warning.

Contributor Author:

Clarified this message a bit, made it debug.

self.input_files = [self.source] # force code compatibility with list of files
elif os.path.isdir(self.source):
self.source = os.path.join(self.source, '') # ensures os-specific slash at end of path
logging.debug('reading directory ' + self.source)
logger.warning('reading directory %s', self.source)
Owner:

Why is this a warning?

Contributor Author:

I made a mistake, changing to info.

self.input_files = os.listdir(self.source)
self.input_files = [self.source + file for file in self.input_files] # make full paths
self.input_files = [self.source + filename for filename in self.input_files] # make full paths
self.input_files.sort() # makes sure it happens in filename order
else: # not a file or a directory, then we can't do anything with it
raise ValueError('input is neither a file nor a path')

logging.info('files read into PathLineSentences:' + '\n'.join(self.input_files))
logger.info('files read into PathLineSentences:%s', '\n'.join(self.input_files))

def __iter__(self):
'''iterate through the files'''
for file_name in self.input_files:
logging.info('reading file ' + file_name)
logger.info('reading file %s', file_name)
with utils.smart_open(file_name) as fin:
for line in itertools.islice(fin, self.limit):
line = utils.to_unicode(line).split()
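The `limit` handling in the loop above relies on `itertools.islice(fin, None)` iterating the whole stream; a minimal self-contained sketch of that per-file read loop:

```python
import io
import itertools


def read_sentences(fin, limit=None):
    # islice(fin, None) yields every line; islice(fin, n) stops after n lines
    for line in itertools.islice(fin, limit):
        yield line.strip().split()


fin = io.StringIO(u"a b\nc d\ne f\n")
print(list(read_sentences(fin, limit=2)))  # [['a', 'b'], ['c', 'd']]
```

This is why `limit=None` needs no special-casing in `PathLineSentences.__iter__`: `islice` already treats `None` as "no bound".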
26 changes: 25 additions & 1 deletion gensim/test/test_phrases.py
@@ -169,12 +169,20 @@ def testScoringDefault(self):
3.444 # score for human interface
])

def test__getitem__(self):
""" test Phrases[sentences] with a single sentence"""
bigram = Phrases(sentences, min_count=1, threshold=1)
test_sentences = [['graph', 'minors', 'survey', 'human', 'interface']]
phrased_sentence = next(bigram[test_sentences].__iter__())

assert phrased_sentence == ['graph_minors', 'survey', 'human_interface']

def testScoringNpmi(self):
""" test normalized pointwise mutual information scoring """
bigram = Phrases(sentences, min_count=1, threshold=.5, scoring='npmi')

seen_scores = set()

test_sentences = [['graph', 'minors', 'survey', 'human', 'interface']]
for phrase, score in bigram.export_phrases(test_sentences):
seen_scores.add(round(score, 3))
@@ -184,6 +192,22 @@ def testScoringNpmi(self):
.714 # score for human interface
])

def testCustomScorer(self):
""" test using a custom scoring function """
# all scores will be 1
def dumb_scorer(worda_count, wordb_count, bigram_count, len_vocab, min_count, corpus_word_count):
return 1

bigram = Phrases(sentences, min_count=1, threshold=.001, scoring=dumb_scorer)

seen_scores = []
test_sentences = [['graph', 'minors', 'survey', 'human', 'interface', 'system']]
for phrase, score in bigram.export_phrases(test_sentences):
seen_scores.append(score)

assert all(seen_scores) # all scores 1
assert len(seen_scores) == 3 # 'graph minors' and 'survey human' and 'interface system'

def testBadParameters(self):
"""Test the phrases module with bad parameters."""
# should fail with something less or equal than 0