Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Fixes incorrect vectors learned during online training for FastText models #1756

Merged
merged 3 commits into from
Dec 5, 2017
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion gensim/models/fasttext.py
Original file line number Diff line number Diff line change
Expand Up @@ -228,7 +228,7 @@ def __getitem__(self, word):

def get_vocab_word_vecs(self):
for w, v in self.wv.vocab.items():
word_vec = self.wv.syn0_vocab[v.index]
word_vec = np.copy(self.wv.syn0_vocab[v.index])
ngrams = self.wv.ngrams_word[w]
ngram_weights = self.wv.syn0_ngrams
for ngram in ngrams:
Expand Down
2 changes: 2 additions & 0 deletions gensim/test/test_fasttext.py
Original file line number Diff line number Diff line change
Expand Up @@ -419,6 +419,8 @@ def online_sanity(self, model):
self.assertTrue(all(['terrorism' not in l for l in others]))
model.build_vocab(others)
model.train(others, total_examples=model.corpus_count, epochs=model.iter)
# checks that `syn0` is different from `syn0_vocab`
self.assertFalse(np.all(np.equal(model.wv.syn0, model.wv.syn0_vocab)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a rather weird check. Nobody really cares if syn0 or syn0_vocab are different. What we do care about is that syn0_vocab is incorrect because it gets overwritten by get_vocab_word_vecs(). A more meaningful check would be either asserting certain properties of syn0_vocab that should hold or checking that syn0_vocab is the same before and after calling get_vocab_word_vecs().

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What @janpom said + check out np.allclose :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add another test case but let us also retain the np.equal check for syn0 and syn0_vocab.

self.assertFalse('terrorism' in model.wv.vocab)
self.assertFalse('orism>' in model.wv.ngrams)
model.build_vocab(terro, update=True) # update vocab
Expand Down