FastText inconsistent normalization in adjust_vectors method vs word_vec #2537

Closed
vackosar opened this issue Jun 25, 2019 · 10 comments

@vackosar

Problem description

There is an inconsistency in normalization between FastTextKeyedVectors#adjust_vectors and WordEmbeddingsKeyedVectors#word_vec.

Steps/code/corpus to reproduce

adjust_vectors:

            word_vec /= len(ngram_hashes) + 1

word_vec method:

            return word_vec / len(ngram_hashes)
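
To make the apparent mismatch concrete, the two lines divide the accumulated vector by different counts. A minimal sketch (hypothetical helper names, not gensim API):

    import numpy as np

    def adjust_vectors_style(full_word_vec, ngram_vecs):
        # as in adjust_vectors: full-word vector plus ngram vectors,
        # divided by the number of ngrams plus one
        return (full_word_vec + np.sum(ngram_vecs, axis=0)) / (len(ngram_vecs) + 1)

    def word_vec_style(ngram_vecs):
        # as in word_vec: ngram vectors only, divided by the number of ngrams
        return np.sum(ngram_vecs, axis=0) / len(ngram_vecs)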

Versions

Linux-4.18.0-24-generic-x86_64-with-Ubuntu-18.04-bionic
Python 3.7.1 (default, Oct 22 2018, 11:21:55) 
[GCC 8.2.0]
NumPy 1.16.3
SciPy 1.2.1
gensim 3.7.3
FAST_VERSION 1
@vackosar
Author

The adjust_vectors method adds the subword representations to the words themselves, where the +1 seems to account for the original learned vector of the whole word.

@piskvorky
Owner

piskvorky commented Jun 27, 2019

@vackosar if that's the case, can you add a code comment in the relevant places? It might trip up others in the future too, so let's be explicit. CC @mpenkov, can you confirm?

@mpenkov
Collaborator

mpenkov commented Jun 28, 2019

I agree that the two implementations appear to be inconsistent. I think this is a bug.

This is what I rewrote when I worked on fastText a few months ago:

    word_vec = np.copy(self.vectors_vocab[v.index])
    ngram_hashes = ft_ngram_hashes(w, self.min_n, self.max_n, self.bucket, self.compatible_hash)
    for nh in ngram_hashes:
        word_vec += self.vectors_ngrams[nh]
    word_vec /= len(ngram_hashes) + 1
    self.vectors[v.index] = word_vec

This is the old version of the FastText code:

    ngrams_found = 0
    for ngram in ngrams:
        ngram_hash = _ft_hash(ngram) % self.bucket
        if ngram_hash in self.hash2index:
            word_vec += ngram_weights[self.hash2index[ngram_hash]]
            ngrams_found += 1
    if word_vec.any():
        return word_vec / max(1, ngrams_found)
    else:  # No ngrams of the word are present in self.ngrams
        raise KeyError('all ngrams for word %s absent from model' % word)

I think the change to make here is (see the sketch after this list):

  • Before: word_vec /= len(ngram_hashes) + 1
  • After: word_vec /= max(len(ngram_hashes), 1)
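
In context, the proposed change would look roughly like this (a sketch on top of the rewritten loop quoted above, with its variable names; untested):

    word_vec = np.copy(self.vectors_vocab[v.index])
    ngram_hashes = ft_ngram_hashes(w, self.min_n, self.max_n, self.bucket, self.compatible_hash)
    for nh in ngram_hashes:
        word_vec += self.vectors_ngrams[nh]
    # proposed: divide by the ngram count (guarding against zero) instead of
    # counting the full-word vector as an extra summand
    word_vec /= max(len(ngram_hashes), 1)
    self.vectors[v.index] = word_vec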

@piskvorky piskvorky reopened this Jun 28, 2019
@piskvorky
Owner

@vackosar can you open a PR?

@vackosar
Author

vackosar commented Jul 4, 2019

@piskvorky @mpenkov I will submit the PR, but are you sure about this?
In the original implementation the word itself seems to be appended to its n-grams, and then every time normalization happens they just divide by ngram.size(), which would mean there is a hidden "+1" there too. Unless you actually also include the word in its own n-grams here, in which case you would be adding another "+1".
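
For reference, the behaviour described above can be paraphrased roughly like this in Python (my reading of the reference implementation, with approximate names; not the actual C++ code):

    # The word's own id is appended to its subword (ngram) ids, so dividing
    # by the subword count already carries the hidden "+1".
    def reference_word_vector(word_id, ngram_ids, input_vectors):
        subwords = [word_id] + list(ngram_ids)         # word counts as its own subword
        vec = sum(input_vectors[i] for i in subwords)  # sum of all input vectors
        return vec / len(subwords)                     # divisor == len(ngram_ids) + 1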

@vackosar
Author

vackosar commented Jul 4, 2019

By the way, I posted a couple of FastText norm visualizations here, if you are interested.

@vackosar
Author

vackosar commented Jul 4, 2019

Here is the PR #2551

@mpenkov mpenkov self-assigned this Jun 10, 2020
@gojomo
Collaborator

gojomo commented Jun 19, 2020

There's no inconsistency. In the word_vec() method, the codepath where a simple return word_vec / len(ngram_hashes) occurs is only for OOV words, where no full-word trained vector (from vectors_vocab) exists. Thus it's only the average of the ngrams. OTOH, the adjust_vectors() codepath is specifically calculating the effective vectors only for in-vocabulary words, which are by definition composed from the trained full-word vector plus all the ngram-vectors. That is, one more vector is part of the average for in-vocabulary words.
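
To make the case split concrete, here is pseudocode mirroring the snippets quoted earlier (illustrative only, not the actual gensim source; ft_ngram_hashes is the helper used in those snippets):

    def effective_vector(kv, word):
        ngram_hashes = ft_ngram_hashes(word, kv.min_n, kv.max_n, kv.bucket, kv.compatible_hash)
        if word in kv.vocab:
            # adjust_vectors() path: in-vocabulary, so the trained full-word
            # vector is one of the summands -> divide by len(ngram_hashes) + 1
            vec = np.copy(kv.vectors_vocab[kv.vocab[word].index])
            for nh in ngram_hashes:
                vec += kv.vectors_ngrams[nh]
            return vec / (len(ngram_hashes) + 1)
        else:
            # word_vec() path for OOV: no full-word vector exists, so only
            # the ngram vectors are summed -> divide by len(ngram_hashes)
            vec = np.zeros(kv.vector_size, dtype=np.float32)
            for nh in ngram_hashes:
                vec += kv.vectors_ngrams[nh]
            return vec / len(ngram_hashes)

In both branches the divisor equals the number of vectors actually summed.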

Of the many oddities of the gensim FT implementation, this behavior (as noted by @vackosar in a comment above) is consistent with the FB reference implementation, so there's no bug, and no reason for the associated code-changing PR.

@mpenkov
Collaborator

mpenkov commented Jun 27, 2020

@gojomo So if there is no problem, we can close #2551?

@gojomo
Collaborator

gojomo commented Jun 27, 2020

> @gojomo So if there is no problem, we can close #2551?

Yes, no bug & thus no fix needed! (And my comment on the calculation there also points out the change doesn't match the number of summands.)
