FastText inconsistent normalization in adjust_vectors method vs word_vec #2537

Closed
vackosar opened this issue Jun 25, 2019 · 10 comments

@vackosar

Problem description

There is an inconsistency in normalization between FastTextKeyedVectors#adjust_vectors and WordEmbeddingsKeyedVectors#word_vec.

Steps/code/corpus to reproduce

adjust_vectors:

            word_vec /= len(ngram_hashes) + 1

word_vec method:

            return word_vec / len(ngram_hashes)
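
To make the apparent mismatch concrete, the two lines divide the accumulated vector by different counts. A minimal sketch (hypothetical helper names, not gensim API):

    import numpy as np

    def adjust_vectors_style(full_word_vec, ngram_vecs):
        # as in adjust_vectors: full-word vector plus ngram vectors,
        # divided by the number of ngrams plus one
        return (full_word_vec + np.sum(ngram_vecs, axis=0)) / (len(ngram_vecs) + 1)

    def word_vec_style(ngram_vecs):
        # as in word_vec: ngram vectors only, divided by the number of ngrams
        return np.sum(ngram_vecs, axis=0) / len(ngram_vecs)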

Versions

Linux-4.18.0-24-generic-x86_64-with-Ubuntu-18.04-bionic
Python 3.7.1 (default, Oct 22 2018, 11:21:55) 
[GCC 8.2.0]
NumPy 1.16.3
SciPy 1.2.1
gensim 3.7.3
FAST_VERSION 1
@vackosar
Author

The adjust_vectors method adds the subword representations to the words themselves, where the +1 seems to account for the original learned vector of the whole word.

@piskvorky
Owner

piskvorky commented Jun 27, 2019

@vackosar if that's the case, can you add a code comment in the relevant places? It might trip up others in the future too, so let's be explicit. CC @mpenkov, can you confirm?

@mpenkov
Collaborator

mpenkov commented Jun 28, 2019

I agree that the two implementations appear to be inconsistent. I think this is a bug.

This is what I rewrote when I worked on fastText a few months ago:

    word_vec = np.copy(self.vectors_vocab[v.index])
    ngram_hashes = ft_ngram_hashes(w, self.min_n, self.max_n, self.bucket, self.compatible_hash)
    for nh in ngram_hashes:
        word_vec += self.vectors_ngrams[nh]
    word_vec /= len(ngram_hashes) + 1
    self.vectors[v.index] = word_vec

This is the old version of the FastText code:

    ngrams_found = 0
    for ngram in ngrams:
        ngram_hash = _ft_hash(ngram) % self.bucket
        if ngram_hash in self.hash2index:
            word_vec += ngram_weights[self.hash2index[ngram_hash]]
            ngrams_found += 1
    if word_vec.any():
        return word_vec / max(1, ngrams_found)
    else:  # No ngrams of the word are present in self.ngrams
        raise KeyError('all ngrams for word %s absent from model' % word)

I think the change to make here is (see the sketch after this list):

  • Before: word_vec /= len(ngram_hashes) + 1
  • After: word_vec /= max(len(ngram_hashes), 1)
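
In context, the proposed change would look roughly like this (a sketch on top of the rewritten loop quoted above, with its variable names; untested):

    word_vec = np.copy(self.vectors_vocab[v.index])
    ngram_hashes = ft_ngram_hashes(w, self.min_n, self.max_n, self.bucket, self.compatible_hash)
    for nh in ngram_hashes:
        word_vec += self.vectors_ngrams[nh]
    # proposed: divide by the ngram count (guarding against zero) instead of
    # counting the full-word vector as an extra summand
    word_vec /= max(len(ngram_hashes), 1)
    self.vectors[v.index] = word_vec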

@piskvorky piskvorky reopened this Jun 28, 2019
@piskvorky
Owner

@vackosar can you open a PR?

@vackosar
Author

vackosar commented Jul 4, 2019

@piskvorky @mpenkov I will submit the PR, but are you sure about this?
In the original implementation the word itself seems to be appended to its n-grams, and then every time normalization happens they just divide by ngram.size(), which would mean there is a hidden "+1" there too. Unless you actually also include the word in its own n-grams here, in which case you would be adding another "+1".
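
For reference, the behaviour described above can be paraphrased roughly like this in Python (my reading of the reference implementation, with approximate names; not the actual C++ code):

    # The word's own id is appended to its subword (ngram) ids, so dividing
    # by the subword count already carries the hidden "+1".
    def reference_word_vector(word_id, ngram_ids, input_vectors):
        subwords = [word_id] + list(ngram_ids)         # word counts as its own subword
        vec = sum(input_vectors[i] for i in subwords)  # sum of all input vectors
        return vec / len(subwords)                     # divisor == len(ngram_ids) + 1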

@vackosar
Author

vackosar commented Jul 4, 2019

By the way, I posted a couple of FastText norm visualizations here, if you are interested.

@vackosar
Author

vackosar commented Jul 4, 2019

Here is the PR #2551

@mpenkov mpenkov self-assigned this Jun 10, 2020
@gojomo
Collaborator

gojomo commented Jun 19, 2020

There's no inconsistency. In the word_vec() method, the codepath where a simple return word_vec / len(ngram_hashes) occurs is only for OOV words, where no full-word trained vector (from vectors_vocab) exists. Thus it's only the average of the ngrams. OTOH, the adjust_vectors() codepath is specifically calculating the effective vectors only for in-vocabulary words, which are by definition composed from the trained full-word vector plus all the ngram-vectors. That is, one more vector is part of the average for in-vocabulary words.
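
To make the case split concrete, here is pseudocode mirroring the snippets quoted earlier (illustrative only, not the actual gensim source; ft_ngram_hashes is the helper used in those snippets):

    def effective_vector(kv, word):
        ngram_hashes = ft_ngram_hashes(word, kv.min_n, kv.max_n, kv.bucket, kv.compatible_hash)
        if word in kv.vocab:
            # adjust_vectors() path: in-vocabulary, so the trained full-word
            # vector is one of the summands -> divide by len(ngram_hashes) + 1
            vec = np.copy(kv.vectors_vocab[kv.vocab[word].index])
            for nh in ngram_hashes:
                vec += kv.vectors_ngrams[nh]
            return vec / (len(ngram_hashes) + 1)
        else:
            # word_vec() path for OOV: no full-word vector exists, so only
            # the ngram vectors are summed -> divide by len(ngram_hashes)
            vec = np.zeros(kv.vector_size, dtype=np.float32)
            for nh in ngram_hashes:
                vec += kv.vectors_ngrams[nh]
            return vec / len(ngram_hashes)

In both branches the divisor equals the number of vectors actually summed.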

Of the many oddities of the gensim FT implementation, this behavior (as noted by @vackosar in a comment above) is consistent with the FB reference implementation, so there's no bug, and no reason for the associated code-changing PR.

@mpenkov
Collaborator

mpenkov commented Jun 27, 2020

@gojomo So if there is no problem, we can close #2551?

@gojomo
Collaborator

gojomo commented Jun 27, 2020

> @gojomo So if there is no problem, we can close #2551?

Yes, no bug & thus no fix needed! (And my comment on the calculation there also points out the change doesn't match the number of summands.)
