Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix WordEmbeddingsKeyedVectors.most_similar #2356

Merged
merged 3 commits into from
Jan 27, 2019

Conversation

Witiko
Copy link
Contributor

@Witiko Witiko commented Jan 25, 2019

This PR fixes two issues discovered in #2105:

  • The gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar method returns an unexpected result (a numeric array rather than an empty list) when topn=0.
  • The example code for gensim.similarities.SoftCosineSimilarity needs fixing.

@Witiko Witiko changed the title Fix gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar Fix WordEmbeddingsKeyedVectors.most_similar Jan 25, 2019
@Witiko Witiko force-pushed the fix-most_similar branch 2 times, most recently from cade7bd to 4463d31 Compare January 25, 2019 18:34
@piskvorky
Copy link
Owner

piskvorky commented Jan 25, 2019

Thanks a lot @Witiko ! Does this fix the reported issue fully, or are there still lingering/unresolved issues?
I remember reading something about a problem with backward compatibility.

@Witiko Witiko closed this Jan 25, 2019
@Witiko
Copy link
Contributor Author

Witiko commented Jan 25, 2019

@piskvorky This PR fixes the new issue reported today in #2105 with no apparent backwards-compatibility problems. This PR does not fix the originally reported issue in #2105, because the poster never got back to us and we were unable to reproduce their issue. Nevertheless, that issue is long past its expiry date, since the code that it refers to is no longer even part of Gensim. Therefore, this PR should close #2105.

There is something that should be discussed before a merge. Before to this PR, WordEmbeddingsKeyedVectors.most_similar would return an array of similarities for all terms in a vocabulary when topn=0. This behavior was undocumented:

>>> from gensim.models.word2vec import Word2Vec
>>> from gensim.test.utils import common_texts
>>> 
>>> model = Word2Vec(common_texts, size=20, min_count=1)
>>> model.wv.most_similar(positive=['computer'], topn=2)
[('response', 0.38100379705429077), ('minors', 0.3752439618110657)]
>>> model.wv.most_similar(positive=['computer'], topn=0)
array([-0.1180886 ,  0.32174808, -0.02938104, -0.21145007,  0.37524396,
       -0.23777878,  0.99999994, -0.01436211,  0.36708638, -0.09770551,
        0.05963777,  0.3810038 ], dtype=float32)

This PR enforces the following documented behavior:

>>> model.wv.most_similar(positive=['computer'], topn=0)
[]

However, the unit tests reveal that the undocumented behavior is expected:

================================== FAILURES ===================================
______________ TestEuclideanKeyedVectors.test_most_similar_topn _______________
self = <gensim.test.test_keyedvectors.TestEuclideanKeyedVectors testMethod=test_most_similar_topn>
    def test_most_similar_topn(self):
        """Test most_similar returns correct results when `topn` is specified."""
        self.assertEqual(len(self.vectors.most_similar('war', topn=5)), 5)
        self.assertEqual(len(self.vectors.most_similar('war', topn=10)), 10)
    
        predicted = self.vectors.most_similar('war', topn=None)
>       self.assertEqual(len(predicted), len(self.vectors.vocab))
E       AssertionError: 0 != 2747
predicted  = []
self       = <gensim.test.test_keyedvectors.TestEuclideanKeyedVectors testMethod=test_most_similar_topn>
gensim\test\test_keyedvectors.py:107: AssertionError
============================== warnings summary ===============================
== 1 failed, 948 passed, 94 skipped, 74 warnings, 3 rerun in 1070.41 seconds ==
ERROR: InvocationError for command 'C:\\projects\\gensim-431bq\\.tox\\py27-win\\Scripts\\pytest.EXE gensim/test' (exited with code 1)
py27-win finish: runtests after 1085.62 seconds
py27-win start: run-test-post 
py27-win finish: run-test-post after 0.00 seconds
___________________________________ summary ___________________________________
ERROR:   py27-win: commands failed
Command exited with code 1

This PR removes the unit test. A more conservative approach would be to leave WordEmbeddingsKeyedVectors.most_similar alone, document the behavior, and patch the caller to special-case topn=0. Someone with more insight into the code base should probably decide this.

@Witiko Witiko reopened this Jan 25, 2019
@piskvorky
Copy link
Owner

piskvorky commented Jan 25, 2019

That's weird. Why would anybody call most_similar with topn=0? How can that happen? Error in user code?

The topn=None case makes some sense (user is asking for all results; 0 probably falls under None, my guess would be the check is only if topn: instead of if topn is None:). I don't know if it's worth keeping around, why this special case was added. Are there any comments?

The backward compatibility issue I was talking about is here:
https://groups.google.com/forum/#!topic/gensim/Z4nHzwCuUl0

@Witiko
Copy link
Contributor Author

Witiko commented Jan 25, 2019

We are calling most_similar with topn=0 when constructing a similarity matrix column that already contains the maximum number of nonzero elements. This can happen when symmetric=True.

The “backward compatibility issue” is related to 3f04940. Before Gensim 3.7, most_similar would also be called with topn=0, but the caller code was more defensive and would return an empty slice regardless of what most_similar would return. In Gensim 3.7, the code is less defensive and expects most_similar to return an empty list when topn=0 as documented. To summarize, this PR fixes a new issue that appeared in Gensim 3.7. Backwards compatibility is retained.

@Witiko
Copy link
Contributor Author

Witiko commented Jan 25, 2019

According to the unit tests, no other code relies on the undocumented behavior. There are no comments in the code.

@Witiko
Copy link
Contributor Author

Witiko commented Jan 25, 2019

To solve the dilemma, I added back the support for the undocumented topn=None behavior in 06781e2. It is still undocumented, but at least we are not breaking anyone's code unnecessarily.

@tvrbanec
Copy link

tvrbanec commented Jan 25, 2019

I am still confused.
I wanted to apply fix in my /usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py

First I saw that the line numbers are different: it is 1401, not as patch declares 1370.

Then I saw another difference: what is supposed to be a patch, there is already! And is still making an error:

    similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)  # construct similarity matrix
  File "/usr/local/lib/python2.7/dist-packages/gensim/similarities/termsim.py", line 234, in __init__
    for term, similarity in index.most_similar(t1, num_rows)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 1401, in most_similar
    for t2, similarity in most_similar:
TypeError: 'numpy.float32' object is not iterable

@Witiko
Copy link
Contributor Author

Witiko commented Jan 25, 2019

@tvrbanec In your case, applying 01c5336 should be enough:

diff --git a/gensim/models/keyedvectors.py b/gensim/models/keyedvectors.py
index ff79212c..ad9c90c7 100644
--- a/gensim/models/keyedvectors.py
+++ b/gensim/models/keyedvectors.py
@@ -511,6 +511,9 @@ class WordEmbeddingsKeyedVectors(BaseKeyedVectors):
             Sequence of (word, similarity).
 
         """
+        if topn < 1:
+            return []
+
         if positive is None:
             positive = []
         if negative is None:
@@ -550,8 +553,6 @@ class WordEmbeddingsKeyedVectors(BaseKeyedVectors):
 
         limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
         dists = dot(limited, mean)
-        if not topn:
-            return dists
         best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)
         # ignore (don't return) words from the input
         result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words]

Line numbers may differ, since you are using the release version of Gensim, whereas the patches are against the development branch of Gensim. Any luck?

@tvrbanec
Copy link

tvrbanec commented Jan 25, 2019

OK. I tried. First and second patch wasn't big deal. But third, which is probably the key one can't be done. There is no part that should be changed:
if topn < 1:
and there are no surroundings of code that needs to be insert.

I have two options now:

  1. To give up until the official patch version (3.7.1?) or the new version (3.8) is released. I hope you will not wait long?
  2. Switch to the developer version. This is not possible because I test on three completely different machines, on one I have only VI editor. Too much effort and time.

@Witiko
Copy link
Contributor Author

Witiko commented Jan 25, 2019

There is no part that should be changed:

if topn < 1:

and there are no surroundings of code that needs to be insert.

The if topn < 1: part is added by the first patch (01c5336), are you sure you applied it correctly?

However, just applying the single patch above (manually adding three lines and deleting two lines) should do the trick. No need to apply the entire pull request. 🙂

@tvrbanec
Copy link

tvrbanec commented Jan 25, 2019

After applying patch above.

    similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)  # construct similarity matrix
  File "/usr/local/lib/python2.7/dist-packages/gensim/similarities/termsim.py", line 234, in __init__
    for term, similarity in index.most_similar(t1, num_rows)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 1403, in most_similar
    most_similar = self.keyedvectors.most_similar(positive=[t1], topn=topn, **self.kwargs)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 550, in most_similar
    raise ValueError("cannot compute similarity with no input")
ValueError: cannot compute similarity with no input

@Witiko
Copy link
Contributor Author

Witiko commented Jan 25, 2019

I cannot reproduce your issue. Here is what I did:

  1. I installed Gensim 3.7.0 using Python 2.7:
    $ git clone https://github.com/RaRe-Technologies/gensim.git
    $ cd gensim
    $ git checkout 3.7.0
    $ pip install .
    
  2. I applied the patch:
    $ git apply
    diff --git a/gensim/models/keyedvectors.py b/gensim/models/keyedvectors.py
    index ff79212c..ad9c90c7 100644
    --- a/gensim/models/keyedvectors.py
    +++ b/gensim/models/keyedvectors.py
    @@ -511,6 +511,9 @@ class WordEmbeddingsKeyedVectors(BaseKeyedVectors):
                 Sequence of (word, similarity).
     
             """
    +        if topn < 1:
    +            return []
    +
             if positive is None:
                 positive = []
             if negative is None:
    @@ -550,8 +553,6 @@ class WordEmbeddingsKeyedVectors(BaseKeyedVectors):
     
             limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
             dists = dot(limited, mean)
    -        if not topn:
    -            return dists
             best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)
             # ignore (don't return) words from the input
             result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words]
  3. Then, I ran your code:
    >>> from gensim.corpora import Dictionary
    >>> from gensim.models.keyedvectors import WordEmbeddingSimilarityIndex
    >>> from gensim.models.word2vec import Word2Vec
    >>> from gensim.similarities import SparseTermSimilarityMatrix
    >>> 
    >>> corpus = [['how', 'exile', 'selfie', ...]]  # from <https://pastebin.com/at7idpCV>
    >>> model = Word2Vec(corpus, size=20, min_count=1)
    >>> termsim_index = WordEmbeddingSimilarityIndex(model.wv)
    >>> dictionary = Dictionary(corpus)
    >>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
    >>> similarity_matrix
    <gensim.similarities.termsim.SparseTermSimilarityMatrix object at 0x7eff8d0d4e10>

Does the above procedure produce a working setup for you?

@tvrbanec
Copy link

Hm.

    from gensim import corpora
  File "/usr/local/lib/python2.7/dist-packages/gensim/__init__.py", line 5, in <module>
    from gensim import parsing, corpora, matutils, interfaces, models, similarities, summarization, utils  # noqa:F401
  File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/__init__.py", line 6, in <module>
    from .indexedcorpus import IndexedCorpus  # noqa:F401 must appear before the other classes
  File "/usr/local/lib/python2.7/dist-packages/gensim/corpora/indexedcorpus.py", line 15, in <module>
    from gensim import interfaces, utils
  File "/usr/local/lib/python2.7/dist-packages/gensim/interfaces.py", line 21, in <module>
    from gensim import utils, matutils
  File "/usr/local/lib/python2.7/dist-packages/gensim/matutils.py", line 1082, in <module>
    from gensim._matutils import logsumexp, mean_absolute_difference, dirichlet_expectation
  File "__init__.pxd", line 918, in init gensim._matutils
ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

@tvrbanec
Copy link

Never mind. I'll go back to 3.6. Thanks for time and energy. I just don't have it more. I hope new release will come soon.

@Witiko
Copy link
Contributor Author

Witiko commented Jan 25, 2019

If you are installing to the same python environment as before, you may need to use pip install --force-reinstall --ignore-installed . to force the reinstall. I am sorry about the broken feature. Hopefully, you will regain your appetite for tinkering with Gensim in a couple of days. 🙂

@Witiko
Copy link
Contributor Author

Witiko commented Jan 27, 2019

This PR should be good to merge.

@menshikh-iv menshikh-iv merged commit 0bc2ca6 into piskvorky:develop Jan 27, 2019
@menshikh-iv
Copy link
Contributor

Thanks for the fast fix @Witiko

@tvrbanec
Copy link

tvrbanec commented Feb 2, 2019

It seems to me that everything is fine in 3.7.1.

@Witiko
Copy link
Contributor Author

Witiko commented Feb 2, 2019

I am glad to hear that. Then #2105 should be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants