-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix WordEmbeddingsKeyedVectors.most_similar #2356
Conversation
cade7bd
to
4463d31
Compare
Thanks a lot @Witiko ! Does this fix the reported issue fully, or are there still lingering/unresolved issues? |
@piskvorky This PR fixes the new issue reported today in #2105 with no apparent backwards-compatibility problems. This PR does not fix the originally reported issue in #2105, because the poster never got back to us and we were unable to reproduce their issue. Nevertheless, that issue is long past its expiry date, since the code that it refers to is no longer even part of Gensim. Therefore, this PR should close #2105. There is something that should be discussed before a merge. Before to this PR, >>> from gensim.models.word2vec import Word2Vec
>>> from gensim.test.utils import common_texts
>>>
>>> model = Word2Vec(common_texts, size=20, min_count=1)
>>> model.wv.most_similar(positive=['computer'], topn=2)
[('response', 0.38100379705429077), ('minors', 0.3752439618110657)]
>>> model.wv.most_similar(positive=['computer'], topn=0)
array([-0.1180886 , 0.32174808, -0.02938104, -0.21145007, 0.37524396,
-0.23777878, 0.99999994, -0.01436211, 0.36708638, -0.09770551,
0.05963777, 0.3810038 ], dtype=float32) This PR enforces the following documented behavior: >>> model.wv.most_similar(positive=['computer'], topn=0)
[] However, the unit tests reveal that the undocumented behavior is expected:
This PR removes the unit test. A more conservative approach would be to leave |
4463d31
to
f4a2d00
Compare
That's weird. Why would anybody call The The backward compatibility issue I was talking about is here: |
We are calling The “backward compatibility issue” is related to 3f04940. Before Gensim 3.7, |
According to the unit tests, no other code relies on the undocumented behavior. There are no comments in the code. |
5275ccb
to
916013b
Compare
To solve the dilemma, I added back the support for the undocumented |
916013b
to
06781e2
Compare
I am still confused. First I saw that the line numbers are different: it is 1401, not as patch declares 1370. Then I saw another difference: what is supposed to be a patch, there is already! And is still making an error:
|
@tvrbanec In your case, applying 01c5336 should be enough: diff --git a/gensim/models/keyedvectors.py b/gensim/models/keyedvectors.py
index ff79212c..ad9c90c7 100644
--- a/gensim/models/keyedvectors.py
+++ b/gensim/models/keyedvectors.py
@@ -511,6 +511,9 @@ class WordEmbeddingsKeyedVectors(BaseKeyedVectors):
Sequence of (word, similarity).
"""
+ if topn < 1:
+ return []
+
if positive is None:
positive = []
if negative is None:
@@ -550,8 +553,6 @@ class WordEmbeddingsKeyedVectors(BaseKeyedVectors):
limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
dists = dot(limited, mean)
- if not topn:
- return dists
best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)
# ignore (don't return) words from the input
result = [(self.index2word[sim], float(dists[sim])) for sim in best if sim not in all_words] Line numbers may differ, since you are using the release version of Gensim, whereas the patches are against the development branch of Gensim. Any luck? |
OK. I tried. First and second patch wasn't big deal. But third, which is probably the key one can't be done. There is no part that should be changed: I have two options now:
|
The However, just applying the single patch above (manually adding three lines and deleting two lines) should do the trick. No need to apply the entire pull request. 🙂 |
After applying patch above.
|
I cannot reproduce your issue. Here is what I did:
Does the above procedure produce a working setup for you? |
Hm.
|
Never mind. I'll go back to 3.6. Thanks for time and energy. I just don't have it more. I hope new release will come soon. |
If you are installing to the same python environment as before, you may need to use |
This PR should be good to merge. |
Thanks for the fast fix @Witiko |
It seems to me that everything is fine in 3.7.1. |
I am glad to hear that. Then #2105 should be closed. |
This PR fixes two issues discovered in #2105:
gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar
method returns an unexpected result (a numeric array rather than an empty list) whentopn=0
.gensim.similarities.SoftCosineSimilarity
needs fixing.