Question to MultipleNegativesRankingLoss and cosine similarity #3179
Hello! The expectation is not that the model is good at a loss of 0 and bad at higher losses. We just need to ensure that lowering the loss corresponds well with getting a model that works better for our use cases, which is the case here. In short, I don't believe this hurts the model's convergence, but I will admit that I'm not an expert. As for the scale argument: it multiplies the cosine similarities before the cross-entropy, so the softmax is not restricted to inputs in [-1, 1].
At least, that is my understanding. It is relatively common to pick a scale between 20 and 50, I believe. See also #3054. I hope this helps a little bit.
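As a hypothetical illustration of the scale's effect (this sketch is not from the original comment; it just assumes the usual setup of cross-entropy over scaled cosine similarities):

```python
# Hypothetical illustration of a `scale` factor used with cross-entropy: a
# "perfect" batch (positive cosim = 1, negatives cosim = -1) only reaches a
# loss near 0 once the similarities are scaled up.
import torch
import torch.nn.functional as F

cos_sims = torch.tensor([[1.0, -1.0, -1.0, -1.0]])  # one anchor, positive first
label = torch.tensor([0])                            # index of the positive

for scale in (1.0, 20.0, 50.0):
    loss = F.cross_entropy(cos_sims * scale, label)
    print(f"scale={scale:>4}: loss={loss.item():.2e}")

# scale= 1.0: loss ≈ 3.4e-01  (far from 0 despite perfect embeddings)
# scale=20.0: loss ≈ 1.3e-17  (effectively 0)
# scale=50.0: loss ≈ 0
```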
Thanks for the answer, Tom. Here is a small addition to my previous code.
You can see that following the gradient leads to a worsening of the result. I strongly suspect that this is due to rounding errors and an obnoxiously high learning rate, but I still think there is a point here. My intuition is that treating cosine similarity like a log probability is not appropriate. Cosine similarity is a measurement of angle and is thus bounded by [-1, 1]. In a setting where we observe a similarity of -1, the probability assigned to that sample should be 0. With the softmax function, however, this is not the case.
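A minimal hypothetical illustration of that last point (not the original snippet):

```python
# Hypothetical illustration: a "perfect" negative (cosim = -1) still receives
# a clearly non-zero probability when raw cosine similarities are fed to softmax.
import torch
import torch.nn.functional as F

cos_sims = torch.tensor([1.0, -1.0])        # perfect positive, perfect negative
print(F.softmax(cos_sims, dim=0))           # tensor([0.8808, 0.1192])

loss = F.cross_entropy(cos_sims.unsqueeze(0), torch.tensor([0]))
print(loss.item())                          # ≈ 0.127, not 0
```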
NOTE: The misalignment between softmax and cosine similarity gets worse, by the way, with the number of classes. In the following example, I print the probability of the perfect positive sample (cosim = 1) while adding more and more perfect negatives (cosim = -1):
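A hypothetical sketch of the experiment being described (not the original code):

```python
# Hypothetical sketch: the probability assigned to a perfect positive
# (cosim = 1) shrinks as more perfect negatives (cosim = -1) are added,
# even though the embeddings cannot get any better.
import torch
import torch.nn.functional as F

for n_neg in (1, 4, 16, 64, 256):
    cos_sims = torch.tensor([1.0] + [-1.0] * n_neg)
    p_pos = F.softmax(cos_sims, dim=0)[0].item()
    print(f"{n_neg:>3} negatives: p(positive) = {p_pos:.3f}")

#   1 negatives: p(positive) = 0.881
#   4 negatives: p(positive) = 0.649
#  16 negatives: p(positive) = 0.316
#  64 negatives: p(positive) = 0.103
# 256 negatives: p(positive) = 0.028
```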
As I mentioned earlier, I am constructing edge cases here, and I am not sure how much this misalignment affects real-world training. Nonetheless, cosine similarities are simply not logits and probably shouldn't be treated as such. Looking forward to your opinion on this.
Just had a look at the math again. Not my strong suit, but this should work to mitigate the problem:
Softmax expects logits, so we can take our cosim output and generate logits from it. Probably there is a better way, though.
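One hypothetical way to read "generate logits from the cosine similarities" (not necessarily the approach originally posted):

```python
# Hypothetical sketch: map cosine similarities from [-1, 1] to [0, 1] and take
# the log, so that after softmax a similarity of -1 ends up with (numerically
# clamped) probability 0 instead of a sizeable one.
import torch
import torch.nn.functional as F

def cosim_to_logits(cos_sims: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    # (cosim + 1) / 2 lies in [0, 1]; its log acts as a logit whose softmax
    # reproduces those values up to renormalization.
    return torch.log((cos_sims + 1.0) / 2.0 + eps)

cos_sims = torch.tensor([[1.0, -1.0]])
print(F.softmax(cos_sims, dim=1))                    # tensor([[0.8808, 0.1192]])
print(F.softmax(cosim_to_logits(cos_sims), dim=1))   # ≈ tensor([[1.0000, 0.0000]])
```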
I wondered how reasonable it is to use the MultipleNegativesRankingLoss with cosine similarity and cross-entropy loss.
As I understand it, even in a perfect scenario where our model embedded the negative examples opposite to the positive one, there is still a considerable loss as cosine similarity is bound by [-1, 1].
Here is an example:
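A hypothetical sketch of what such an example can look like (not the original snippet), assuming cross-entropy over raw cosine similarities:

```python
# Hypothetical reconstruction of the "perfect scenario": the positive has
# cosim = 1 and every negative has cosim = -1, yet the cross-entropy loss over
# the raw cosine similarities is still clearly above 0.
import torch
import torch.nn.functional as F

cos_sims = torch.tensor([[1.0, -1.0, -1.0]])  # positive first, then negatives
labels = torch.tensor([0])

loss = F.cross_entropy(cos_sims, labels)
print(loss.item())  # ≈ 0.24
```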
I wondered whether this hurts the model's convergence, and whether a different loss function would better account for the upper and lower bounds of the cosine similarity.
I am still new to this, so please correct me if I am wrong with my example.
Thanks.
EDIT: Is that why you use the scale argument?