Question about MultipleNegativesRankingLoss and cosine similarity #3179

lmj0415 opened this issue Jan 19, 2025 · 3 comments


lmj0415 commented Jan 19, 2025

I wondered how reasonable it is to use the MultipleNegativesRankingLoss with cosine similarity and cross-entropy loss.

As I understand it, even in a perfect scenario where the model embeds the negative examples exactly opposite to the positive one, there is still a considerable loss, since cosine similarity is bounded by [-1, 1].

Here is an example:

import torch

sim = torch.nn.CosineSimilarity(dim=1)
ce_loss = torch.nn.CrossEntropyLoss()

# anchor embedding
a = torch.randn(1, 768, requires_grad=True)

# candidates: the anchor itself (the positive) and two perfectly opposite negatives
c = torch.cat([a.detach(), a.detach() * -1, a.detach() * -1])
label = torch.tensor(0)

scores = sim(a, c)             # cosine similarities: [1, -1, -1]
loss = ce_loss(scores, label)
print(loss)                    # ~0.2395 despite the "perfect" embeddings
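
For reference (if my math is right), with unscaled cosine scores the loss here can never drop below log(1 + 2·e⁻²):

import math

# cross-entropy with scores of exactly [1, -1, -1] and no scaling:
# CE = -log(e^1 / (e^1 + 2 * e^-1)) = log(1 + 2 * e^-2)
print(math.log(1 + 2 * math.exp(-2)))  # ~0.2395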

I wonder whether this hurts the model's convergence and whether a different loss function would take the upper and lower bounds of the cosine similarity better into account.

I am still new to this, so please correct me if I am wrong with my example.

Thanks.

EDIT: Is that why you use the scale argument?

tomaarsen (Collaborator) commented:

Hello!

The expectation is not that the model is good at a loss of 0 and bad at higher losses. We just need to ensure that lowering the loss corresponds well with getting a model that works better for our use cases, which is the case here.
For example, you could add +100 to the loss, and the model would train identically to how it would without it.
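
A quick way to convince yourself of that (a minimal, self-contained check):

import torch

x = torch.randn(4, requires_grad=True)

# gradient of a toy loss with a +100 offset
((x ** 2).sum() + 100).backward()
grad_with_offset = x.grad.clone()

# gradient of the same loss without the offset
x.grad = None
(x ** 2).sum().backward()

print(torch.allclose(grad_with_offset, x.grad))  # True: the constant changes nothing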

In short, I don't believe this hurts the model's convergence, but I will admit that I'm not an expert.

As for the scale - this parameter is the inverse of the temperature that you might see elsewhere (i.e. scale = 1/temperature), and it roughly corresponds to:

  • A higher scale in MNRL/InfoNCE/in-batch negatives loss should result in higher focus on the positive example.
  • A lower scale in MNRL/InfoNCE/in-batch negatives loss should result in a more general distribution over the positive and negative examples.

At least, that is my understanding. It is relatively common to pick a scale between 20 and 50, I believe. See also #3054.
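
In code that is just the scale argument on the loss (a sketch; the model name is only a placeholder):

from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
# the scale multiplies the similarity scores before the cross-entropy / softmax step
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)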

I hope this helps a little bit.

  • Tom Aarsen


lmj0415 commented Jan 22, 2025

Thanks for the answer, Tom.

Here is a small addition to my previous code:

import torch

sim = torch.nn.CosineSimilarity(dim=1)
ce_loss = torch.nn.CrossEntropyLoss()
lr = 1e+10  # deliberately huge learning rate

# anchor
a = torch.randn(1, 768, requires_grad=True)
_a = a.detach()

# candidates
c = torch.cat([_a, _a * -1, _a * -1])
label = torch.tensor(0)

# MultipleNegativesRankingLoss
scores = sim(a, c)
print(scores)
loss = ce_loss(scores, label)
print(loss)

# backward
loss.backward()
a = a - lr * a.grad

# similarity to candidates after learning step
print(sim(a, c))

# OUTPUT
'''
tensor([ 1.0000, -1.0000, -1.0000], grad_fn=<SumBackward1>)
tensor(0.2395, grad_fn=<NllLossBackward0>)
tensor([ 0.9258, -0.9258, -0.9258], grad_fn=<SumBackward1>)
'''

You can see that following the gradient leads to a worse result. I strongly suspect this is due to rounding errors and an obnoxiously large lr, but I still think there is a point here. My intuition is that treating cosine similarity like a log probability is not appropriate. Cosine similarity is a measure of angle and is thus bounded by [-1, 1]. In a setting where we observe a similarity of -1, the probability assigned to that candidate should be 0. With the softmax function, however, this is not the case.

# scores = cosine similarities [1.0, -1.0, -1.0]
# what softmax does with these raw scores
print(torch.softmax(scores, dim=0))

# OUTPUT
'''
tensor([0.7870, 0.1065, 0.1065], grad_fn=<SoftmaxBackward0>)
'''

The scale parameter (besides the properties explained in #3054) helps push the output of the softmax function closer to what it should be.

print(torch.softmax(scores * 10, dim=0))
print(torch.softmax(scores * 20, dim=0))
print(torch.softmax(scores * 30, dim=0))

# OUTPUT
'''
tensor([1.0000e+00, 2.0611e-09, 2.0611e-09], grad_fn=<SoftmaxBackward0>)
tensor([1.0000e+00, 4.2483e-18, 4.2483e-18], grad_fn=<SoftmaxBackward0>)
tensor([1.0000e+00, 8.7564e-27, 8.7564e-27], grad_fn=<SoftmaxBackward0>)
'''

NOTE: The misalignment between softmax and cosine similarity gets worse, by the way, as the number of classes grows. In the following example, I print the probability of the perfect positive sample (cos sim = 1) while adding 0, 10, 20, ..., 90 perfect negative examples (cos sim = -1).

for no_cls in range(0, 100, 10):
    a = torch.tensor([1] + [-1]* no_cls, dtype=float)
    print(torch.softmax(a, dim=0)[0])

# OUTPUT
'''
tensor(1., dtype=torch.float64)
tensor(0.4249, dtype=torch.float64)
tensor(0.2698, dtype=torch.float64)
tensor(0.1976, dtype=torch.float64)
tensor(0.1559, dtype=torch.float64)
tensor(0.1288, dtype=torch.float64)
tensor(0.1096, dtype=torch.float64)
tensor(0.0955, dtype=torch.float64)
tensor(0.0846, dtype=torch.float64)
tensor(0.0759, dtype=torch.float64)
'''
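
(Repeating the same toy experiment with a scale of 20 applied suggests the scale does counteract this effect, at least here: every printed value comes out as 1.0 to four decimal places.)

for no_cls in range(0, 100, 10):
    a = torch.tensor([1] + [-1] * no_cls, dtype=float)
    # same setup as above, but scaled by 20 before the softmax
    print(torch.softmax(a * 20, dim=0)[0])  # ~1.0 for every class count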

As I mentioned earlier, I am constructing edge-case examples here, and I am not sure how much this misalignment affects real-world training. Nonetheless, cosine similarities are not logits and probably shouldn't be treated as such. Looking forward to your opinion on this.


lmj0415 commented Jan 22, 2025

Just had a look at the math again. Not my strong suit, but this should work to mitigate the problem:

for no_cls in range(0, 100, 10):
    a = torch.tensor([1] + [-1] * no_cls, dtype=float)
    # map [-1, 1] to (-inf, log(2)]: a perfect negative becomes -inf,
    # which softmax turns into a probability of exactly 0
    norm_a = torch.log(a + 1)
    print(torch.softmax(norm_a, dim=0)[0])

# OUTPUT
'''
tensor(1., dtype=torch.float64)
tensor(1., dtype=torch.float64)
tensor(1., dtype=torch.float64)
tensor(1., dtype=torch.float64)
tensor(1., dtype=torch.float64)
tensor(1., dtype=torch.float64)
tensor(1., dtype=torch.float64)
tensor(1., dtype=torch.float64)
tensor(1., dtype=torch.float64)
tensor(1., dtype=torch.float64)
'''

Softmax expects logits, so we can take our cosine similarity output and generate logits from it. Probably there is a better way, though.
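
For what it's worth, here is a rough, untested sketch of how one might plug such a transform into the loss via the similarity_fct argument (with a small epsilon so a similarity of exactly -1 doesn't become log(0); the model name is just a placeholder):

import torch
from sentence_transformers import SentenceTransformer, losses, util

def log_cos_sim(a, b, eps=1e-6):
    # map cosine similarities from [-1, 1] to roughly (-inf, log(2)]
    # before they are handed to the cross-entropy / softmax step
    return torch.log(util.cos_sim(a, b) + 1 + eps)

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
loss = losses.MultipleNegativesRankingLoss(model, scale=1.0, similarity_fct=log_cos_sim)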
