Question about MultipleNegativesRankingLoss and cosine similarity #3179

lmj0415 opened this issue Jan 19, 2025 · 3 comments


lmj0415 commented Jan 19, 2025

I wondered how reasonable it is to use the MultipleNegativesRankingLoss with cosine similarity and cross-entropy loss.

As I understand it, even in a perfect scenario where the model embeds the negative examples exactly opposite to the positive one, there is still a considerable loss, since cosine similarity is bounded by [-1, 1].

Here is an example:

import torch

sim = torch.nn.CosineSimilarity(dim=1)
ce_loss = torch.nn.CrossEntropyLoss()

# anchor embedding
a = torch.randn(1, 768, requires_grad=True)

# candidates: the anchor itself (the positive) and two perfectly opposite negatives
c = torch.cat([a.detach(), a.detach() * -1, a.detach() * -1])
label = torch.tensor(0)

scores = sim(a, c)             # cosine similarities: [1, -1, -1]
loss = ce_loss(scores, label)
print(loss)                    # ~0.2395 despite the "perfect" embeddings
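
For reference (if my math is right), with unscaled cosine scores the loss here can never drop below log(1 + 2·e⁻²):

import math

# cross-entropy with scores of exactly [1, -1, -1] and no scaling:
# CE = -log(e^1 / (e^1 + 2 * e^-1)) = log(1 + 2 * e^-2)
print(math.log(1 + 2 * math.exp(-2)))  # ~0.2395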

I wonder whether this hurts the model's convergence and whether a different loss function would take the upper and lower bounds of the cosine similarity better into account.

I am still new to this, so please correct me if I am wrong with my example.

Thanks.

EDIT: Is that why you use the scale argument?

tomaarsen (Collaborator) commented:

Hello!

The expectation is not that the model is good at a loss of 0 and bad at higher losses. We just need to ensure that lowering the loss corresponds well with getting a model that works better for our use cases, which is the case here.
For example, you could add +100 to the loss, and the model would train identically to how it would without it.
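
A quick way to convince yourself of that (a minimal, self-contained check):

import torch

x = torch.randn(4, requires_grad=True)

# gradient of a toy loss with a +100 offset
((x ** 2).sum() + 100).backward()
grad_with_offset = x.grad.clone()

# gradient of the same loss without the offset
x.grad = None
(x ** 2).sum().backward()

print(torch.allclose(grad_with_offset, x.grad))  # True: the constant changes nothing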

In short, I don't believe this hurts the model's convergence, but I will admit that I'm not an expert.

As for the scale - this parameter is the inverse of the temperature that you might see elsewhere (i.e. scale = 1/temperature), and it roughly corresponds to:

  • A higher scale in MNRL/InfoNCE/in-batch negatives loss should result in higher focus on the positive example.
  • A lower scale in MNRL/InfoNCE/in-batch negatives loss should result in a more general distribution over the positive and negative examples.

At least, that is my understanding. It is relatively common to pick a scale between 20 and 50, I believe. See also #3054.
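
In code that is just the scale argument on the loss (a sketch; the model name is only a placeholder):

from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
# the scale multiplies the similarity scores before the cross-entropy / softmax step
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)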

I hope this helps a little bit.

  • Tom Aarsen


lmj0415 commented Jan 22, 2025

Thanks for the answer, Tom.

Here is a small addition to my previous code:

import torch

sim = torch.nn.CosineSimilarity(dim=1)
ce_loss = torch.nn.CrossEntropyLoss()
lr = 1e+10  # deliberately huge learning rate

# anchor
a = torch.randn(1, 768, requires_grad=True)
_a = a.detach()

# candidates
c = torch.cat([_a, _a * -1, _a * -1])
label = torch.tensor(0)

# MultipleNegativesRankingLoss
scores = sim(a, c)
print(scores)
loss = ce_loss(scores, label)
print(loss)

# backward
loss.backward()
a = a - lr * a.grad

# similarity to candidates after learning step
print(sim(a, c))

# OUTPUT
'''
tensor([ 1.0000, -1.0000, -1.0000], grad_fn=<SumBackward1>)
tensor(0.2395, grad_fn=<NllLossBackward0>)
tensor([ 0.9258, -0.9258, -0.9258], grad_fn=<SumBackward1>)
'''

You can see that following the gradient leads to a worse result. I strongly suspect this is due to rounding errors and an obnoxiously large lr, but I still think there is a point here. My intuition is that treating cosine similarity like a log probability is not appropriate. Cosine similarity is a measure of angle and is thus bounded by [-1, 1]. In a setting where we observe a similarity of -1, the probability assigned to that candidate should be 0. With the softmax function, however, this is not the case.

# scores = cosine similarities [1.0, -1.0, -1.0]
# what softmax does with these raw scores
print(torch.softmax(scores, dim=0))

# OUTPUT
'''
tensor([0.7870, 0.1065, 0.1065], grad_fn=<SoftmaxBackward0>)
'''

The scale parameter (besides the properties explained in #3054) helps push the output of the softmax function closer to what it should be.

print(torch.softmax(scores * 10, dim=0))
print(torch.softmax(scores * 20, dim=0))
print(torch.softmax(scores * 30, dim=0))

# OUTPUT
'''
tensor([1.0000e+00, 2.0611e-09, 2.0611e-09], grad_fn=<SoftmaxBackward0>)
tensor([1.0000e+00, 4.2483e-18, 4.2483e-18], grad_fn=<SoftmaxBackward0>)
tensor([1.0000e+00, 8.7564e-27, 8.7564e-27], grad_fn=<SoftmaxBackward0>)
'''

NOTE: The misalignment between softmax and cosine similarity gets worse, by the way, as the number of classes grows. In the following example, I print the probability of the perfect positive sample (cos sim = 1) while adding 0, 10, 20, ..., 90 perfect negative examples (cos sim = -1).

for no_cls in range(0, 100, 10):
    a = torch.tensor([1] + [-1]* no_cls, dtype=float)
    print(torch.softmax(a, dim=0)[0])

# OUTPUT
'''
tensor(1., dtype=torch.float64)
tensor(0.4249, dtype=torch.float64)
tensor(0.2698, dtype=torch.float64)
tensor(0.1976, dtype=torch.float64)
tensor(0.1559, dtype=torch.float64)
tensor(0.1288, dtype=torch.float64)
tensor(0.1096, dtype=torch.float64)
tensor(0.0955, dtype=torch.float64)
tensor(0.0846, dtype=torch.float64)
tensor(0.0759, dtype=torch.float64)
'''
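
(Repeating the same toy experiment with a scale of 20 applied suggests the scale does counteract this effect, at least here: every printed value comes out as 1.0 to four decimal places.)

for no_cls in range(0, 100, 10):
    a = torch.tensor([1] + [-1] * no_cls, dtype=float)
    # same setup as above, but scaled by 20 before the softmax
    print(torch.softmax(a * 20, dim=0)[0])  # ~1.0 for every class count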

As I mentioned earlier, I am constructing edge-case examples here, and I am not sure how much this misalignment affects real-world training. Nonetheless, cosine similarities are not logits and probably shouldn't be treated as such. Looking forward to your opinion on this.


lmj0415 commented Jan 22, 2025

Just had a look at the math again. Not my strong suit, but this should work to mitigate the problem:

for no_cls in range(0, 100, 10):
    a = torch.tensor([1] + [-1] * no_cls, dtype=float)
    # map [-1, 1] to (-inf, log(2)]: a perfect negative becomes -inf,
    # which softmax turns into a probability of exactly 0
    norm_a = torch.log(a + 1)
    print(torch.softmax(norm_a, dim=0)[0])

# OUTPUT
'''
tensor(1., dtype=torch.float64)
tensor(1., dtype=torch.float64)
tensor(1., dtype=torch.float64)
tensor(1., dtype=torch.float64)
tensor(1., dtype=torch.float64)
tensor(1., dtype=torch.float64)
tensor(1., dtype=torch.float64)
tensor(1., dtype=torch.float64)
tensor(1., dtype=torch.float64)
tensor(1., dtype=torch.float64)
'''

Softmax expects logits, so we can take our cosine similarity output and generate logits from it. Probably there is a better way, though.
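
For what it's worth, here is a rough, untested sketch of how one might plug such a transform into the loss via the similarity_fct argument (with a small epsilon so a similarity of exactly -1 doesn't become log(0); the model name is just a placeholder):

import torch
from sentence_transformers import SentenceTransformer, losses, util

def log_cos_sim(a, b, eps=1e-6):
    # map cosine similarities from [-1, 1] to roughly (-inf, log(2)]
    # before they are handed to the cross-entropy / softmax step
    return torch.log(util.cos_sim(a, b) + 1 + eps)

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model
loss = losses.MultipleNegativesRankingLoss(model, scale=1.0, similarity_fct=log_cos_sim)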
