Dataset Format for MNRL Loss with Multiple Negatives #3153

yildize · 2025-01-04T13:50:18Z

Hello there,

I have read the v3 training overview, but something is still unclear to me. I would be glad if you could explain what I am missing.

I understand this new version accepts Dataset objects as the training dataset.

If we have a single negative for each example (e.g. each example is in the form (anchor, pos, neg)), we can easily convert it to a Dataset object with three columns, right?

But what if we have multiple negatives? What is the proper input format in that case? Especially when the number of negatives varies between examples, e.g. one example has two negatives, (query, pos, neg1, neg2), and another has three, (query, pos, neg1, neg2, neg3).

Thanks in advance :)

tomaarsen · 2025-01-06T18:09:52Z

Hello!

Good question! To use multiple negatives, you have to add more columns to the dataset. In Sentence Transformers, the order of the columns is what matters most. So, for a loss that supports multiple negatives (see the Loss Overview), the first column will be the anchor, the second column a positive, and all subsequent columns are negatives. This is an example of a dataset with 20 negatives that works out of the box: https://huggingface.co/datasets/sentence-transformers/hotpotqa/viewer/triplet-20. This one in particular was used to train https://huggingface.co/BAAI/bge-m3.

Tom Aarsen
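
Editor's sketch of the column layout described above, not code from the thread. It turns fixed-length (anchor, positive, negative, ...) tuples into ordered columns. The helper name `rows_to_columns`, the column names, and the sample sentences are illustrative assumptions; per the reply above, it is the column order that carries the meaning.

```python
def rows_to_columns(rows):
    """Turn (anchor, positive, neg_1, ..., neg_k) tuples into ordered columns.

    Hypothetical helper for illustration; it is not part of Sentence Transformers.
    """
    num_negatives = len(rows[0]) - 2
    if any(len(row) != num_negatives + 2 for row in rows):
        raise ValueError("every row needs the same number of negatives")
    # Column order: anchor first, positive second, negatives after that.
    names = ["anchor", "positive"] + [
        f"negative_{i + 1}" for i in range(num_negatives)
    ]
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


rows = [
    (
        "What is the capital of France?",
        "Paris is the capital of France.",
        "Berlin is the capital of Germany.",
        "France shares a border with Spain.",
    ),
    (
        "Who wrote Hamlet?",
        "Hamlet was written by William Shakespeare.",
        "Macbeth is set in Scotland.",
        "Christopher Marlowe wrote Doctor Faustus.",
    ),
]

columns = rows_to_columns(rows)

# Assuming the `datasets` library is installed, the dict can be wrapped as:
#   from datasets import Dataset
#   train_dataset = Dataset.from_dict(columns)
```

The resulting dict keeps its keys in insertion order, so the anchor, positive, and negative columns stay in the positions the loss expects.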

yildize · 2025-01-07T05:29:50Z

Thanks Tom :)

In the provided example we have 1 anchor, 1 positive and 20 negatives for each row, right? The other thing I was trying to ask is: what if we have a different number of negatives for each row?

For example, one row with 1 anchor, 1 positive and 10 negatives, and another row with 1 anchor, 1 positive and 5 negatives. What would be your suggested approach for such a dataset?

tomaarsen · 2025-01-07T08:02:03Z

> In the provided example we have 1 anchor, 1 positive and 20 negatives for each row right?

Indeed, exactly 20 for each row.

I'm afraid you have to use a fixed number of negatives for each row. The reasoning is that we like to fit all embeddings into large tensors behind the scenes, which is not possible if some rows have a different number of values.

Tom Aarsen
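
Editor's sketch of one possible way to meet the fixed-count requirement above; this workaround is not proposed in the thread. It assumes each raw row is an (anchor, positive, list_of_negatives) tuple, drops rows with fewer than `k` negatives, and randomly samples `k` negatives from the rest. The helper name `fix_negative_count` and the parameters `k` and `seed` are hypothetical.

```python
import random


def fix_negative_count(rows, k, seed=0):
    """Return rows of the form (anchor, positive, neg_1, ..., neg_k).

    Hypothetical helper: rows with fewer than k negatives are dropped,
    and rows with more than k are randomly subsampled without replacement.
    """
    rng = random.Random(seed)  # seeded so the sampling is reproducible
    fixed = []
    for anchor, positive, negatives in rows:
        if len(negatives) < k:
            continue  # not enough negatives to fill k columns
        fixed.append((anchor, positive, *rng.sample(negatives, k)))
    return fixed


raw_rows = [
    ("query A", "positive A", ["neg A1", "neg A2", "neg A3", "neg A4"]),
    ("query B", "positive B", ["neg B1", "neg B2"]),
    ("query C", "positive C", ["neg C1"]),  # dropped when k=2
]

fixed_rows = fix_negative_count(raw_rows, k=2)
```

Every surviving row then has the same length and can be laid out as one column per position. Choosing `k` is a trade-off: a larger `k` keeps more negatives per row but discards more rows.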
