Dataset Format for MNRL Loss with Multiple Negatives #3153

Open
yildize opened this issue Jan 4, 2025 · 3 comments

Comments


yildize commented Jan 4, 2025

Hello there,

I have read the v3 training overview, but something is unclear to me. I would be glad if you could explain what it is that I am missing.

I understand this new version accepts Dataset objects as the training dataset.

If we have a single negative for each example (e.g. each example is in the form (anchor, pos, neg)), we can easily convert it to a Dataset object with three columns, right?

But what if we have multiple negatives? What would be the proper input format in that case? Especially when the number of negatives varies between examples, e.g. one example has two negatives, (query, pos, neg1, neg2), and another has three, (query, pos, neg1, neg2, neg3).

Thanks in advance :)

@tomaarsen
Collaborator

Hello!

Good question! To use multiple negatives, you have to add more columns to the dataset. In Sentence Transformers, the order of the columns is what matters most. So, to use a loss that supports multiple negatives, e.g. as shown in the Loss Overview:
[image: excerpt from the Loss Overview]

Then the first column will be the anchor, the second column a positive, and all subsequent columns are negatives. This is an example of a dataset with 20 negatives that works out of the box: https://huggingface.co/datasets/sentence-transformers/hotpotqa/viewer/triplet-20. This one in particular was used to train https://huggingface.co/BAAI/bge-m3.
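As a rough illustration of that column layout, here is a minimal sketch that turns fixed-length tuples into ordered columns. The helper names `to_columns` and `build_dataset` and the example sentences are made up for this sketch; the column names mirror the linked dataset, but per the explanation above it is the order that matters. It assumes the Hugging Face `datasets` package is installed if you call `build_dataset`.

```python
def to_columns(rows):
    """Turn (anchor, positive, neg_1, ..., neg_k) tuples into ordered columns.

    Hypothetical helper: every row must carry the same number of negatives.
    """
    num_negatives = len(rows[0]) - 2
    if any(len(row) != num_negatives + 2 for row in rows):
        raise ValueError("every row needs the same number of negatives")
    # Column order: anchor first, positive second, negatives afterwards.
    names = ["anchor", "positive"] + [
        f"negative_{i}" for i in range(1, num_negatives + 1)
    ]
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def build_dataset(rows):
    """Wrap the columns in a Dataset (requires the `datasets` package)."""
    from datasets import Dataset

    # Dataset.from_dict keeps the key order of the dictionary.
    return Dataset.from_dict(to_columns(rows))


rows = [
    (
        "What is the capital of France?",
        "Paris is the capital of France.",
        "Berlin is the capital of Germany.",
        "Madrid is the capital of Spain.",
    ),
    (
        "Who wrote Hamlet?",
        "Hamlet was written by William Shakespeare.",
        "Don Quixote was written by Miguel de Cervantes.",
        "Faust was written by Goethe.",
    ),
]
columns = to_columns(rows)
print(list(columns))
```

The result of `build_dataset(rows)` could then be passed as the training dataset together with a loss that accepts multiple negatives.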

  • Tom Aarsen

Author

yildize commented Jan 7, 2025

Thanks Tom :)

In the provided example we have 1 anchor, 1 positive, and 20 negatives for each row, right? What I was also trying to ask is: what if we have a different number of negatives for each row?

For example, one row with 1 anchor, 1 positive, and 10 negatives, and another row with 1 anchor, 1 positive, and 5 negatives. What approach would you suggest for such a dataset?

@tomaarsen
Collaborator

> In the provided example we have 1 anchor, 1 positive, and 20 negatives for each row, right?

Indeed, exactly 20 for each row.

I'm afraid you have to use a fixed number of negatives for each row. The reason is that we fit all embeddings into large tensors behind the scenes, which is not possible if rows have differing numbers of values.
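One possible way to get to a fixed count from ragged data is sketched below. This is only an assumption about a reasonable preprocessing step, not something the library does for you: the hypothetical helper `fix_negative_count` truncates rows that have more than `k` negatives and either drops shorter rows or, optionally, pads them by repeating their own negatives. Whether repeated negatives are acceptable for your training run is something you would need to judge yourself.

```python
from itertools import cycle, islice


def fix_negative_count(rows, k, pad_short_rows=False):
    """Return rows with exactly k negatives each (hypothetical helper).

    rows: iterable of (anchor, positive, neg_1, ..., neg_n) tuples, n may vary.
    """
    fixed = []
    for anchor, positive, *negatives in rows:
        if len(negatives) >= k:
            # Too many negatives: keep only the first k.
            negatives = negatives[:k]
        elif pad_short_rows and negatives:
            # Too few negatives: repeat the existing ones until there are k.
            negatives = list(islice(cycle(negatives), k))
        else:
            # Too few negatives and no padding requested: drop the row.
            continue
        fixed.append((anchor, positive, *negatives))
    return fixed


rows = [
    ("q1", "p1", "a", "b", "c"),  # three negatives
    ("q2", "p2", "d"),            # one negative
]
truncated = fix_negative_count(rows, k=2)
padded = fix_negative_count(rows, k=2, pad_short_rows=True)
print(truncated)
print(padded)
```

After this step every row has the same length, so it can be converted into a dataset with one column per negative.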

  • Tom Aarsen
