Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v2] fix contriever (add similarity_fn_name to ST wrapper) #1749

Merged
merged 7 commits into from
Jan 11, 2025

Conversation

Samoed
Copy link
Collaborator

@Samoed Samoed commented Jan 10, 2025

Closes #1731

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.

@Samoed Samoed changed the title add dotwrapper fix contriever (add dotwrapper) Jan 10, 2025
@Samoed
Copy link
Collaborator Author

Samoed commented Jan 10, 2025

@isaac-chung Model loading gives

System.IO.IOException: No space left on device : '/home/runner/runners/2.321.0/_diag/Worker_20250110-152956-utc.log' ...

@sam-hey
Copy link
Contributor

sam-hey commented Jan 10, 2025

Just as an idea: I believe the ModelMeta object already has this information similarity_fn_name="cosine". Why not add an Enum type to specify the evaluation function? This would make it much more straightforward to add new models. There's no need to implement a new wrapper class—just one Enum pointing to the fn

class EvaluationFunction(Enum):
    DOT_PRODUCT = dot_product
    MAX_SIM = max_similarity
    COSINE = cosine_similarity

@Samoed
Copy link
Collaborator Author

Samoed commented Jan 10, 2025

Yes, I wanted to integrate ModelMeta with the wrapper somehow, but for now, they’re independent. I’m not sure if it’s a good idea to pass arguments from ModelMeta to the loader. Thank you for ideas!

@sam-hey
Copy link
Contributor

sam-hey commented Jan 10, 2025

image

Please take a look at the screenshot. The MTEB ModelMeta is already available.

https://github.com/embeddings-benchmark/mteb/blob/v2.0.0/mteb/evaluation/evaluators/model_classes.py#L324:L326

Comment on lines 131 to 133
class SentenceTransformerWrapperDotSimilarity(SentenceTransformerWrapper):
def similarity(self, embedding1: np.ndarray, embedding2: np.ndarray) -> float:
return dot_distance(embedding1, embedding2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it possible to have the model self define their similarity function? Any reason for us to create a custom wrapper?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(if this is a one-of I would just move it next to the special case)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Model that trained with sentence transformers have ability to provide it, but this model doesn't have config for this as was discussed in #1731. I think there can be more models with different similarity functions. Maybe we can use approach from #1759

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we tried defining a Model Meta for the contriever model with the dot similarity function? I thought that was the same way we implemented Colbert with @sam-hey but with max-sim.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For colbert models created different wrapper with max-sim

Copy link
Contributor

@sam-hey sam-hey Jan 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

          > it's a bit surprising that ModelMeta.similarity_fn_name isn't being utilized.

We would love to switch to that one and would encourage a PR for this. It was only recently added to haven't been integrated throughout.

Originally posted by @KennethEnevoldsen in #1592 (comment)


I believe ModelMeta was never fully utilized. Instead of reverting to the previous version—which I agree was a better approach—we might be able to resolve these issues by properly implementing ModelMeta now.

@isaac-chung's comment :

I'd like to understand why, and perhaps convince us to move/revert back to what we had before, as to me, it worked better than needing to define separate wrappers per scoring function, which isn't scalable.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the approach so that the similarity function can now be directly passed to the wrapper.

Previously, users had to specify which similarity function to use, which was a hidden feature and independent of the models.

evaluation.run(
    model,
    score_function="max_sim",
)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’ve updated it again, and now it uses a string similarity_fn_name like it stated in ModelMeta. I suggest leaving it as is for now and integrate ModelMeta parameters in a separate PR, as we’re currently duplicating model names and revisions for every model.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks all. I felt that we could have improved docs for the previous 'hidden' feature, rather than removing it.

I agree that we can improve the ModelMeta param integration in a separate PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The previous implementation required users to explicitly pass the scoring function, which could lead to issues with results that were hard to reproduce or match. This new approach reduces user input.

@Samoed Samoed changed the title fix contriever (add dotwrapper) [v2] fix contriever (add dotwrapper) Jan 11, 2025
@isaac-chung isaac-chung changed the title [v2] fix contriever (add dotwrapper) [v2] fix contriever (add similarity_fn_name to ST wrapper) Jan 11, 2025
Copy link
Collaborator

@isaac-chung isaac-chung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for iterating! Feel free to skip the model loading test while I work on a fix.

Perhaps we could repurpose #1759 a bit now for the model meta integration.

@Samoed Samoed merged commit 2b41cb4 into v2.0.0 Jan 11, 2025
10 of 11 checks passed
@Samoed Samoed deleted the fix_contriever branch January 11, 2025 22:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Non Cosine-Sim Similarity functions for Sentence Transformer models are broken in v2.0.0
4 participants