feat: Evaluate missing splits #1525
Conversation
Squashed commits:
- implement partial evaluation for missing splits
- lint
- requested changes done from scratch
- test for missing split evaluation added
- uncomment test
- lint
- avoid circular import
- use TaskResult
- skip tests for now

Co-authored-by: Isaac Chung <[email protected]>
Overall looks great!
mteb/evaluation/MTEB.py
merged_results = TaskResult(
    dataset_revision=existing_results.dataset_revision,
    task_name=existing_results.task_name,
    mteb_version=existing_results.mteb_version,
What should we do if the existing result and the new result have different versions?
One solution is to extend results only if the versions match.
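As a concrete reading of that suggestion, a minimal sketch of the guard — all names here (`can_extend`, the version strings) are illustrative, not taken from the PR:

```py
from __future__ import annotations

def can_extend(existing_version: str | None, new_version: str | None) -> bool:
    """Extend a stored result only when both runs used the same mteb version."""
    return existing_version is not None and existing_version == new_version

print(can_extend("1.20.6", "1.20.6"))  # True -> safe to merge the missing splits
print(can_extend("1.10.0", "1.20.6"))  # False -> re-run the task from scratch
```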
> extend results only if the versions match

This sounds like a natural line in the sand for now, at least for key versions where the results objects are drastically different (e.g. pre-1.11.0). I can open an improvement issue to handle results from different versions?

[edit]: Hmm, doesn't TaskResult.from_disk handle version differences already? There are methods like _convert_from_before_v1_11_0 and checks for pre_v_12_48.
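If that's right, version normalization is a load-time concern rather than a merge-time one. A hedged usage sketch — the import path and the results path below are assumptions, not taken from the PR:

```py
from pathlib import Path

from mteb import TaskResult  # exact import path may vary across mteb versions

# from_disk converts older on-disk formats (e.g. pre-1.11.0) while loading,
# so by the time we merge, existing_results already follows the current schema.
existing_results = TaskResult.from_disk(Path("results/my-model/rev/STS22.json"))
print(existing_results.mteb_version)
```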
Then maybe use the version from new_results? Because it will be dumped in the format of the running version.
Sure, will do.
Separately, I don't think this currently takes differences in dataset version into consideration. I think this is an existing gap: we only check whether the same model + model revision has been run, but we don't check the dataset version. We should probably address it here before merging. wdyt?
Most datasets are downloaded by revision, so I don't think we need more checks.
Ok, then this should be good to merge once the tests pass.
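Putting the thread's resolution together — keep existing splits, add the newly evaluated ones, and stamp the merged object with the running version. A minimal, self-contained sketch that models results as plain dicts rather than TaskResult; every name and value in it is illustrative:

```py
def merge_results(existing: dict, new: dict) -> dict:
    """Fold newly evaluated splits into an existing result.

    Existing splits are kept; splits present only in the new run are added.
    The merged result takes the *running* version (per the review discussion),
    since that is the format it will be dumped in.
    """
    merged_scores = {**existing["scores"], **new["scores"]}
    return {
        "dataset_revision": existing["dataset_revision"],
        "task_name": existing["task_name"],
        "mteb_version": new["mteb_version"],  # version of the running evaluation
        "scores": merged_scores,
    }

existing = {
    "dataset_revision": "abc123",
    "task_name": "STS22",
    "mteb_version": "1.19.0",
    "scores": {"test": [{"main_score": 0.61}]},
}
new = {
    "dataset_revision": "abc123",
    "task_name": "STS22",
    "mteb_version": "1.21.0",
    "scores": {"validation": [{"main_score": 0.63}]},
}
print(merge_results(existing, new)["scores"].keys())
# dict_keys(['test', 'validation'])
```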
I think it is good to merge. It might be nice to define a "deprecation version", e.g. before 1.11.0, so we can slowly phase out old results as needed. However, this is probably something for a more general discussion before any implementation.
Looks great, only a few minor things.
Thanks @Samoed and @KennethEnevoldsen for reviewing, and @thivyanth for the initial iteration! Merging now.
12-16 * [mieb] Add blip and blip2 models, and ImageNetDog15Clustering task (#1226) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add results forclip on ImageNet10Clustering and ImageNetDog15Clustering * [mieb] add 3 compositionality evaluation tasks (#1229) * linting & update unavailable dataset path * add aro visual relation&attribution; sugarcrepe * correct reference * add SOPI2IRetrieval dataset/task (#1232) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add SOPI2IRetrieval * add results forclip on ImageNet10Clustering and ImageNetDog15Clustering * add SOPI2IRetrieval results for clip 32 * add results for clip vit 32/SOPI2IRetrieval * resolve conflict * change reference * Image text pair cls (#1233) * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * fix meta data * fix validate points --------- Co-authored-by: Isaac Chung <[email protected]> * Add RP2kI2IRetrieval and METI2IRetrieval (#1239) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add SOPI2IRetrieval * add results forclip on ImageNet10Clustering and ImageNetDog15Clustering * add SOPI2IRetrieval results for clip 32 * add results for clip vit 32/SOPI2IRetrieval * resolve conflict * add RP2kI2IRetrieval dataset * add RP2kI2IRetrieval results with clip-vit-base-patch32 * update image retrieval __init__.py * add RP2kI2IRetrieval and METI2IRetrieval * add METI2IRetreival * add SOP results * make lign * new revision for METI2IRetrieval * make lint * reset corpus chunk size * remove wrong classification import * add Flickr30k T2I and I2T * add Flickr30k T2I retriebal * reduced-size MET revision * fix: add Flickr30k T2I * make lint * [MIEB] Adding DataComp CLIP models (#1283) * adding data comp CLIP models * update model and caltech101 results * make lint * [mieb] Any2TextMultipleChoice Abstask&Evaluator & four tasks in CV-bench (#1287) * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * fix meta data * fix validate points * CV-Bench * evaluator args comment * fix --------- Co-authored-by: Isaac Chung <[email protected]> * [mieb] 
adding 10 tasks (#1290) * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * add vidore benchmark 10 tasks * fix reference * fix old metadata * fix meta * [mieb] Adding MOCOv3 models (#1293) * add moco models first try * add as a timm model * add large model results * make lint * [mieb] Add more Any2AnyRetrieval datasets (#1285) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add SOPI2IRetrieval * add results forclip on ImageNet10Clustering and ImageNetDog15Clustering * add SOPI2IRetrieval results for clip 32 * add results for clip vit 32/SOPI2IRetrieval * resolve conflict * add RP2kI2IRetrieval dataset * add RP2kI2IRetrieval results with clip-vit-base-patch32 * update image retrieval __init__.py * add RP2kI2IRetrieval and METI2IRetrieval * add METI2IRetreival * add SOP results * make lign * new revision for METI2IRetrieval * make lint * reset corpus chunk size * remove wrong classification import * add Flickr30k T2I and I2T * add Flickr30k T2I retriebal * reduced-size MET revision * fix: add Flickr30k T2I * make lint * add two landmark datasets and results * add Sketchy i2i retrieval * add task metadata * add BLINKIT2IRetrieval dataset * add BLINKIT2TRetrieval * add ImageCoDeT2IRetrieval * make lint * add vizwiz retrieval and results * fix vizwiz duplicate texts * add new vizwiz results * add VQA2 results * add GLD v2 I2T retrieval * add gld v2 i2i retrieval * make lint * remove GLDv2I2IRetrieval * [mieb] Add any2any multiple choice evaluator and abstask (and one task) (#1301) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add SOPI2IRetrieval * add results forclip on ImageNet10Clustering and ImageNetDog15Clustering * add SOPI2IRetrieval results for clip 32 * add results for clip vit 32/SOPI2IRetrieval * resolve conflict * add RP2kI2IRetrieval dataset * add RP2kI2IRetrieval results with clip-vit-base-patch32 * update image retrieval __init__.py * add RP2kI2IRetrieval and METI2IRetrieval * add METI2IRetreival * add SOP results * make lign * new revision for METI2IRetrieval * make lint * reset corpus chunk size * remove wrong classification import * add Flickr30k T2I and I2T * add Flickr30k T2I retriebal * reduced-size MET revision * fix: add Flickr30k T2I * make lint * add two landmark datasets and results * add Sketchy i2i retrieval * add task metadata * add BLINKIT2IRetrieval dataset * add BLINKIT2TRetrieval * add ImageCoDeT2IRetrieval * make lint * add vizwiz retrieval and results * fix vizwiz duplicate texts * add new vizwiz results * add VQA2 results * add GLD v2 I2T retrieval * add gld v2 i2i 
retrieval * make lint * add AbsTaskAny2AnyMultiChoice * make lint * remove GLDv2I2IRetrieval * exclude AbsTaskAny2AnyMultiChoice from test_load_data * [mieb] Fix FORB dataset (#1306) * correct format * update results * add more results * add more results * [mieb] run tasks fix (#1302) * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * fix e5v&vista * task type fix for running tasks * fix wrong meta * run mieb script * script * lint * align * [mieb] split RParisI2IRetrieval and ROxfordI2IRetrieval into easy, medium and hard versions (#1305) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add SOPI2IRetrieval * add results forclip on ImageNet10Clustering and ImageNetDog15Clustering * add SOPI2IRetrieval results for clip 32 * add results for clip vit 32/SOPI2IRetrieval * resolve conflict * add RP2kI2IRetrieval dataset * add RP2kI2IRetrieval results with clip-vit-base-patch32 * update image retrieval __init__.py * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * add RP2kI2IRetrieval and METI2IRetrieval * add METI2IRetreival * add SOP results * make lign * new revision for METI2IRetrieval * make lint * reset corpus chunk size * remove wrong classification import * add Flickr30k T2I and I2T * add Flickr30k T2I retriebal * reduced-size MET revision * fix: add Flickr30k T2I * make lint * add two landmark datasets and results * add Sketchy i2i retrieval * add task metadata * add BLINKIT2IRetrieval dataset * add BLINKIT2TRetrieval * add ImageCoDeT2IRetrieval * make lint * add vizwiz retrieval and results * fix vizwiz duplicate texts * add new vizwiz results * add VQA2 results * add GLD v2 I2T retrieval * add gld v2 i2i retrieval * make lint * add AbsTaskAny2AnyMultiChoice * make lint * remove GLDv2I2IRetrieval * exclude AbsTaskAny2AnyMultiChoice from test_load_data * fix e5v&vista * remove duplicate corpus entries from BLINKIT2TRetreival dataset * task type fix for running tasks * update BLINKIT2T metadata * fix wrong meta * run mieb script * split ROxford, RParis into easy, medium and hard * make lint --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] run tasks small fix (#1310) * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * fix e5v&vista * task type fix for running tasks * fix wrong meta * run mieb script * script * lint * align * fix * linting * [mieb] Add VLM2vec (#1323) * wip vlm2vec model * making i2t classification work wit Calteh101 * test vlm2vec on other task types * move peft into class * feat: Merge main into MIEB (#1329) * fix: OpenAI BadRequestError by limiting input dimensions to 2048 elem… (#1203) * fix: OpenAI BadRequestError by limiting input dimensions to 2048 elements (#1201) Fix OpenAI BadRequestError by limiting input dimensions to 2048 elements - Ensure the 'sentences' list passed to OpenAI API does not exceed 2048 elements - Reference: OpenAI's Embedding API documentation on input limits Co-authored-by: Ali Shiraee <[email protected]> * fix ruff 
formatting * Added minor test fixes to ensure reproducibility across systems * Ensure that tmp.json is not created within repo when running tests * format * fixes path issues * Rerun CI --------- Co-authored-by: HSILA <[email protected]> Co-authored-by: Ali Shiraee <[email protected]> * fix: Ensure STS pearson and spearman do not use the p-value, only the correlation (#1207) Fixes #1206 * 1.14.16 Automatically generated by python-semantic-release * fix: Normalize licenses including casing, uses of "-" etc. * fix: Normalize licenses including casing, uses of "-" etc. (#1210) * fix: Normalize licenses including casing, uses of "-" etc. * fix tests * 1.14.17 Automatically generated by python-semantic-release * fix: Normalize benchmarks to only include task objects and added getter for benchmarks (#1208) * Normalize benchmarks to only include tasks - Force benchmarks to only include tasks. This fixes a few bugs where benchmarks can reference a task which is not implemented - implements `mteb.get_benchmark`, which makes it easier to fetch benchmarks - Added tests + updated docs A few outstanding issues: I would like `mteb.MTEB(benchmark)` to always reproduce the benchmark. Currently this is not possible as MTEB(eng) requires the split to be specified. A solution is to allow "eval_splits" to be specified when initializing a task and then pass it on to `load_data()`. This way we can write the following: `mteb.get_tasks(tasks=[...], eval_splits=["test"], ...)` I would also love the aggregation to be a part of the benchmark (such that it is clear how it should be aggregated). This is especially relevant for MTEB(eng) as it averages the CQAD datasets before creating the global average. This way we can also create a result object for the benchmark itself. A complementing solution for this is to allow nested benchmarks. * fix error in tests * format * Added corrections based on review * added example and formatted * 1.14.18 Automatically generated by python-semantic-release * docs: Fix broken links in docs (#1212) * Added fixes for broken links in adding_a_dataset and adding_a_model docs. * Updated link name * Mismatch of the category of AmazonPolarityClassification (#1220) Fixes #1219 * Update tasks table * fix: Ensure that results are returned even when hitting cache (#1215) Fixes #1122 * 1.14.19 Automatically generated by python-semantic-release * fix: Allow benchmark to specify eval_splits (#1217) * fix: Allow benchmark to specify eval_splits This PR allows benchmarks to specify specific eval splits. This allows us to fully specify a benchmark within the benchmark object. To do this it adds the following: - added eval_splits to the Abstask object, which defaults to metadata.eval_splits - use the task.eval_splits unless overwritten in mteb.MTEB.run - added eval_splits arg to mteb.get_tasks, which filters the tasks based on splits - updated documentation - renamed the "Advanced Usage" to "Usage Documentation" to make it more accessible - added tests where relevant * Added correction based on feedback * 1.14.20 Automatically generated by python-semantic-release * Update points table * Update points table * docs: clarify adding a model (#1222) * fix: Add RepLLaMA style models (#1223) * init commit * working and reproducing * lint * update hashes * warning * add pyproject * Update points table * 1.14.21 Automatically generated by python-semantic-release * docs: Update points (#1228) * Fix case * Fix casing * Fix case * Fix case * Create 971.jsonl * Update contrib * Add contributors * Update points table * docs: Add MTEB(code) dataset (#1237) * docs: Add MTEB(code) dataset * Fix linting * Update points table * Update of my affiliation (#1242) Update points.md * Add contributor (#1243) * fix: @mrshu's name in `points.md` (#1246) * Use the diacritic character to be in line with Slovak spelling. Signed-off-by: mr.Shu <[email protected]> * docs: Create benchmarks overview table (#1245) * fix get_benchmarks method * add create benchmark script * make lint * 1.14.22 Automatically generated by python-semantic-release * docs: Update affiliation (#1247) Update points.md * Added author-information * Add final author list * Update points table * docs: Added coordination point for Jimmy Lee (#1253) docs: Added coordination point for Jimmy Lee for his work on the coordination of Crystina and Nandan * Update points table * fix: Add multilingual Benchmark (#1252) * fix: Add multilingual bench * Update mteb/benchmarks/benchmarks.py Co-authored-by: Niklas Muennighoff <[email protected]> * format --------- Co-authored-by: Niklas Muennighoff <[email protected]> * 1.14.23 Automatically generated by python-semantic-release * docs: Small point changes & more contributors (#1254) * Update points.md * Fix format * Fix attribution * Update points table * fix: Downsample large retrieval datasets (#1236) * most tasks * lint * fix other issues * refactor * lint and docs * add polish * keep case sensitive mteb paths * add potential points * fix points * fix test about metadata * update tasks and stats * lint * Update points table * Update tasks table * 1.14.24 Automatically generated by python-semantic-release * fix: Get meta from CrossEncoder (#1255) * remove indent after return * handle cross encoders for model meta * make lint * update filename since we now have model name * 1.14.25 Automatically generated by python-semantic-release * fix: Add listing all available benchmarks CLI option (#1256) * add benchmarks.md in README * add cli option * add benchmark cli test case * correct typo * 1.14.26 Automatically generated by python-semantic-release * docs: Update affiliation (#1248) * Update points.md * Update points.md --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * docs: Update mteb(eng) calculation (#1258) * Update mteb(eng) calculation * Fixed citations * Update MTEB(eng) + MTEB(multilingual) * feat: leverage SentenceTransformers' query/passage specific prompts (#1221) * feat: leverage SentenceTransformer models' query/passage specific prompts * refactor: remove E5Wrapper fix: wrong e5 revisions * fix: default prompt_type to None * fix: e4ce987 revision no longer exists for
multilingual-e5-small on the Hub * fix: keep `prompt_name` in kwargs when model doesn't have a `prompts` attr * feat: use Enum for `prompt_type` * docs: specify how to use prompts with Sentence Transformers * feat: readd arctic models due to metadata * 1.15.0 Automatically generated by python-semantic-release * fix: Add Touche2020v3 and JMTEB (#1262) * add datasets * fix metrics * add Touche2020v3 * fix metadata * Apply suggestions from code review Co-authored-by: Kenneth Enevoldsen <[email protected]> * upd name and supress * add benchmark class --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * Update tasks table * 1.15.1 Automatically generated by python-semantic-release * fix: Select benchmarks CLI option (#1261) * add test case for a list of Benchmarks * add selecting benchmarks CLI option * typos * use a separate attribute for benchmarks * try fixing tests * should accept string as well * revert filename change * use Benchmark and avoid circular import * fix: derive `results_directory` path from `results_repo` name (#1275) fix: don't hardcode repo name when downloading results * 1.15.2 Automatically generated by python-semantic-release * fix: sorting benchmark tasks by MTEB, then alphabetical (#1271) * sorted * fixed formatting * efficiency changes * fix test * make lint --------- Co-authored-by: Isaac Chung <[email protected]> * 1.15.3 Automatically generated by python-semantic-release * ci: Removed 3.8 dependency (#1281) Changes include: - remove 3.8 from tests (added 3.11 and 3.12) - changed other CI to 3.9 - updated lint rules to use 3.8 * Update points table * fix: Allow Numpy >=2.0 (#1264) Allow Numpy >=2.0 * 1.15.4 Automatically generated by python-semantic-release * docs: points for paper writing (#1286) * Create 1004.jsonl * Create 1006.jsonl * Update docs/mmteb/points/1004.jsonl * Update docs/mmteb/points/1006.jsonl --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * Update points table * Update points table * Update points table * docs: Fix a link in the README (#1289) * Fix a link in the README And fix some typos. 
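
Two of the changes merged above compose naturally: split filtering via `eval_splits` (#1217) and SentenceTransformers query/passage specific prompts (#1221). A minimal sketch of a combined run, assuming the `mteb` and `sentence-transformers` APIs at the time; the task name and prompt strings are illustrative:

```python
import mteb
from sentence_transformers import SentenceTransformer

# E5-style models expect "query: "/"passage: " prefixes; the prompts dict keys
# line up with mteb's query/passage prompt types.
model = SentenceTransformer(
    "intfloat/multilingual-e5-small",
    prompts={"query": "query: ", "passage": "passage: "},
)

# Keep only the test split; task.eval_splits is used unless overridden at run time.
tasks = mteb.get_tasks(tasks=["NFCorpus"], eval_splits=["test"])
results = mteb.MTEB(tasks=tasks).run(model)
```

Passing the splits at task-selection time keeps the benchmark fully specified by its object, which is the motivation given in #1208 above.
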
* Update README.md * Update points table * fix: Update benchmarks (#1288) * make benchmark var name uppercase * update touche to v3 * add MIRACLRetrievalHardNegatives to multilingual * add mteb(indic) * add eu benchmark * 1.15.5 Automatically generated by python-semantic-release * fix: Allow numpy<2.0.0 (#1291) * 1.15.6 Automatically generated by python-semantic-release * fix: Add metadata dict to QBQTC in C-MTEB (#1292) * fix QBQTC in C-MTEB * make lint --------- Co-authored-by: Isaac Chung <[email protected]> * 1.15.7 Automatically generated by python-semantic-release * fix: Remove non-existent eval split of CMNLI (#1294) fix eval_splits of CMNLI * 1.15.8 Automatically generated by python-semantic-release * Leaderboard (#1235) * Add leaderboard dev * Renamed MTEBResults to TaskResult * Moved model and model meta loading utilities into overview.py * Added get_model_metas to retrieve filtered metadata for models * Restructured results object and made it into a class instead of a dict * Added utilities for filtering models on BenchmarkResults objects * Added to_table utility function to BenchmarkResults * Added serialization utilities to BenchmarkResults * Attempted fixing tests * Added get_model_metas to __init__ * Added get_benchmarks to __init__ and made it return all benchmarks by default * Added get_benchmarks to __init__ * Made tasks hashable * Added task filtering based on task objects on BenchmarkResults * Added BenchmarkResults to __init__ * Added additional arguments to get_scores on two classes * Made get_scores smarter on BenchmarkResult * Added basic multilingual benchmark * Modified benchmark to be able to easily access results * Added useful properties and filtering functions to BenchmarkResults * Added minimal functioning example * Added smarter table, task-list updating and tried fixing dropdown scrolling * Made restrict_results into a private function Co-authored-by: Kenneth Enevoldsen <[email protected]> * Removed old leaderboard scripts * Hardcoded max and min model size * Removed redundant utils file * Ran linting * added leaderboard dependencies as optional * Fixed union type error on Python 3.9 * Removed references to Dict in task aggregation * Fixed name errors in _restrict_task_results * Fixed _restrict_task_results * Made hf_subsets={'default'} when the task is monolingual in _restric_task_results * Task dropdown now gets filtered based on the other criteria * Ran linting again * Introduced hotfix for reranking test * Added BenchmarkResults to __all__ in __init__ * Fixed validate_and_filter_scores method, and replaced _restric_task_results with it --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * feat: Use prompts instead of encode_corpus and encode_queries (#1278) * add prompt per task type * fix prompt * upd test * lint * fix test * fix DeprecatedSummarizationEvaluator * fix prompts * add test * lint * logger info * use task type only in model_encode * lint * update interface * add prompt types to docs * fix test * mock tasks * mock task registry * remove last task_type * fix tests * lint * fix test * fix * use wrapper and new prompts * fix tests * lint * fix test * remove conftest * validate task to prompt_name * override model prompts * task to prompt name optional * fix tests * fix models * remove task_to_prompt_name * remove from mteb __init__ * update docs * load existing model prompts if model_prompts is None * fix * lint * change wrapper loader * add wrapper class * lint * add wrapper file * update logging * upd logging * refactor 
reranking * lint * remove prints * 1.16.0 Automatically generated by python-semantic-release * fix: Add Retrieval SK Quad dataset for Slovak search evaluation (#1276) * Add Retrieval SK Quad dataset for Slovak search evaluation This commit introduces the Retrieval SK Quad dataset, designed to assess Slovak search performance. The dataset is derived from SK-QuAD and includes questions with their best answers categorized post-annotation. This addition provides a significant resource for advancing Slovak language search evaluation and supporting further research and development. * Add Retrieval SK Quad dataset for Slovak search evaluation 2 Added the requested changes on the SKQuadRetrieval.py file * add task to init * add missing task metadata --------- Co-authored-by: Isaac Chung <[email protected]> * Update tasks table * 1.16.1 Automatically generated by python-semantic-release * fix: Add Slovak Hate Speech and Offensive Language Dataset (#1274) * Add Slovak Hate Speech and Offensive Language Dataset This commit introduces the Slovak Hate Speech and Offensive Language Database to MTEB. The dataset includes posts from a social network, annotated by humans for hate speech and offensive content. Additionally, the corresponding task has been added to the tasks.md table to reflect this update. * Add Slovak Hate Speech and Offensive Language Dataset - Updated __init__.py to include the new SlovakHateSpeechClassification task. - Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability. * Did requested changes: - Updated __init__.py to include the new SlovakHateSpeechClassification task. - Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability. * resolve linting issues by running `make lint` * Update tasks table * WIP: Leaderboard UI improvements (#1312) * Fixed typos in task_results * Fixed typos in task_results * Added Tailwind, reorganized layout and fixed scrolling * Ran linting * 1.16.2 Automatically generated by python-semantic-release * fix: remove duplicate multilingual * 1.16.3 Automatically generated by python-semantic-release * fix: Re-upload dataset to hub to avoid using script upload (#1322) * fix dataset upload * add linting * Update tasks table * 1.16.4 Automatically generated by python-semantic-release * fix: Add implementations of common reranker models (#1309) * init * revert * revert * add metadata * lint * add reqs * change to float16 * benchmark lint fix * 1.16.5 Automatically generated by python-semantic-release * Add multilingual mFollowIR dataset (#1308) * add mFollowIR * paper name * edit warning->info * convert to parquet * lint * Update tasks table * Cache the embeddings when requested (#1307) * add caching * update test to use close * change from json to pkl * fix for window * cleanup on Windows again * infer dimension * move cachewrapper * add wrapper * fix * updates * fix tests * fix lint * lint * add test * WIP: Leaderboard UI improvements (#1320) * Fixed typos in task_results * Fixed typos in task_results * Added Tailwind, reorganized layout and fixed scrolling * Ran linting * Removed faux benchmark * Updated layout * Changed table number format * Table highlights highest values by making them bold * Added rank to table, removed organization from model_name * Added mean rank to table * Ran linting * feat: Update metadata for all models (#1316) * Added model meta * format * fixed metadata * Metadata update for voyage models * Update mteb/models/cohere_models.py 
Co-authored-by: Roman Solomatin <[email protected]> * Update mteb/models/cohere_models.py Co-authored-by: Roman Solomatin <[email protected]> * Added corrections from review * fix spelling error --------- Co-authored-by: Roman Solomatin <[email protected]> * resolved bugs from pytest --collect-only * Avoid wrapping all models with the SentenceTransformerWrapper * Added normalize_embeddings_to_numpy to ensure standard embeddings during evaluations * fixed moved on correction from @Samoed * conditionally set .predict method on SentenceTransformerWrapper --------- Signed-off-by: mr.Shu <[email protected]> Co-authored-by: HSILA <[email protected]> Co-authored-by: Ali Shiraee <[email protected]> Co-authored-by: github-actions <[email protected]> Co-authored-by: Thomas van Dongen <[email protected]> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Niklas Muennighoff <[email protected]> Co-authored-by: Orion Weller <[email protected]> Co-authored-by: John Yang <[email protected]> Co-authored-by: Imene Kerboua <[email protected]> Co-authored-by: Marek Šuppa <[email protected]> Co-authored-by: Isaac Chung <[email protected]> Co-authored-by: Xa9aX ツ <[email protected]> Co-authored-by: Roman Solomatin <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> Co-authored-by: Sathvik Nallamalli <[email protected]> Co-authored-by: Michael Graczyk <[email protected]> Co-authored-by: Mariya Hendriksen <[email protected]> Co-authored-by: Santiago Castro <[email protected]> Co-authored-by: Joey Xia <[email protected]> Co-authored-by: Márton Kardos <[email protected]> Co-authored-by: Oliver <[email protected]> * [mieb] Add OpenCLIP models (#1335) * add open clip models * Update __init__.py * lint * fix model overview * update jina clip --------- Co-authored-by: chenghao xiao <[email protected]> Co-authored-by: gowitheflow-1998 <[email protected]> Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] new version with downsampled train split to 32 per class (#1327) * new version with downsampled train split to 32 per class * force load truncated image file * make lint * add open clip models * Update __init__.py * lint * fix model overview * fix ImageCLS undersample; run birdsnap * make lint * make lint --------- Co-authored-by: chenghao xiao <[email protected]> Co-authored-by: gowitheflow-1998 <[email protected]> Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] Fix Jina CLIP (#1349) fix jina clip v1 * fix: Add clevr license (#1356) * Add BLINK as multi-choice tasks (#1348) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add SOPI2IRetrieval * add results forclip on ImageNet10Clustering and ImageNetDog15Clustering * add SOPI2IRetrieval results for clip 32 * add results for clip vit 32/SOPI2IRetrieval * resolve conflict * add RP2kI2IRetrieval dataset * add RP2kI2IRetrieval results with clip-vit-base-patch32 * update image retrieval __init__.py * fix ImageTextPair dataloading for large datasets; 
more compositionality evaluation datasets * add RP2kI2IRetrieval and METI2IRetrieval * add METI2IRetreival * add SOP results * make lign * new revision for METI2IRetrieval * make lint * reset corpus chunk size * remove wrong classification import * add Flickr30k T2I and I2T * add Flickr30k T2I retriebal * reduced-size MET revision * fix: add Flickr30k T2I * make lint * add two landmark datasets and results * add Sketchy i2i retrieval * add task metadata * add BLINKIT2IRetrieval dataset * add BLINKIT2TRetrieval * add ImageCoDeT2IRetrieval * make lint * add vizwiz retrieval and results * fix vizwiz duplicate texts * add new vizwiz results * add VQA2 results * add GLD v2 I2T retrieval * add gld v2 i2i retrieval * make lint * add AbsTaskAny2AnyMultiChoice * make lint * remove GLDv2I2IRetrieval * exclude AbsTaskAny2AnyMultiChoice from test_load_data * fix e5v&vista * remove duplicate corpus entries from BLINKIT2TRetreival dataset * task type fix for running tasks * update BLINKIT2T metadata * fix wrong meta * run mieb script * split ROxford, RParis into easy, medium and hard * make lint * add BLINK as multi choice tasks * fix: license metadata in wrong format --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] add Eva CLIP models (#1369) * add Eva CLIP models * make lint * [mieb] add siglip, cohere multimodal & some fixes for final run (#1357) * fix dataset type error * fix clustering metrics * add siglip & cohere * update mieb run script * cohere-v import * fix * api key name * [mieb] fixes for final run (#1374) * e5_v device arg * dataloader num_workers * vista doc * vista doc * run mieb * fix * Update run_vista.md * [mieb] Fix torch no grad (#1378) Fix torch no grad * [mieb] Fix vlm2vec (#1380) * fix vlm2vec return dtype * make lint * [mieb] Remove null entries from corpus of ROxford, RParis (#1371) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add SOPI2IRetrieval * add results forclip on ImageNet10Clustering and ImageNetDog15Clustering * add SOPI2IRetrieval results for clip 32 * add results for clip vit 32/SOPI2IRetrieval * resolve conflict * add RP2kI2IRetrieval dataset * add RP2kI2IRetrieval results with clip-vit-base-patch32 * update image retrieval __init__.py * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * add RP2kI2IRetrieval and METI2IRetrieval * add METI2IRetreival * add SOP results * make lign * new revision for METI2IRetrieval * make lint * reset corpus chunk size * remove wrong classification import * add Flickr30k T2I and I2T * add Flickr30k T2I retriebal * reduced-size MET revision * fix: add Flickr30k T2I * make lint * add two landmark datasets and results * add Sketchy i2i retrieval * add task metadata * add BLINKIT2IRetrieval dataset * add BLINKIT2TRetrieval * add ImageCoDeT2IRetrieval * make lint * add vizwiz retrieval and results * fix vizwiz duplicate texts * add new vizwiz results * add VQA2 results * add GLD v2 I2T retrieval * add gld v2 i2i retrieval * make lint * add AbsTaskAny2AnyMultiChoice * make lint * remove 
GLDv2I2IRetrieval * exclude AbsTaskAny2AnyMultiChoice from test_load_data * fix e5v&vista * remove duplicate corpus entries from BLINKIT2TRetreival dataset * task type fix for running tasks * update BLINKIT2T metadata * fix wrong meta * run mieb script * split ROxford, RParis into easy, medium and hard * make lint * add BLINK as multi choice tasks * fix: license metadata in wrong format * remove null examples from corpus of ROxford and RParis --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] fixes (#1390) * Fix torch no grad * simplify * make lint --------- Co-authored-by: Isaac Chung <[email protected]> * [MIEB] Remove non-existent method for blip (#1394) remove non-existent method for blip * [mieb] fix ALIGN; update Winoground revision id; update run script (#1391) * fix align & winoground * lint * Convert task category to i2i for tasks that only calls image encode * update categories should include img cls, clustering, and multi label clf * no op * no op * make lint --------- Co-authored-by: Isaac Chung <[email protected]> * [mieb] Fix open clip for cv bench count (#1397) fix shape mismatch * [mieb] Update subtasks of BLINKIT2TMultiChoice and BLINKIT2IMultiChoice (#1403) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add SOPI2IRetrieval * add results forclip on ImageNet10Clustering and ImageNetDog15Clustering * add SOPI2IRetrieval results for clip 32 * add results for clip vit 32/SOPI2IRetrieval * resolve conflict * add RP2kI2IRetrieval dataset * add RP2kI2IRetrieval results with clip-vit-base-patch32 * update image retrieval __init__.py * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * add RP2kI2IRetrieval and METI2IRetrieval * add METI2IRetreival * add SOP results * make lign * new revision for METI2IRetrieval * make lint * reset corpus chunk size * remove wrong classification import * add Flickr30k T2I and I2T * add Flickr30k T2I retriebal * reduced-size MET revision * fix: add Flickr30k T2I * make lint * add two landmark datasets and results * add Sketchy i2i retrieval * add task metadata * add BLINKIT2IRetrieval dataset * add BLINKIT2TRetrieval * add ImageCoDeT2IRetrieval * make lint * add vizwiz retrieval and results * fix vizwiz duplicate texts * add new vizwiz results * add VQA2 results * add GLD v2 I2T retrieval * add gld v2 i2i retrieval * make lint * add AbsTaskAny2AnyMultiChoice * make lint * remove GLDv2I2IRetrieval * exclude AbsTaskAny2AnyMultiChoice from test_load_data * fix e5v&vista * remove duplicate corpus entries from BLINKIT2TRetreival dataset * task type fix for running tasks * update BLINKIT2T metadata * fix wrong meta * run mieb script * split ROxford, RParis into easy, medium and hard * make lint * add BLINK as multi choice tasks * fix: license metadata in wrong format * remove null examples from corpus of ROxford and RParis * fix: add/remove subtasks from BLINKIT2IMultiChoice and BLINKIT2TMultiChoice * update blink metadata * add updated BLINK results --------- Co-authored-by: gowitheflow-1998 <[email protected]> * 
[mieb] Fix EVA CLIP for CV Bench (#1414) * unsqueeze after preprocess * make lint * [mieb] Add calculate probs for vlm2vec (#1418) * add method * make lint * [mieb] Fix siglip bug & add retrieval datasets (#1424) * fix siglip * add edis&gld-v2 i2i * results * siglip updated results * fix siglip non-dataloader tasks * [mieb] use Logistic Regression classifier for AbsTaskImageMultilabelClassification (#1420) * use moc-lr classifier * set n_experiments=5 * run dinov2 and some laion models * add dinov2-giant results * [mieb] mieb scripts (siglip rerun & linear probing ablation & params count) (#1429) * mieb scripts * lint * [MIEB] Change Flickr30k to test split (#1449) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add SOPI2IRetrieval * add results forclip on ImageNet10Clustering and ImageNetDog15Clustering * add SOPI2IRetrieval results for clip 32 * add results for clip vit 32/SOPI2IRetrieval * resolve conflict * add RP2kI2IRetrieval dataset * add RP2kI2IRetrieval results with clip-vit-base-patch32 * update image retrieval __init__.py * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * add RP2kI2IRetrieval and METI2IRetrieval * add METI2IRetreival * add SOP results * make lign * new revision for METI2IRetrieval * make lint * reset corpus chunk size * remove wrong classification import * add Flickr30k T2I and I2T * add Flickr30k T2I retriebal * reduced-size MET revision * fix: add Flickr30k T2I * make lint * add two landmark datasets and results * add Sketchy i2i retrieval * add task metadata * add BLINKIT2IRetrieval dataset * add BLINKIT2TRetrieval * add ImageCoDeT2IRetrieval * make lint * add vizwiz retrieval and results * fix vizwiz duplicate texts * add new vizwiz results * add VQA2 results * add GLD v2 I2T retrieval * add gld v2 i2i retrieval * make lint * add AbsTaskAny2AnyMultiChoice * make lint * remove GLDv2I2IRetrieval * exclude AbsTaskAny2AnyMultiChoice from test_load_data * fix e5v&vista * remove duplicate corpus entries from BLINKIT2TRetreival dataset * task type fix for running tasks * update BLINKIT2T metadata * fix wrong meta * run mieb script * split ROxford, RParis into easy, medium and hard * make lint * add BLINK as multi choice tasks * fix: license metadata in wrong format * remove null examples from corpus of ROxford and RParis * fix: add/remove subtasks from BLINKIT2IMultiChoice and BLINKIT2TMultiChoice * update blink metadata * add updated BLINK results * merge upstream mieb * change Flickr30k to test split * change flickr to test split --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] Fix VLM2vec dtype (#1462) * propagate dtype * fix fuse embeddings using list of PIL images * [mieb] run script for missing results (#1472) * task type fix * scripts * [mieb] Fix Moco model on CIFAR10Clustering (#1487) Fix Moco model on CIFAR10Clustering * [mieb] Fix Flickr30k I2T and T2I (#1505) * remake flickr30k it2 and t2i * add openai clip vit-b32 b16 and jina-clip results * make lint * [MIEB] add missing siglip models (#1533) * add udpates * 
lint errors * fix typo (#1535) * add udpates * lint errors * fix small typo * [mieb] Fix numbers of CIRR, Fashion200k, FashionIQ, Flickr30k, MSCOCO data statistics (#1544) fix numbers * Discussing a standard for ImageEncoders * Add Voyage's multimodal embedding (#1555) * add voyage multimodal & ran 17 tasks * lint * typo * clean * [mieb] update script for final re-run (#1576) * mieb final runs * lint * fix: no longer using same query text for all of BLINKIT2TMultiChoice (#1572) * fix: no longer using same query text for all of BLINKIT2TMultiChoice * fix: remove blink subtask * fix: remove subtask from blink it2i * fix: align BLINK retrieval to multi choice * add ROxford and RParis I2I multi choice * add retrieval metrics to multi choice evaluator * fix: remove wrong negatives from revisiting multichoice datasets * fix revisiting datasets * add new results for revisiting multichoice * [MIEB] Make multimodal models compatible to `task_name` and `prompt_type` (#1583) * 1. Make `get_xxx_embeddings` follow `encode`. 2. `ImageDataset.transform` could be `None`. * Apply suggestions from code review Co-authored-by: Kenneth Enevoldsen <[email protected]> * Fix arguments * Try to fix tests --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * fix image encoder (#1596) * format * fixed tests * lint * [mieb] voyage-v: add exponential backoff and other error handling (#1610) * add voyage multimodal & ran 17 tasks * lint * typo * clean * exponential backoff tmp * downsize large images for voyage api call * voyage error handling * lint * add more results * make tenacity optional * lint * log * [MIEB] Fix `get_fused_emebddings` (#1612) * Fix fused * fix vlm2vec * Fix lint * [MIEB] Add new multimodal retrieval tasks (#1611) * Add new tasks * Fix score type * [MIEB] Switch to ViDoRe BEIR version (#1607) * Fix ViDoRe corpus * fix lint * ViDoRe beir version * Extend MIEB test coverage (#1629) * add one task from each image AbsTask to test grid * add visual sts to test grid * [mieb] Task filtering by modality supported by models (#1633) * fix function signature for moco loader * filter out tasks by model modalities * correct conditions * add model meta to relevant models * use modalities instead and separate out constants * [MIEB] Fix VISTA model (#1638) Fix vista * Warn (#1639) * [mieb] model task modalities matching logic (#1640) fixing task & model modalities matching logic * [mieb] Use mock abstask classes (#1648) * rename to downsampled_dataset_transform * add mock tasks for mieb * wip getting to 57% * make lint * update mock classes to improve coverage * omit mock tasks from some tests * [MIEB] Add code for GME models (#1635) * Add GME * Fix infoseek prompts * Merge instructions * fix: add version check e5-v in mieb (#1723) * add version check for e5v model * Update e5_v.py * make lint * fix: change comparison to bigger than (#1743) change comparison to bigger than * docs: Rework MIEB docs (#1802) * combine mieb docs and move to main docs folder * make flow more coherent * tidy up * skip AfriSentiLID for now #1785 * fix typo: exclude MIEB mock tests * update vista doc * Apply suggestions from code review --------- Co-authored-by: Isaac Chung <[email protected]> * [mieb] Remove results-mieb folder (#1815) remove results-mieb folder * [mieb] fixing lrap computation for multi-label classification (#1834) multi-label cls lrap computation fix * [mieb] Merge from main (#1853) * Update tasks table * 1.19.0 Automatically generated by python-semantic-release * fix: Add the_ugly_duckling.txt for 
speedtask to Python wheel (#1402) Add the_ugly_duckling.txt for speedtask to Python wheel * 1.19.1 Automatically generated by python-semantic-release * fix: Added the necessary trust_remote_code (#1406) * 1.19.2 Automatically generated by python-semantic-release * docs: Update recommendation for pushing results (#1401) fix: Update recommendation for pushing results * docs: Fix a typo in README (#1430) Fix typo in readme * fix: add logging for RetrievalEvaluator NaN values for similarity scores (#1398) Fixes #1389 * 1.19.3 Automatically generated by python-semantic-release * fix: make samples_per_label a task attribute (#1419) make samples_per_label a task attr * fix: Add Korean AutoRAGRetrieval (#1388) * feat: add AutoRAG Korean embedding retrieval benchmark * fix: run --- 🧹 Running linters --- ruff format . # running ruff formatting 716 files left unchanged ruff check . --fix # running ruff linting All checks passed! * fix: add metadata for AutoRAGRetrieval * change link for markers_bm * add AutoRAGRetrieval to init.py and update metadata * add precise metadata * update metadata: description and license * delete descriptive_stats in AutoRAGRetrieval.py and run calculate_matadata_metrics.py * fix: Add missing benchmarks in benchmarks.py (#1431) Fixes #1423 * Update tasks table * 1.19.4 Automatically generated by python-semantic-release * Leaderboard 2.0: added performance x n_parameters plot + more benchmark info (#1437) * Added elementary speed/performance plot * Refactored table formatting code * Bumped Gradio version * Added more general info to benchmark description markdown block * Adjusted margin an range on plot * Made hover information easier to read on plot * Made range scaling dynamic in plot * Moved citation next to benchmark description * Made titles in benchmark info bold * Leaderboard: Fixed code benchmarks (#1441) * fixed code benchmarks * fix: Made n_parameters formatting smarter and more robust * fix: changed jina-embeddings-v3 number of parameters from 572K to 572M * fix: Fixed use_instuctions typo in model overview * fix: Fixed sentence-transformer compatibility switch * Ran linting * Added all languages, tasks, types and domains to options * Removed resetting options when a new benchmark is selected * All results now get displayed, but models that haven't been run on everything get nan values in the table * fix: Count unique texts, data leaks in calculate metrics (#1438) * add more stat * add more stat * update statistics * fix: update task metadata to allow for null (#1448) * Update tasks table * 1.19.5 Automatically generated by python-semantic-release * Fix: Made data parsing in the leaderboard figure more robust (#1450) Bugfixes with data parsing in main figure * Fixed task loading (#1451) * Fixed task result loading from disk * Fixed task result loading from disk * fix: publish (#1452) * 1.19.6 Automatically generated by python-semantic-release * fix: Fix load external results with `None` mteb_version (#1453) * fix * lint * 1.19.7 Automatically generated by python-semantic-release * WIP: Polishing up leaderboard UI (#1461) * fix: Removed column wrapping on the table, so that it remains readable * Added disclaimer to figure * fix: Added links to task info table, switched out license with metric * fix: loading pre 1.11.0 (#1460) * small fix * fix: fix * 1.19.8 Automatically generated by python-semantic-release * fix: swap touche2020 to maintain compatibility (#1469) swap touche2020 for parity * 1.19.9 Automatically generated by python-semantic-release * docs: Add sum 
per language for task counts (#1468) * add sum per lang * add sort by sum option * make lint * fix: pinned datasets to <3.0.0 (#1470) * 1.19.10 Automatically generated by python-semantic-release * feat: add CUREv1 retrieval dataset (#1459) * feat: add CUREv1 dataset --------- Co-authored-by: nadshe <[email protected]> Co-authored-by: olivierr42 <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> * feat: add missing domains to medical tasks * feat: modify benchmark tasks * chore: benchmark naming --------- Co-authored-by: nadshe <[email protected]> Co-authored-by: olivierr42 <[email protected]> * Update tasks table * 1.20.0 Automatically generated by python-semantic-release * fix: check if `model` attr of model exists (#1499) * check if model attr of model exists * lint * Fix retrieval evaluator * 1.20.1 Automatically generated by python-semantic-release * fix: Leaderboard demo data loading (#1507) * Made get_scores error tolerant * Added join_revisions, made get_scores failsafe * Fetching metadata fixed fr HF models * Added failsafe metadata fetching to leaderboard code * Added revision joining to leaderboard app * fix * Only show models that have metadata, when filter_models is called * Ran linting * 1.20.2 Automatically generated by python-semantic-release * fix: leaderboard only shows models that have ModelMeta (#1508) Filtering for models that have metadata * 1.20.3 Automatically generated by python-semantic-release * fix: align readme with current mteb (#1493) * align readme with current mteb * align with mieb branch * fix test * 1.20.4 Automatically generated by python-semantic-release * docs: Add lang family mapping and map to task table (#1486) * add lang family mapping and map to task table * make lint * add back some unclassified lang codes * Update tasks table * fix: Ensure that models match the names on embedding-benchmarks/results (#1519) * 1.20.5 Automatically generated by python-semantic-release * fix: Adding missing metadata on models and mathcing names up with the results repo (#1528) * Added Voyage 3 models * Added correct metadata to Cohere models and matched names with the results repo * 1.20.6 Automatically generated by python-semantic-release * feat: Evaluate missing splits (#1525) * fix: evaluate missing splits (#1268) * implement partial evaluation for missing splits * lint * requested changes done from scratch * test for missing split evaluation added * uncomment test * lint * avoid circular import * use TaskResult * skip tests for now --------- Co-authored-by: Isaac Chung <[email protected]> * got test_all_splits_evaluated passing * tests passing * address review comments * make lint * handle None cases for kg_co2_emissions * use new results info --------- Co-authored-by: Thivyanth <[email protected]> * 1.21.0 Automatically generated by python-semantic-release * fix: Correct typos superseeded -> superseded (#1532) fix typo -> superseded * 1.21.1 Automatically generated by python-semantic-release * fix: Task load data error for SICK-BR-STS and XStance (#1534) * fix task load data for two tasks * correct dataset keys * 1.21.2 Automatically generated by python-semantic-release * fix: Proprietary models now get correctly shown in leaderboard (#1530) * Fixed showing proprietary models in leaderboard * Added links to all OpenAI models * Fixed table formatting issues * Bumped Gradio version * 1.21.3 Automatically generated by python-semantic-release * docs: Add Model Meta parameters and metadata (#1536) * add multi_qa_MiniLM_L6_cos_v1 model meta 
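
The `model`-attribute check above (#1499) is a small defensive pattern: wrapper classes may expose the underlying model as `.model`, but not every encoder does. A sketch of the assumed shape:

```python
def underlying_model(model):
    """Return the wrapped model (e.g. on a SentenceTransformer wrapper) when
    present, otherwise the object itself."""
    return model.model if hasattr(model, "model") else model
```
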
* add all_mpnet_base_v2 * add parameters to model meta * make lint * add extra params to meta * fix: add more model meta (jina, e5) (#1537) * add e5 model meta * address review comments * 1.21.4 Automatically generated by python-semantic-release * Add cohere models (#1538) * fix: bug cohere names * format * fix: add nomic models (#1543) #1515 * fix: Added all-minilm-l12-v2 (#1542) #1515 * fix: Added arctic models (#1541) #1515 * fix: add sentence trimming to OpenAIWrapper (#1526) * fix: add sentence trimming to OpenAIWrapper * fix: import tiktoken library inside encode function * fix: check tokenizer library installed and update ModelMeta to pass tokenizer_name * fix: pass tokenizer_name, max_tokens to loader * fix: make tokenizer_name None for default * fix: delete changes for ModelMeta * fix: fix revision to 2 for OpenAI models * fix: add docstring for OpenAIWrapper * fix: lint * feat: add openai optional dependency set * fix: add sleep for too many requests * fix: add lint * fix: delete evaluate file * 1.21.5 Automatically generated by python-semantic-release * fix: Fixed metadata errors (#1547) * 1.21.6 Automatically generated by python-semantic-release * fix: remove curev1 from multilingual (#1552) Seems like it was added here: https://github.com/embeddings-benchmark/mteb/commit/1cc6c9e0fe62ca4e77708b641823fa1a121f048b * 1.21.7 Automatically generated by python-semantic-release * fix: Add Model2vec (#1546) * Added Model2Vec wrapper * Added Model2vec models * Added model2vec models to registry * Added model2vec as a dependency * Ran linting * Update mteb/models/model2vec_models.py Co-authored-by: Kenneth Enevoldsen <[email protected]> * Update mteb/models/model2vec_models.py Co-authored-by: Kenneth Enevoldsen <[email protected]> * Added adapted_from and superseeded_by to model2vec models.
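The OpenAIWrapper trimming entries above (#1526) come down to token-level truncation before the embedding call. A rough sketch with tiktoken; the encoding name and token limit here are assumptions, and the wrapper's real signature may differ:

```python
import tiktoken

def trim_to_max_tokens(texts: list[str], tokenizer_name: str = "cl100k_base",
                       max_tokens: int = 8191) -> list[str]:
    # Encode, cut at the model's token limit, and decode back to text so the
    # embedding API never receives an over-long input.
    enc = tiktoken.get_encoding(tokenizer_name)
    return [enc.decode(enc.encode(text)[:max_tokens]) for text in texts]
```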
* Added missing import * Moved pyproject.toml to optional dependencies * Fixed typos * Added import error and changed model to model_name * Added Numpy to frameworks * Added Numpy to frameworks * Corrected false info on model2vec models * Replaced np.inf with maxint * Update mteb/models/model2vec_models.py Co-authored-by: Isaac Chung <[email protected]> * Added option to have infinite max tokens, added it to Model2vec --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> Co-authored-by: Isaac Chung <[email protected]> * Made result loading more permissive, changed eval splits for HotPotQA and DBPedia (#1554) * Removed train and dev from eval splits on HotpotQA * Removed dev from eval splits on DBPedia * Made task_results validation more permissive * Readded exception in get_score * Ran linting * 1.21.8 Automatically generated by python-semantic-release * docs: Correction of SICK-R metadata (#1558) * Correction of SICK-R metadata * Correction of SICK-R metadata --------- Co-authored-by: rposwiata <[email protected]> * feat(google_models): fix issues and add support for `text-embedding-005` and `text-multilingual-embedding-002` (#1562) * fix: google_models batching and prompt * feat: add text-embedding-005 and text-multilingual-embedding-002 * chore: `make lint` errors * fix: address PR comments * 1.22.0 Automatically generated by python-semantic-release * fix(bm25s): search implementation (#1566) fix: bm25s implementation * 1.22.1 Automatically generated by python-semantic-release * docs: Fix dependency library name for bm25s (#1568) * fix: bm25s implementation * correct library name --------- Co-authored-by: Daniel Buades Marcos <[email protected]> * fix: Add training dataset to model meta (#1561) * fix: Add training dataset to model meta Addresses #1556 * Added docs * format * feat: (cohere_models) cohere_task_type issue, batch requests and tqdm for visualization (#1564) * feat: batch requests to cohere models * fix: use correct task_type * feat: use tqdm with openai * fix: explicitly set `show_progress_bar` to False * fix(publichealth-qa): ignore rows with `None` values in `question` or `answer` (#1565) * 1.23.0 Automatically generated by python-semantic-release * fix: Added metadata for miscellaneous models (#1557) * Added script for generating metadata, and metadata for the listed models * Added misc models to overview * Fixed misc metas * Removed unnecessary imports * Added logic to retrieve base model information * Added base models to misc meta * Added superseded_by to sentence-croissant models * Added training datasets to misc models * 1.23.1 Automatically generated by python-semantic-release * fix: Added radar chart displaying capabilities on task types (#1570) * Added radar chart displaying capabilities on task types * Fixed table aggregation in leaderboard * Spelled out why instructionretrieval is excluded * 1.23.2 Automatically generated by python-semantic-release * feat: add new arctic v2.0 models (#1574) * feat: add new arctic v2.0 models * chore: make lint * 1.24.0 Automatically generated by python-semantic-release * fix: Ad…
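The Cohere and OpenAI entries above (#1564) share one pattern: send fixed-size batches with a progress bar rather than one request per text. A generic sketch, where `client_embed` is an assumed stand-in for the provider's embedding call and the batch size is illustrative:

```python
from tqdm import tqdm

def embed_in_batches(client_embed, texts: list[str], batch_size: int = 96) -> list:
    # Fixed-size batches keep each request under the provider's input limits;
    # tqdm makes long evaluation runs visible.
    embeddings = []
    for start in tqdm(range(0, len(texts), batch_size)):
        embeddings.extend(client_embed(texts[start : start + batch_size]))
    return embeddings
```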
* mieb ZeroshotClassification * mieb docs * mieb implementation demo * model meta; abstask column names; linear probe clf * model meta; abstask column names; linear probe clf * fix: update naming as candidate_labels * Update README.md * Update README.md * i2tretrieval * test load data ignore i2tretrieval * [MIEB] Add image clustering (#1088) * make lint * wip * add TinyImageNet and run * type hints * add accuracy * lint * remove unused & fix typos * T2I Retrieval * Any2AnyRetrieval * fix tests from merge * [MIEB] Add image text pair classification and tests (#1099) * add ImageTextPairClassification abstask and evaluator * dataset transform into sequence of images for each sample * fix processing logic; list of list images compatibility * lint and docstrings * make lint * fix failing tests in TaskMetadata * add tests for mieb * skip gated repo --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [MIEB] Add image classification and zero shot classification tasks (#1101) * fix task metadata * use overrideable column names * add CIFAR datasets * add caltech101 dataset * add FGVC aircraft dataset * add food 101 dataset * add OxfordPets dataset * remove comments * correct cifar100 path * update cifar100 classification results * cifar zero shot results * add caltech101 zero shot * matching CLIP paper implementation * add aircraft and food zero shot * add oxford pets zero shot * [MIEB] Add CIFAR clustering (#1104) add CIFAR clustering * [MIEB] Add more image classification and zero shot classification datasets (#1103) * update category to i2t * add MNIST linear probe and zero shot * add FER2013 linear probe and zero shot * add stanford cars linear probe and zero shot * add birdsnap linear probe and zero shot * add eurosat linear probe and zero shot * lint * correct eurosat zero shot labels * add abstask for image multilabel and voc2007 * make lint * [MIEB] Add more image classification and zero shot datasets (#1105) * add STL10 linear probe and zero shot * add RESISC45 linear probe and zero shot * add Describable textures linear probe and zero shot * fix spacing lint * add SUN397 linear probe and zero shot * correct SUN397 zero shot captions * add baai bge vista * add e5-v * linting * memory issues for image linear probe & zeroshot * knn linear probe arguments * del comments * Add some classification and ZeroShot classification tasks (#1107) * Add Country211 classification task * Add imagenet1k classification task * Add UCF101 classification task * Add PatchCamelyon Classification task * Add GTSRB classification task * Add GTSRB Zero Shot Classification * Add country211 zero shot classification * Add results for classification tasks * Add zero shot classification tasks * Add PatchCamelyon tasks and results * Add linting * Add results and fix prompts for zero shot * Add results * Add results and linting * fix dependency & clip mock test * [MIEB] Add jina clip (#1120) * add jina clip and mscoco i2t and t2i results * make lint * [MIEB] Update `mieb` with the `main` branch and some fixes (#1126) * fix instruction retrieval (#1072) * fix instruction retrieval * fix test * add points * make nested results * add test * skip instruction test * fix instruction passes * fix unions * move do_length_ablation Co-authored-by: Kenneth Enevoldsen <[email protected]> --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * Update points table * fix: fix bug-causing spelling error in function name of e5-mistral-instruct (#1106) found bug * 1.12.85 Automatically generated by python-semantic-release * fix: MultilingualSentimentClassification (#1109) * Update points table * fix: Avoid spaces in dataset name for CQADupstack and ignore speed tasks * 1.12.86 Automatically generated by python-semantic-release * fix: Ensure that MLSUMClusteringP2P.v2 use the fast implementation as was intended (#1112) * fix: Ensure that MLSUMClusteringP2P.v2 use the fast implementation as was intended * fix: fixed formatting for cli * docs: improve searchability in the advanced usage documentation * 1.12.87 Automatically generated by python-semantic-release * docs: improve searchability in the advanced usage documentation (#1113) * docs: improve searchability in the advanced usage documentation * docs: update based on corrections * fix: export type for `mteb create_meta` (#1114) * fix export type * fix dataset version too * 1.12.88 Automatically generated by python-semantic-release * fix: Simplify models implementations (#1085) * Merge * Adapt * Simplify * Check for rev again * Rmv cmmnt * Simplify * simplify * Rmv comment Co-authored-by: Kenneth Enevoldsen <[email protected]> * Use logging; change try except; add info * Lint * Rmv results * Update rev * format * Simplify models; Allow instructions * Jobs * Fix merge * Format * Adapt models * fix: ensure that e5 ignores the NQ * format --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * 1.12.89 Automatically generated by python-semantic-release * fix: nomic models using prefix correctly (#1125) * fix: nomic models using prefix correctly * chore: remove comment * fix: handling in case not torch tensor * Fix typo --------- Co-authored-by: Niklas Muennighoff <[email protected]> * 1.12.90 Automatically generated by python-semantic-release * refactor vista model wrapper to contain lib import * python 38 type hints --------- Co-authored-by: Roman Solomatin <[email protected]> Co-authored-by: Kenneth Enevoldsen <[email protected]> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: anpalmak2003 <[email protected]> Co-authored-by: github-actions <[email protected]> Co-authored-by: Niklas Muennighoff <[email protected]> Co-authored-by: Zach Nussbaum <[email protected]> Co-authored-by: chenghao xiao <[email protected]> * image memory issues for all retrieval Abstasks * Add CLEVR and SciMMIR Image-Text Understanding tasks (#1127) * Add CLEVR and SciMMIR * Update metadata * remove useless comment * Add linting * fix typo and tests * Add CLEVR count task * add linting * add fashion200k & fashionIQ test passed * clip text max seq truncation * add WebQA, NIGHTS, OVEN * any2any retrieval chunk encoding * add nomic vision model; any2any topk bug * add cv recall * add InfoSeek; VisualNews * [MIEB] Add Stanford Cars i2i Retrieval (#1147) * wip * add results * make lint * change back the order * [MIEB] Add CUB200 i2i retrieval (#1154) * add cub200 and results * add skip_first_result * skipped self and rerun results * consolidate i2t and t2i to any2any * remove abstask and evaluators * remove references from test * add TU-Berlin sketch retrieval * XM3600; XFlickr30kCO; multilingual * wit multilingual retrieval t2i * correct multilingual t2i meta * meta * add dinov2 model; 4 sizes * cls evaluator channel bug fix * add ALIGN model * add FORBI2IRetrieval * forb & tuberlin new revision * disable tokenization parallelism * add hateful meme retrieval i2tt2i * add memotion retrieval t2ii2t * add SciMMIR Retrieval i2tt2i * ruff update * Visual STS Abstask&evaluator * add visual STS17 * add visual STS
12-16 * [mieb] Add blip and blip2 models, and ImageNetDog15Clustering task (#1226) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add results for clip on ImageNet10Clustering and ImageNetDog15Clustering * [mieb] add 3 compositionality evaluation tasks (#1229) * linting & update unavailable dataset path * add aro visual relation&attribution; sugarcrepe * correct reference * add SOPI2IRetrieval dataset/task (#1232) * add SOPI2IRetrieval * add SOPI2IRetrieval results for clip 32 * add results for clip vit 32/SOPI2IRetrieval * resolve conflict * change reference * Image text pair cls (#1233) * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * fix meta data * fix validate points --------- Co-authored-by: Isaac Chung <[email protected]> * Add RP2kI2IRetrieval and METI2IRetrieval (#1239) * add RP2kI2IRetrieval dataset * add RP2kI2IRetrieval results with clip-vit-base-patch32 * update image retrieval __init__.py * add RP2kI2IRetrieval and METI2IRetrieval * add METI2IRetrieval * add SOP results * make lint * new revision for METI2IRetrieval * make lint * reset corpus chunk size * remove wrong classification import * add Flickr30k T2I and I2T * add Flickr30k T2I retrieval * reduced-size MET revision * fix: add Flickr30k T2I * make lint * [MIEB] Adding DataComp CLIP models (#1283) * adding data comp CLIP models * update model and caltech101 results * make lint * [mieb] Any2TextMultipleChoice Abstask&Evaluator & four tasks in CV-bench (#1287) * CV-Bench * evaluator args comment * fix --------- Co-authored-by: Isaac Chung <[email protected]> * [mieb] adding 10 tasks (#1290) * add vidore benchmark 10 tasks * fix reference * fix old metadata * fix meta * [mieb] Adding MOCOv3 models (#1293) * add moco models first try * add as a timm model * add large model results * make lint * [mieb] Add more Any2AnyRetrieval datasets (#1285) * add two landmark datasets and results * add Sketchy i2i retrieval * add task metadata * add BLINKIT2IRetrieval dataset * add BLINKIT2TRetrieval * add ImageCoDeT2IRetrieval * make lint * add vizwiz retrieval and results * fix vizwiz duplicate texts * add new vizwiz results * add VQA2 results * add GLD v2 I2T retrieval * add gld v2 i2i retrieval * make lint * remove GLDv2I2IRetrieval * [mieb] Add any2any multiple choice evaluator and abstask (and one task) (#1301) * add AbsTaskAny2AnyMultiChoice * make lint * exclude AbsTaskAny2AnyMultiChoice from test_load_data * [mieb] Fix FORB dataset (#1306) * correct format * update results * add more results * add more results * [mieb] run tasks fix (#1302) * fix e5v&vista * task type fix for running tasks * fix wrong meta * run mieb script * script * lint * align * [mieb] split RParisI2IRetrieval and ROxfordI2IRetrieval into easy, medium and hard versions (#1305) * remove duplicate corpus entries from BLINKIT2TRetrieval dataset * update BLINKIT2T metadata * split ROxford, RParis into easy, medium and hard * make lint --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] run tasks small fix (#1310) * fix * linting * [mieb] Add VLM2vec (#1323) * wip vlm2vec model * making i2t classification work with Caltech101 * test vlm2vec on other task types * move peft into class * feat: Merge main into MIEB (#1329) * fix: OpenAI BadRequestError by limiting input dimensions to 2048 elem… (#1203) * fix: OpenAI BadRequestError by limiting input dimensions to 2048 elements (#1201) Fix OpenAI BadRequestError by limiting input dimensions to 2048 elements - Ensure the 'sentences' list passed to OpenAI API does not exceed 2048 elements - Reference: OpenAI's Embedding API documentation on input limits Co-authored-by: Ali Shiraee <[email protected]> * fix ruff
formatting * Added minor test fixes to ensure reproducibility across systems * Ensure that tmp.json is not created within repo when running tests * format * fixes path issues * Rerun CI --------- Co-authored-by: HSILA <[email protected]> Co-authored-by: Ali Shiraee <[email protected]> * fix: Ensure STS pearson and spearman does not use the p-value, only the correlation (#1207) Fixes #1206 * 1.14.16 Automatically generated by python-semantic-release * fix: Normalize licenses including casing, uses of "-" etc. * fix: Normalize licenses including casing, uses of "-" etc. (#1210) * fix: Normalize licenses including casing, uses of "-" etc. * fix tests * 1.14.17 Automatically generated by python-semantic-release * fix: Normalize benchmarks to only include task objects and added getter for benchmarks (#1208) * Normalize benchmarks to only include tasks - Force benchmarks to only include tasks. This fixes a few bugs where benchmarks can reference a task which is not implemented - implements `mteb.get_benchmark`, which makes it easier to fetch benchmarks - Added tests + updated docs A few outstanding issues: I would like `mteb.MTEB(benchmark)` to always reproduce the benchmark. Currently this is not possible as MTEB(eng) requires the split to be specified. A solution is to allow `eval_splits` to be specified when initializing a task and then pass it on to the `load_data()`. This way we can write the following: `mteb.get_tasks(tasks=[...], eval_splits=["test"], ...)` I would also love the aggregation to be a part of the benchmark (such that it is clear how it should be aggregated). This is especially relevant for MTEB(eng) as it averages the CQAD datasets before creating the global average. This way we can also create a result object for the benchmark itself. A complementary solution for this is to allow nested benchmarks. * fix error in tests * format * Added corrections based on review * added example and formatted * 1.14.18 Automatically generated by python-semantic-release * docs: Fix broken links in docs (#1212) * Added fixes for broken links in adding_a_dataset and adding_a_model docs. * Updated link name * Mismatch of the category of AmazonPolarityClassification (#1220) Fixes #1219 * Update tasks table * fix: Ensure that results are returned even when hitting cache (#1215) Fixes #1122 * 1.14.19 Automatically generated by python-semantic-release * fix: Allow benchmark to specify eval_splits (#1217) * fix: Allow benchmark to specify eval_splits This PR allows benchmarks to specify specific eval splits. This allows us to fully specify a benchmark within the benchmark object. To do this it adds the following: - added eval_splits to the Abstask object, which defaults to metadata.eval_splits - use the task.eval_splits unless overwritten in mteb.MTEB.run - added eval_splits arg to mteb.get_tasks, which filters the tasks based on splits - updated documentation - renamed the "Advanced Usage" to "Usage Documentation" to make it more accessible - added tests where relevant * Added correction based on feedback * 1.14.20 Automatically generated by python-semantic-release * Update points table * Update points table * docs: clarify adding a model (#1222) * fix: Add RepLLaMA style models (#1223) * init commit * working and reproducing * lint * update hashes * warning * add pyproject * Update points table * 1.14.21 Automatically generated by python-semantic-release * docs: Update points (#1228) * Fix case * Fix casing * Fix case * Fix case * Create 971.jsonl * Update contrib * Add contributors * Update points table * docs: Add MTEB(code) dataset (#1237) * docs: Add MTEB(code) dataset * Fix linting * Update points table * Update of my affiliation (#1242) Update points.md * Add contributor (#1243) * fix: @mrshu's name in `points.md` (#1246) * Use the diacritic character to be inline with Slovak spelling. Signed-off-by: mr.Shu <[email protected]> * docs: Create benchmarks overview table (#1245) * fix get_benchmarks method * add create benchmark script * make lint * 1.14.22 Automatically generated by python-semantic-release * docs: Update affiliation (#1247) Update points.md * Added author-information * Add final author list * Update points table * docs: Added coordination point for Jimmy Lee (#1253) docs: Added coordination point for Jimmy Lee for his work on the coordination of Crystina and Nandan * Update points table * fix: Add multilingual Benchmark (#1252) * fix: Add multilingual bench * Update mteb/benchmarks/benchmarks.py Co-authored-by: Niklas Muennighoff <[email protected]> * format --------- Co-authored-by: Niklas Muennighoff <[email protected]> * 1.14.23 Automatically generated by python-semantic-release * docs: Small point changes & more contributors (#1254) * Update points.md * Fix format * Fix attribution * Update points table * fix: Downsample large retrieval datasets (#1236) * most tasks * lint * fix other issues * refactor * lint and docs * add polish * keep case sensitive mteb paths * add potential points * fix points * fix test about metadata * update tasks and stats * lint * Update points table * Update tasks table * 1.14.24 Automatically generated by python-semantic-release * fix: Get meta from CrossEncoder (#1255) * remove indent after return * handle cross encoders for model meta * make lint * update filename since we now have model name * 1.14.25 Automatically generated by python-semantic-release * fix: Add listing all available benchmarks CLI option (#1256) * add benchmarks.md in README * add cli option * add benchmark cli test case * correct typo * 1.14.26 Automatically generated by python-semantic-release * docs: Update affiliation (#1248) * Update points.md * Update points.md --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * docs: Update mteb(eng) calculation (#1258) * Update mteb(eng) calculation * Fixed citations * Update MTEB(eng) + MTEB(multilingual) * feat: leverage SentenceTransformers' query/passage specific prompts (#1221) * feat: leverage SentenceTransformer models' query/passage specific prompts * refactor: remove E5Wrapper fix: wrong e5 revisions * fix: default prompt_type to None * fix: e4ce987 revision no longer exists for
multilingual-e5-small on the Hub * fix: keep `prompt_name` in kwargs when model doesn't have a `prompts` attr * feat: use Enum for `prompt_type` * docs: specify how to use prompts with Sentence Transformers * feat: readd arctic models due to metadata * 1.15.0 Automatically generated by python-semantic-release * fix: Add Touche2020v3 and JMTEB (#1262) * add datasets * fix metrics * add Touche2020v3 * fix metadata * Apply suggestions from code review Co-authored-by: Kenneth Enevoldsen <[email protected]> * upd name and suppress * add benchmark class --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * Update tasks table * 1.15.1 Automatically generated by python-semantic-release * fix: Select benchmarks CLI option (#1261) * add test case for a list of Benchmarks * add selecting benchmarks CLI option * typos * use a separate attribute for benchmarks * try fixing tests * should accept string as well * revert filename change * use Benchmark and avoid circular import * fix: derive `results_directory` path from `results_repo` name (#1275) fix: don't hardcode repo name when downloading results * 1.15.2 Automatically generated by python-semantic-release * fix: sorting benchmark tasks by MTEB, then alphabetical (#1271) * sorted * fixed formatting * efficiency changes * fix test * make lint --------- Co-authored-by: Isaac Chung <[email protected]> * 1.15.3 Automatically generated by python-semantic-release * ci: Removed 3.8 dependency (#1281) Changes include: - remove 3.8 from tests (added 3.11 and 3.12) - changed other CI to 3.9 - updated lint rules to use 3.8 * Update points table * fix: Allow Numpy >=2.0 (#1264) Allow Numpy >=2.0 * 1.15.4 Automatically generated by python-semantic-release * docs: points for paper writing (#1286) * Create 1004.jsonl * Create 1006.jsonl * Update docs/mmteb/points/1004.jsonl * Update docs/mmteb/points/1006.jsonl --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * Update points table * Update points table * Update points table * docs: Fix a link in the README (#1289) * Fix a link in the README And fix some typos.
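Following up on the benchmark and eval_splits entries above (#1208, #1217): benchmarks can be fetched by name and tasks can be filtered to specific splits, so a benchmark is fully specified by its object. A short sketch of that usage; the task and benchmark names are illustrative, and the exact API is documented in mteb itself:

```python
import mteb

# Fetch a benchmark by name instead of hand-assembling its task list (#1208).
benchmark = mteb.get_benchmark("MTEB(eng)")

# Restrict tasks to the splits a benchmark actually evaluates (#1217).
tasks = mteb.get_tasks(tasks=["Touche2020"], eval_splits=["test"])
evaluation = mteb.MTEB(tasks=tasks)
```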
* Update README.md * Update points table * fix: Update benchmarks (#1288) * make benchmark var name uppercase * update touche to v3 * add MIRACLRetrievalHardNegatives to multilingual * add mteb(indic) * add eu benchmark * 1.15.5 Automatically generated by python-semantic-release * fix: Allow numpy<2.0.0 (#1291) * 1.15.6 Automatically generated by python-semantic-release * fix: Add metadata dict to QBQTC in C-MTEB (#1292) * fix QBQTC in C-MTEB * make lint --------- Co-authored-by: Isaac Chung <[email protected]> * 1.15.7 Automatically generated by python-semantic-release * fix: Remove non-existent eval split of CMNLI (#1294) fix eval_splits of CMNLI * 1.15.8 Automatically generated by python-semantic-release * Leaderboard (#1235) * Add leaderboard dev * Renamed MTEBResults to TaskResult * Moved model and model meta loading utilities into overview.py * Added get_model_metas to retrieve filtered metadata for models * Restructured results object and made it into a class instead of a dict * Added utilities for filtering models on BenchmarkResults objects * Added to_table utility function to BenchmarkResults * Added serialization utilities to BenchmarkResults * Attempted fixing tests * Added get_model_metas to __init__ * Added get_benchmarks to __init__ and made it return all benchmarks by default * Added get_benchmarks to __init__ * Made tasks hashable * Added task filtering based on task objects on BenchmarkResults * Added BenchmarkResults to __init__ * Added additional arguments to get_scores on two classes * Made get_scores smarter on BenchmarkResult * Added basic multilingual benchmark * Modified benchmark to be able to easily access results * Added useful properties and filtering functions to BenchmarkResults * Added minimal functioning example * Added smarter table, task-list updating and tried fixing dropdown scrolling * Made restrict_results into a private function Co-authored-by: Kenneth Enevoldsen <[email protected]> * Removed old leaderboard scripts * Hardcoded max and min model size * Removed redundant utils file * Ran linting * added leaderboard dependencies as optional * Fixed union type error on Python 3.9 * Removed references to Dict in task aggregation * Fixed name errors in _restrict_task_results * Fixed _restrict_task_results * Made hf_subsets={'default'} when the task is monolingual in _restric_task_results * Task dropdown now gets filtered based on the other criteria * Ran linting again * Introduced hotfix for reranking test * Added BenchmarkResults to __all__ in __init__ * Fixed validate_and_filter_scores method, and replaced _restric_task_results with it --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * feat: Use prompts instead of encode_corpus and encode_queries (#1278) * add prompt per task type * fix prompt * upd test * lint * fix test * fix DeprecatedSummarizationEvaluator * fix prompts * add test * lint * logger info * use task type only in model_encode * lint * update interface * add prompt types to docs * fix test * mock tasks * mock task registry * remove last task_type * fix tests * lint * fix test * fix * use wrapper and new prompts * fix tests * lint * fix test * remove conftest * validate task to prompt_name * override model prompts * task to prompt name optional * fix tests * fix models * remove task_to_prompt_name * remove from mteb __init__ * update docs * load existing model prompts if model_prompts is None * fix * lint * change wrapper loader * add wrapper class * lint * add wrapper file * update logging * upd logging * refactor 
reranking * lint * remove prints * 1.16.0 Automatically generated by python-semantic-release * fix: Add Retrieval SK Quad dataset for Slovak search evaluation (#1276) * Add Retrieval SK Quad dataset for Slovak search evaluation This commit introduces the Retrieval SK Quad dataset, designed to assess Slovak search performance. The dataset is derived from SK-QuAD and includes questions with their best answers categorized post-annotation. This addition provides a significant resource for advancing Slovak language search evaluation and supporting further research and development. * Add Retrieval SK Quad dataset for Slovak search evaluation 2 Added the requested changes on the SKQuadRetrieval.py file * add task to init * add missing task metadata --------- Co-authored-by: Isaac Chung <[email protected]> * Update tasks table * 1.16.1 Automatically generated by python-semantic-release * fix: Add Slovak Hate Speech and Offensive Language Dataset (#1274) * Add Slovak Hate Speech and Offensive Language Dataset This commit introduces the Slovak Hate Speech and Offensive Language Database to MTEB. The dataset includes posts from a social network, annotated by humans for hate speech and offensive content. Additionally, the corresponding task has been added to the tasks.md table to reflect this update. * Add Slovak Hate Speech and Offensive Language Dataset - Updated __init__.py to include the new SlovakHateSpeechClassification task. - Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability. * Did requested changes: - Updated __init__.py to include the new SlovakHateSpeechClassification task. - Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability. * resolve linting issues by running `make lint` * Update tasks table * WIP: Leaderboard UI improvements (#1312) * Fixed typos in task_results * Fixed typos in task_results * Added Tailwind, reorganized layout and fixed scrolling * Ran linting * 1.16.2 Automatically generated by python-semantic-release * fix: remove duplicate multilingual * 1.16.3 Automatically generated by python-semantic-release * fix: Re-upload dataset to hub to avoid using script upload (#1322) * fix dataset upload * add linting * Update tasks table * 1.16.4 Automatically generated by python-semantic-release * fix: Add implementations of common reranker models (#1309) * init * revert * revert * add metadata * lint * add reqs * change to float16 * benchmark lint fix * 1.16.5 Automatically generated by python-semantic-release * Add multilingual mFollowIR dataset (#1308) * add mFollowIR * paper name * edit warning->info * convert to parquet * lint * Update tasks table * Cache the embeddings when requested (#1307) * add caching * update test to use close * change from json to pkl * fix for window * cleanup on Windows again * infer dimension * move cachewrapper * add wrapper * fix * updates * fix tests * fix lint * lint * add test * WIP: Leaderboard UI improvements (#1320) * Fixed typos in task_results * Fixed typos in task_results * Added Tailwind, reorganized layout and fixed scrolling * Ran linting * Removed faux benchmark * Updated layout * Changed table number format * Table highlights highest values by making them bold * Added rank to table, removed organization from model_name * Added mean rank to table * Ran linting * feat: Update metadata for all models (#1316) * Added model meta * format * fixed metadata * Metadata update for voyage models * Update mteb/models/cohere_models.py 
Co-authored-by: Roman Solomatin <[email protected]> * Update mteb/models/cohere_models.py Co-authored-by: Roman Solomatin <[email protected]> * Added corrections from review * fix spelling error --------- Co-authored-by: Roman Solomatin <[email protected]> * resolved bugs from pytest --collect-only * Avoid wrapping all models with the SentenceTransformerWrapper * Added normalize_embeddings_to_numpy to ensure standard embeddings during evaluations * fixed moved on correction from @Samoed * conditionally set .predict method on SentenceTransformerWrapper --------- Signed-off-by: mr.Shu <[email protected]> Co-authored-by: HSILA <[email protected]> Co-authored-by: Ali Shiraee <[email protected]> Co-authored-by: github-actions <[email protected]> Co-authored-by: Thomas van Dongen <[email protected]> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Niklas Muennighoff <[email protected]> Co-authored-by: Orion Weller <[email protected]> Co-authored-by: John Yang <[email protected]> Co-authored-by: Imene Kerboua <[email protected]> Co-authored-by: Marek Šuppa <[email protected]> Co-authored-by: Isaac Chung <[email protected]> Co-authored-by: Xa9aX ツ <[email protected]> Co-authored-by: Roman Solomatin <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> Co-authored-by: Sathvik Nallamalli <[email protected]> Co-authored-by: Michael Graczyk <[email protected]> Co-authored-by: Mariya Hendriksen <[email protected]> Co-authored-by: Santiago Castro <[email protected]> Co-authored-by: Joey Xia <[email protected]> Co-authored-by: Márton Kardos <[email protected]> Co-authored-by: Oliver <[email protected]> * [mieb] Add OpenCLIP models (#1335) * add open clip models * Update __init__.py * lint * fix model overview * update jina clip --------- Co-authored-by: chenghao xiao <[email protected]> Co-authored-by: gowitheflow-1998 <[email protected]> Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] new version with downsampled train split to 32 per class (#1327) * new version with downsampled train split to 32 per class * force load truncated image file * make lint * fix ImageCLS undersample; run birdsnap * make lint * make lint --------- Co-authored-by: chenghao xiao <[email protected]> Co-authored-by: gowitheflow-1998 <[email protected]> Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] Fix Jina CLIP (#1349) fix jina clip v1 * fix: Add clevr license (#1356) * Add BLINK as multi-choice tasks (#1348) * add BLINK as multi choice tasks * fix: license metadata in wrong format --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] add Eva CLIP models (#1369) * add Eva CLIP models * make lint * [mieb] add siglip, cohere multimodal & some fixes for final run (#1357) * fix dataset type error * fix clustering metrics * add siglip & cohere * update mieb run script * cohere-v import * fix * api key name * [mieb] fixes for final run (#1374) * e5_v device arg * dataloader num_workers * vista doc * vista doc * run mieb * fix * Update run_vista.md * [mieb] Fix torch no grad (#1378) Fix torch no grad * [mieb] Fix vlm2vec (#1380) * fix vlm2vec return dtype * make lint * [mieb] Remove null entries from corpus of ROxford, RParis (#1371) * remove null examples from corpus of ROxford and RParis --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] fixes (#1390) * Fix torch no grad * simplify * make lint --------- Co-authored-by: Isaac Chung <[email protected]> * [MIEB] Remove non-existent method for blip (#1394) remove non-existent method for blip * [mieb] fix ALIGN; update Winoground revision id; update run script (#1391) * fix align & winoground * lint * Convert task category to i2i for tasks that only call image encode * update categories should include img cls, clustering, and multi label clf * no op * no op * make lint --------- Co-authored-by: Isaac Chung <[email protected]> * [mieb] Fix open clip for cv bench count (#1397) fix shape mismatch * [mieb] Update subtasks of BLINKIT2TMultiChoice and BLINKIT2IMultiChoice (#1403) * fix: add/remove subtasks from BLINKIT2IMultiChoice and BLINKIT2TMultiChoice * update blink metadata * add updated BLINK results --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] Fix EVA CLIP for CV Bench (#1414) * unsqueeze after preprocess * make lint * [mieb] Add calculate probs for vlm2vec (#1418) * add method * make lint * [mieb] Fix siglip bug & add retrieval datasets (#1424) * fix siglip * add edis&gld-v2 i2i * results * siglip updated results * fix siglip non-dataloader tasks * [mieb] use Logistic Regression classifier for AbsTaskImageMultilabelClassification (#1420) * use moc-lr classifier * set n_experiments=5 * run dinov2 and some laion models * add dinov2-giant results * [mieb] mieb scripts (siglip rerun & linear probing ablation & params count) (#1429) * mieb scripts * lint * [MIEB] Change Flickr30k to test split (#1449) * merge upstream mieb * change Flickr30k to test split * change flickr to test split --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] Fix VLM2vec dtype (#1462) * propagate dtype * fix fuse embeddings using list of PIL images * [mieb] run script for missing results (#1472) * task type fix * scripts * [mieb] Fix Moco model on CIFAR10Clustering (#1487) Fix Moco model on CIFAR10Clustering * [mieb] Fix Flickr30k I2T and T2I (#1505) * remake flickr30k i2t and t2i * add openai clip vit-b32 b16 and jina-clip results * make lint * [MIEB] add missing siglip models (#1533) * add updates *
lint errors * fix typo (#1535) * add udpates * lint errors * fix small typo * [mieb] Fix numbers of CIRR, Fashion200k, FashionIQ, Flickr30k, MSCOCO data statistics (#1544) fix numbers * Discussing a standard for ImageEncoders * Add Voyage's multimodal embedding (#1555) * add voyage multimodal & ran 17 tasks * lint * typo * clean * [mieb] update script for final re-run (#1576) * mieb final runs * lint * fix: no longer using same query text for all of BLINKIT2TMultiChoice (#1572) * fix: no longer using same query text for all of BLINKIT2TMultiChoice * fix: remove blink subtask * fix: remove subtask from blink it2i * fix: align BLINK retrieval to multi choice * add ROxford and RParis I2I multi choice * add retrieval metrics to multi choice evaluator * fix: remove wrong negatives from revisiting multichoice datasets * fix revisiting datasets * add new results for revisiting multichoice * [MIEB] Make multimodal models compatible to `task_name` and `prompt_type` (#1583) * 1. Make `get_xxx_embeddings` follow `encode`. 2. `ImageDataset.transform` could be `None`. * Apply suggestions from code review Co-authored-by: Kenneth Enevoldsen <[email protected]> * Fix arguments * Try to fix tests --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * fix image encoder (#1596) * format * fixed tests * lint * [mieb] voyage-v: add exponential backoff and other error handling (#1610) * add voyage multimodal & ran 17 tasks * lint * typo * clean * exponential backoff tmp * downsize large images for voyage api call * voyage error handling * lint * add more results * make tenacity optional * lint * log * [MIEB] Fix `get_fused_emebddings` (#1612) * Fix fused * fix vlm2vec * Fix lint * [MIEB] Add new multimodal retrieval tasks (#1611) * Add new tasks * Fix score type * [MIEB] Switch to ViDoRe BEIR version (#1607) * Fix ViDoRe corpus * fix lint * ViDoRe beir version * Extend MIEB test coverage (#1629) * add one task from each image AbsTask to test grid * add visual sts to test grid * [mieb] Task filtering by modality supported by models (#1633) * fix function signature for moco loader * filter out tasks by model modalities * correct conditions * add model meta to relevant models * use modalities instead and separate out constants * [MIEB] Fix VISTA model (#1638) Fix vista * Warn (#1639) * [mieb] model task modalities matching logic (#1640) fixing task & model modalities matching logic * [mieb] Use mock abstask classes (#1648) * rename to downsampled_dataset_transform * add mock tasks for mieb * wip getting to 57% * make lint * update mock classes to improve coverage * omit mock tasks from some tests * [MIEB] Add code for GME models (#1635) * Add GME * Fix infoseek prompts * Merge instructions * fix: add version check e5-v in mieb (#1723) * add version check for e5v model * Update e5_v.py * make lint * fix: change comparison to bigger than (#1743) change comparison to bigger than * docs: Rework MIEB docs (#1802) * combine mieb docs and move to main docs folder * make flow more coherent * tidy up * skip AfriSentiLID for now #1785 * fix typo: exclude MIEB mock tests * update vista doc * Apply suggestions from code review --------- Co-authored-by: Isaac Chung <[email protected]> * [mieb] Remove results-mieb folder (#1815) remove results-mieb folder * [mieb] fixing lrap computation for multi-label classification (#1834) multi-label cls lrap computation fix * [mieb] Merge from main (#1853) * Update tasks table * 1.19.0 Automatically generated by python-semantic-release * fix: Add the_ugly_duckling.txt for 
speedtask to Python wheel (#1402) Add the_ugly_duckling.txt for speedtask to Python wheel * 1.19.1 Automatically generated by python-semantic-release * fix: Added the necessary trust_remote_code (#1406) * 1.19.2 Automatically generated by python-semantic-release * docs: Update recommendation for pushing results (#1401) fix: Update recommendation for pushing results * docs: Fix a typo in README (#1430) Fix typo in readme * fix: add logging for RetrievalEvaluator NaN values for similarity scores (#1398) Fixes #1389 * 1.19.3 Automatically generated by python-semantic-release * fix: make samples_per_label a task attribute (#1419) make samples_per_label a task attr * fix: Add Korean AutoRAGRetrieval (#1388) * feat: add AutoRAG Korean embedding retrieval benchmark * fix: run --- 🧹 Running linters --- ruff format . # running ruff formatting 716 files left unchanged ruff check . --fix # running ruff linting All checks passed! * fix: add metadata for AutoRAGRetrieval * change link for markers_bm * add AutoRAGRetrieval to init.py and update metadata * add precise metadata * update metadata: description and license * delete descriptive_stats in AutoRAGRetrieval.py and run calculate_matadata_metrics.py * fix: Add missing benchmarks in benchmarks.py (#1431) Fixes #1423 * Update tasks table * 1.19.4 Automatically generated by python-semantic-release * Leaderboard 2.0: added performance x n_parameters plot + more benchmark info (#1437) * Added elementary speed/performance plot * Refactored table formatting code * Bumped Gradio version * Added more general info to benchmark description markdown block * Adjusted margin an range on plot * Made hover information easier to read on plot * Made range scaling dynamic in plot * Moved citation next to benchmark description * Made titles in benchmark info bold * Leaderboard: Fixed code benchmarks (#1441) * fixed code benchmarks * fix: Made n_parameters formatting smarter and more robust * fix: changed jina-embeddings-v3 number of parameters from 572K to 572M * fix: Fixed use_instuctions typo in model overview * fix: Fixed sentence-transformer compatibility switch * Ran linting * Added all languages, tasks, types and domains to options * Removed resetting options when a new benchmark is selected * All results now get displayed, but models that haven't been run on everything get nan values in the table * fix: Count unique texts, data leaks in calculate metrics (#1438) * add more stat * add more stat * update statistics * fix: update task metadata to allow for null (#1448) * Update tasks table * 1.19.5 Automatically generated by python-semantic-release * Fix: Made data parsing in the leaderboard figure more robust (#1450) Bugfixes with data parsing in main figure * Fixed task loading (#1451) * Fixed task result loading from disk * Fixed task result loading from disk * fix: publish (#1452) * 1.19.6 Automatically generated by python-semantic-release * fix: Fix load external results with `None` mteb_version (#1453) * fix * lint * 1.19.7 Automatically generated by python-semantic-release * WIP: Polishing up leaderboard UI (#1461) * fix: Removed column wrapping on the table, so that it remains readable * Added disclaimer to figure * fix: Added links to task info table, switched out license with metric * fix: loading pre 1.11.0 (#1460) * small fix * fix: fix * 1.19.8 Automatically generated by python-semantic-release * fix: swap touche2020 to maintain compatibility (#1469) swap touche2020 for parity * 1.19.9 Automatically generated by python-semantic-release * docs: Add sum 
* docs: Add sum per language for task counts (#1468) * add sum per lang * add sort by sum option * make lint * fix: pinned datasets to <3.0.0 (#1470) * 1.19.10 Automatically generated by python-semantic-release * feat: add CUREv1 retrieval dataset (#1459) * feat: add CUREv1 dataset --------- Co-authored-by: nadshe <[email protected]> Co-authored-by: olivierr42 <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> * feat: add missing domains to medical tasks * feat: modify benchmark tasks * chore: benchmark naming --------- Co-authored-by: nadshe <[email protected]> Co-authored-by: olivierr42 <[email protected]> * Update tasks table * 1.20.0 Automatically generated by python-semantic-release * fix: check if `model` attr of model exists (#1499) * check if model attr of model exists * lint * Fix retrieval evaluator * 1.20.1 Automatically generated by python-semantic-release * fix: Leaderboard demo data loading (#1507) * Made get_scores error tolerant * Added join_revisions, made get_scores failsafe * Fetching metadata fixed for HF models * Added failsafe metadata fetching to leaderboard code * Added revision joining to leaderboard app * fix * Only show models that have metadata, when filter_models is called * Ran linting * 1.20.2 Automatically generated by python-semantic-release * fix: leaderboard only shows models that have ModelMeta (#1508) Filtering for models that have metadata * 1.20.3 Automatically generated by python-semantic-release * fix: align readme with current mteb (#1493) * align readme with current mteb * align with mieb branch * fix test * 1.20.4 Automatically generated by python-semantic-release * docs: Add lang family mapping and map to task table (#1486) * add lang family mapping and map to task table * make lint * add back some unclassified lang codes * Update tasks table * fix: Ensure that models match the names on embedding-benchmarks/results (#1519) * 1.20.5 Automatically generated by python-semantic-release * fix: Adding missing metadata on models and matching names up with the results repo (#1528) * Added Voyage 3 models * Added correct metadata to Cohere models and matched names with the results repo * 1.20.6 Automatically generated by python-semantic-release * feat: Evaluate missing splits (#1525) * fix: evaluate missing splits (#1268) * implement partial evaluation for missing splits * lint * requested changes done from scratch * test for missing split evaluation added * uncomment test * lint * avoid circular import * use TaskResult * skip tests for now --------- Co-authored-by: Isaac Chung <[email protected]> * got test_all_splits_evaluated passing * tests passing * address review comments * make lint * handle None cases for kg_co2_emissions * use new results info --------- Co-authored-by: Thivyanth <[email protected]> * 1.21.0 Automatically generated by python-semantic-release * fix: Correct typos superseeded -> superseded (#1532) fix typo -> superseded * 1.21.1 Automatically generated by python-semantic-release * fix: Task load data error for SICK-BR-STS and XStance (#1534) * fix task load data for two tasks * correct dataset keys * 1.21.2 Automatically generated by python-semantic-release * fix: Proprietary models now get correctly shown in leaderboard (#1530) * Fixed showing proprietary models in leaderboard * Added links to all OpenAI models * Fixed table formatting issues * Bumped Gradio version * 1.21.3 Automatically generated by python-semantic-release
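The missing-splits feature (#1525 above) boils down to two small steps: work out which eval splits a cached TaskResult lacks, run only those, and merge the per-split scores back in. A rough sketch of the idea, assuming the scores-keyed-by-split layout that TaskResult uses; the helper names are illustrative, not mteb's API:

```python
# Sketch of the idea behind #1525: find which eval splits a cached result
# lacks, evaluate only those, then merge per-split scores. Helper names are
# illustrative; mteb's real logic lives in MTEB.run / TaskResult.
def missing_splits(task_eval_splits: list[str], existing_scores: dict) -> list[str]:
    """Splits the task expects but the cached result does not contain."""
    return [s for s in task_eval_splits if s not in existing_scores]


def merge_scores(existing_scores: dict, new_scores: dict) -> dict:
    """Combine cached per-split scores with freshly evaluated ones."""
    merged = dict(existing_scores)
    merged.update(new_scores)  # newly evaluated splits take precedence
    return merged


# A cached result that only covers "validation" while the task wants "test" too:
assert missing_splits(["validation", "test"], {"validation": {}}) == ["test"]
```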
* docs: Add Model Meta parameters and metadata (#1536) * add multi_qa_MiniLM_L6_cos_v1 model meta * add all_mpnet_base_v2 * add parameters to model meta * make lint * add extra params to meta * fix: add more model meta (jina, e5) (#1537) * add e5 model meta * address review comments * 1.21.4 Automatically generated by python-semantic-release * Add cohere models (#1538) * fix: bug cohere names * format * fix: add nomic models (#1543) #1515 * fix: Added all-minilm-l12-v2 (#1542) #1515 * fix: Added arctic models (#1541) #1515 * fix: add sentence trimming to OpenAIWrapper (#1526) * fix: add sentence trimming to OpenAIWrapper * fix: import tiktoken library inside encode function * fix: check tokenizer library installed and update ModelMeta to pass tokenizer_name * fix: pass tokenizer_name, max_tokens to loader * fix: make tokenizer_name None for default * fix: delete changes for ModelMeta * fix: fix revision to 2 for OpenAI models * fix: add docstring for OpenAIWrapper * fix: lint * feat: add openai optional dependency set * fix: add sleep for too many requests * fix: add lint * fix: delete evaluate file * 1.21.5 Automatically generated by python-semantic-release * fix: Fixed metadata errors (#1547) * 1.21.6 Automatically generated by python-semantic-release * fix: remove curev1 from multilingual (#1552) Seems like it was added here: https://github.com/embeddings-benchmark/mteb/commit/1cc6c9e0fe62ca4e77708b641823fa1a121f048b * 1.21.7 Automatically generated by python-semantic-release * fix: Add Model2vec (#1546) * Added Model2Vec wrapper * Added Model2vec models * Added model2vec models to registry * Added model2vec as a dependency * Ran linting * Update mteb/models/model2vec_models.py Co-authored-by: Kenneth Enevoldsen <[email protected]> * Update mteb/models/model2vec_models.py Co-authored-by: Kenneth Enevoldsen <[email protected]> * Added adapted_from and superseeded_by to model2vec models. * Added missing import * Moved pyproject.toml to optional dependencies * Fixed typos * Added import error and changed model to model_name * Added Numpy to frameworks * Added Numpy to frameworks * Corrected false info on model2vec models * Replaced np.inf with maxint * Update mteb/models/mode…
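The sentence-trimming bullets above (#1526) describe truncating inputs to the model's token budget with tiktoken before calling the OpenAI API. A minimal sketch of that step; the encoding name and budget are common defaults, not necessarily what the wrapper resolves from ModelMeta:

```python
import tiktoken


def trim_to_max_tokens(text: str, max_tokens: int = 8191,
                       encoding_name: str = "cl100k_base") -> str:
    """Truncate text so the embedding request stays within the token budget."""
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])
```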
* Update tasks table * 1.31.6 Automatically generated by python-semantic-release * Update tasks table * fix: remove SummaryRetrieval as a type (#1915) * Update tasks table * fix: revert rename and add to description (#1918) * Update
tasks table * docs: Add sort to domains for task metadata (#1922) Tests currently go into an infinite loop. This should prevent that. * Update tasks table * 1.31.7 Automatically generated by python-semantic-release * docs: Updated citation for mteb(scandinavian) (#1914) fix: Updated citation for mteb(scandinavian) * fix: Add datasets in CodeRAG-Bench (#1595) * add three out of four datasets in CodeRAG-Bench * add verified CodeRAGStackoverflowPostsRetrieval dataset * clean up code and make some comments * fixed lint errors * addressed comments about code-rag datasets: fixed grammar and removed unnecessary code and loops * roll back files which were not supposed to change * fixed the comments in split_by_first_newline() and made the methods private by adding an underscore prefix * refactor to use common args * update task descriptions * add entry in benchmarks * correct the alphanumeric order for the dataset * add in tasks.md * add in tasks.md * update task metadata * update importing path * fix lint errors * correct CodeRAG task metadata description field and id for stackoverflow-posts * fix error in test --------- Co-authored-by: Isaac Chung <[email protected]> * Update tasks table * 1.31.8 Automatically generated by python-semantic-release * Leaderboard: Acks (#1930) Add acks * misc: add warning for save_suffix removal from AbsTask (#1940) add warning for param removal * misc: add bgev1 models (#1928) * add bgev1 models * add bge-*-en * fix naming * Updated links in MTEB(eng) and eng,classic (#1948) * feat: add beir (#1933) add beir * 1.32.0 Automatically generated by python-semantic-release * Fixed join_revisions if results are empty (#1949) * feat: Merge MIEB into main 🎉 (#1944) * mieb ZeroshotClassification * mieb docs * mieb implementation demo * model meta; abstask column names; linear probe clf * model meta; abstask column names; linear probe clf * fix: update naming as candidate_labels * Update README.md * Update README.md * i2tretrieval * test load data ignore i2tretrieval * [MIEB] Add image clustering (#1088) * make lint * wip * add TinyImageNet and run * type hints * add accuracy * lint * remove unused & fix typos * T2I Retrieval * Any2AnyRetrieval * fix tests from merge * [MIEB] Add image text pair classification and tests (#1099) * add ImageTextPairClassification abstask and evaluator * dataset transform into sequence of images for each sample * fix processing logic; list of list images compatibility * lint and docstrings * make lint * fix failing tests in TaskMetadata * add tests for mieb * skip gated repo --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [MIEB] Add image classification and zero shot classification tasks (#1101) * fix task metadata * use overrideable column names * add CIFAR datasets * add caltech101 dataset * add FGVC aircraft dataset * add food 101 dataset * add OxfordPets dataset * remove comments * correct cifar100 path * update cifar100 classification results * cifar zero shot results * add caltech101 zero shot * matching CLIP paper implementation * add aircraft and food zero shot * add oxford pets zero shot * [MIEB] Add CIFAR clustering (#1104) add CIFAR clustering * [MIEB] Add more image classification and zero shot classification datasets (#1103) * update category to i2t * add MNIST linear probe
and zero shot * add FER2013 linear probe and zero shot * add stanford cars linear probe and zero shot * add birdsnap linear probe and zero shot * add eurosat linear probe and zero shot * lint * correct eurosat zero shot labels * add abstask for image multilabel and voc2007 * make lint * [MIEB] Add more image classification and zero shot datasets (#1105) * add STL10 linear probe and zero shot * add RESISC45 linear probe and zero shot * add Describable textures linear probe and zero shot * fix spacing lint * add SUN397 linear probe and zero shot * correct SUN397 zero shot captions * add baai bge vista * add e5-v * linting * memory issues for image linear probe & zeroshot * knn linear probe arguments * del comments * Add some classification and ZeroShot classification tasks (#1107) * Add Country211 classification task * Add imagenet1k classification task * Add UCF101 classification task * Add PatchCamelyon Classification task * Add GTSRB classification task * Add GTSRB Zero Shot Classification * Add country211 zero shot classification * Add results for classification tasks * Add zero shot classification tasks * Add PatchCamelyon tasks and results * Add linting * Add results and fix prompts for zero shot * Add results * Add results and linting * fix dependency & clip mock test * [MIEB] Add jina clip (#1120) * add jina clip and mscoco i2t and t2i results * make lint * [MIEB] Update `mieb` with the `main` branch and some fixes (#1126) * fix instruction retrieval (#1072) * fix instruction retrieval * fix test * add points * make nested results * add test * skip instruction test * fix instruction passes * fix unions * move do_length_ablation Co-authored-by: Kenneth Enevoldsen <[email protected]> --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * Update points table * fix: fix bug-causing spelling error in function name of e5-mistral-instruct (#1106) found bug * 1.12.85 Automatically generated by python-semantic-release * fix: MultilingualSentimentClassification (#1109) * Update points table * fix: Avoid spaces in dataset name for CQADupstack and ignore speed tasks * 1.12.86 Automatically generated by python-semantic-release * fix: Ensure that MLSUMClusteringP2P.v2 use the fast implementation as was intended (#1112) * fix: Ensure that MLSUMClusteringP2P.v2 use the fast implementation as was intended * fix: fixed formatting for cli * docs: improve searchability in the advanced usage documentation * 1.12.87 Automatically generated by python-semantic-release * docs: improve searchability in the advanced usage documentation (#1113) * docs: improve searchability in the advanced usage documentation * docs: update based on corrections * fix: export type for `mteb create_meta` (#1114) * fix export type * fix dataset version too * 1.12.88 Automatically generated by python-semantic-release * fix: Simplify models implementations (#1085) * Merge * Adapt * Simplify * Check for rev again * Rmv cmmnt * Simplify * simplify * Rmv comment Co-authored-by: Kenneth Enevoldsen <[email protected]> * Use logging; change try except; add info * Lint * Rmv results * Update rev * format * Simplify models; Allow instructions * Jobs * Fix merge * Format * Adapt models * fix: ensure that e5 ignores the NQ * format --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * 1.12.89 Automatically generated by python-semantic-release * fix: nomic models using prefix correctly (#1125) * fix: nomic models using prefix correctly * chore: remove comment * fix: handling in case not torch tensor * Fix typo --------- Co-authored-by: Niklas Muennighoff <[email protected]>
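The zero-shot classification tasks in the entries above follow the CLIP recipe: embed one text prompt per class name, embed the image, and take the highest-similarity class. A sketch using the Hugging Face CLIP API; the checkpoint and prompt template are the common defaults, not necessarily what these tasks use:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def zero_shot_classify(image: Image.Image, class_names: list[str]) -> str:
    # one prompt per class, as in the CLIP paper
    prompts = [f"a photo of a {name}." for name in class_names]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape (1, n_classes)
    return class_names[int(logits.argmax(dim=-1))]
```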
* 1.12.90 Automatically generated by python-semantic-release * refactor vista model wrapper to contain lib import * python 38 type hints --------- Co-authored-by: Roman Solomatin <[email protected]> Co-authored-by: Kenneth Enevoldsen <[email protected]> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: anpalmak2003 <[email protected]> Co-authored-by: github-actions <[email protected]> Co-authored-by: Niklas Muennighoff <[email protected]> Co-authored-by: Zach Nussbaum <[email protected]> Co-authored-by: chenghao xiao <[email protected]> * image memory issues for all retrieval Abstasks * Add CLEVR and SciMMIR Image-Text Understanding tasks (#1127) * Add CLEVR and SciMMIR * Update metadata * remove useless comment * Add linting * fix typo and tests * Add CLEVR count task * add linting * add fashion200k & fashionIQ test passed * clip text max seq truncation * add WebQA, NIGHTS, OVEN * any2any retrieval chunk encoding * add nomic vision model; any2any topk bug * add cv recall * add InfoSeek; VisualNews * [MIEB] Add Stanford Cars i2i Retrieval (#1147) * wip * add results * make lint * change back the order * [MIEB] Add CUB200 i2i retrieval (#1154) * add cub200 and results * add skip_first_result * skipped self and rerun results * consolidate i2t and t2i to any2any * remove abstask and evaluators * remove references from test * add TU-Berlin sketch retrieval * XM3600; XFlickr30kCO; multilingual * wit multilingual retrieval t2i * correct multilingual t2i meta * meta * add dinov2 model; 4 sizes * cls evaluator channel bug fix * add ALIGN model * add FORBI2IRetrieval * forb & tuberlin new revision * disable tokenization parallelism * add hateful meme retrieval i2tt2i * add memotion retrieval t2ii2t * add SciMMIR Retrieval i2tt2i * ruff update * Visual STS Abstask&evaluator * add visual STS17 * add visual STS 12-16 * [mieb] Add blip and blip2 models, and ImageNetDog15Clustering task (#1226) * wip: start adding BLIP models * add other blip variants * wip: add blip2_models.py * make lint * wip: implement blip2 wrapper * feat: add blip2 models, still mismatched names * fix: remove projections from image and text embeddings * make lint * wip: add coco BLIP2 * fix: BLIP2 better zero-shot classification without text_proj and vision_proj * tidy blip2 * add imagenet-dog-15 dataset * tidy and lint * remove unused import * add cluster_accuracy, ari and nmi to Image.ClusteringEvaluator * add imagenet-10 clustering task * add results for clip on ImageNet10Clustering and ImageNetDog15Clustering * [mieb] add 3 compositionality evaluation tasks (#1229) * linting & update unavailable dataset path * add aro visual relation&attribution; sugarcrepe * correct reference * add SOPI2IRetrieval dataset/task (#1232) * add SOPI2IRetrieval * add SOPI2IRetrieval results for clip 32 * add results for clip vit 32/SOPI2IRetrieval * resolve conflict * change reference * Image text pair cls (#1233) * fix ImageTextPair dataloading for large datasets; more compositionality evaluation datasets * fix meta data * fix validate points --------- Co-authored-by: Isaac Chung <[email protected]> * Add RP2kI2IRetrieval and METI2IRetrieval (#1239) * add RP2kI2IRetrieval dataset * add RP2kI2IRetrieval results with clip-vit-base-patch32 * update image retrieval __init__.py * add RP2kI2IRetrieval and METI2IRetrieval * add METI2IRetrieval * add SOP results * make lint * new revision for METI2IRetrieval * make lint * reset corpus chunk size * remove wrong classification import * add Flickr30k T2I and I2T * add Flickr30k T2I retrieval * reduced-size MET revision * fix: add Flickr30k T2I * make lint * [MIEB] Adding DataComp CLIP models (#1283) * adding data comp CLIP models * update model and caltech101 results * make lint * [mieb] Any2TextMultipleChoice Abstask&Evaluator & four tasks in CV-bench (#1287) * CV-Bench * evaluator args comment * fix --------- Co-authored-by: Isaac Chung <[email protected]> * [mieb] adding 10 tasks (#1290) * add vidore benchmark 10 tasks * fix reference * fix old metadata * fix meta * [mieb] Adding MOCOv3 models (#1293) * add moco models first try * add as a timm model * add large model results * make lint * [mieb] Add more Any2AnyRetrieval datasets (#1285) * add two landmark datasets and results * add Sketchy i2i retrieval * add task metadata * add BLINKIT2IRetrieval dataset * add BLINKIT2TRetrieval * add ImageCoDeT2IRetrieval * make lint * add vizwiz retrieval and results * fix vizwiz duplicate texts * add new vizwiz results * add VQA2 results * add GLD v2 I2T retrieval * add gld v2 i2i retrieval * make lint * remove GLDv2I2IRetrieval * [mieb] Add any2any multiple choice evaluator and abstask (and one task) (#1301) * add AbsTaskAny2AnyMultiChoice * make lint * exclude AbsTaskAny2AnyMultiChoice from test_load_data * [mieb] Fix FORB dataset (#1306) * correct format * update results * add more results * add more results * [mieb] run tasks fix (#1302) * fix e5v&vista * task type fix for running tasks * fix wrong meta * run mieb script * script * lint * align * [mieb] split RParisI2IRetrieval and ROxfordI2IRetrieval into easy, medium and hard versions (#1305) * split ROxford, RParis into easy, medium and hard * make lint --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] run tasks small fix (#1310) * fix * linting * [mieb] Add VLM2vec (#1323) * wip vlm2vec model * making i2t classification work with Caltech101 * test vlm2vec on other task types * move peft into class * feat: Merge main into MIEB (#1329) * fix: OpenAI BadRequestError by limiting input dimensions to 2048 elem… (#1203) * fix: OpenAI BadRequestError by limiting input dimensions to 2048 elements (#1201) Fix OpenAI BadRequestError by limiting input dimensions to 2048 elements - Ensure the 'sentences' list passed to OpenAI API does not exceed 2048 elements - Reference: OpenAI's Embedding API documentation on input limits Co-authored-by: Ali Shiraee <[email protected]> * fix ruff formatting * Added minor test fixes to ensure reproducibility across systems * Ensure that tmp.json is not created within repo when running tests * format * fixes path issues * Rerun CI --------- Co-authored-by: HSILA <[email protected]> Co-authored-by: Ali Shiraee <[email protected]>
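The BadRequestError entries above come from OpenAI's embeddings endpoint rejecting input lists longer than 2048 items, so the fix is plain chunking. A sketch with the openai-python v1 client; the model name is illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MAX_BATCH = 2048   # documented input-list limit of the embeddings endpoint


def embed_all(sentences: list[str], model: str = "text-embedding-3-small") -> list[list[float]]:
    embeddings: list[list[float]] = []
    for i in range(0, len(sentences), MAX_BATCH):
        batch = sentences[i : i + MAX_BATCH]
        response = client.embeddings.create(model=model, input=batch)
        embeddings.extend(d.embedding for d in response.data)
    return embeddings
```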
* fix: Ensure STS pearson and spearman does not use the p-value only the correlation (#1207) Fixes #1206 * 1.14.16 Automatically generated by python-semantic-release * fix: Normalize licenses including casing, uses of "-" etc. * fix: Normalize licenses including casing, uses of "-" etc. (#1210) * fix: Normalize licenses including casing, uses of "-" etc. * fix tests * 1.14.17 Automatically generated by python-semantic-release * fix: Normalize benchmarks to only include task objects and added getter for benchmarks (#1208) * Normalize benchmarks to only include tasks - Force benchmarks to only include tasks. This fixes a few bugs where benchmarks can reference a task which is not implemented - implements `mteb.get_benchmark`, which makes it easier to fetch benchmarks - Added tests + updated docs A few outstanding issues: I would like `mteb.MTEB(benchmark)` to always reproduce the benchmark. Currently this is not possible as MTEB(eng) requires the split to be specified. A solution is to allow `eval_splits` to be specified when initializing a task and then pass it on to `load_data()`. This way we can write the following: `mteb.get_tasks(tasks=[...], eval_splits=["test"], ...)` I would also love the aggregation to be a part of the benchmark (such that it is clear how it should be aggregated). This is especially relevant for MTEB(eng) as it averages the CQAD datasets before creating the global average. This way we can also create a result object for the benchmark itself. A complementary solution for this is to allow nested benchmarks. * fix error in tests * format * Added corrections based on review * added example and formatted * 1.14.18 Automatically generated by python-semantic-release * docs: Fix broken links in docs (#1212) * Added fixes for broken links in adding_a_dataset and adding_a_model docs. * Updated link name * Mismatch of the category of AmazonPolarityClassification (#1220) Fixes #1219 * Update tasks table * fix: Ensure that results are returned even when hitting cache (#1215) Fixes #1122 * 1.14.19 Automatically generated by python-semantic-release * fix: Allow benchmark to specify eval_splits (#1217) * fix: Allow benchmark to specify eval_splits This PR allows benchmarks to specify specific eval splits. This allows us to fully specify a benchmark within the benchmark object. To do this it adds the following: - added eval_splits to the Abstask object, which defaults to metadata.eval_splits - use the task.eval_splits unless overwritten in mteb.MTEB.run - added eval_splits arg to mteb.get_tasks, which filters the tasks based on splits - updated documentation - renamed "Advanced Usage" to "Usage Documentation" to make it more accessible - added tests where relevant * Added correction based on feedback * 1.14.20 Automatically generated by python-semantic-release * Update points table * Update points table * docs: clarify adding a model (#1222) * fix: Add RepLLaMA style models (#1223) * init commit * working and reproducing * lint * update hashes * warning * add pyproject * Update points table * 1.14.21 Automatically generated by python-semantic-release * docs: Update points (#1228) * Fix case * Fix casing * Fix case * Fix case * Create 971.jsonl * Update contrib * Add contributors * Update points table * docs: Add MTEB(code) dataset (#1237) * docs: Add MTEB(code) dataset * Fix linting * Update points table * Update of my affiliation (#1242) Update points.md * Add contributor (#1243) * fix: @mrshu's name in `points.md` (#1246) * Use the diacritic character to be in line with Slovak spelling. Signed-off-by: mr.Shu <[email protected]>
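The eval_splits PR (#1217 above) makes the snippet quoted in #1208 work end to end; usage looks roughly like this (the task name is chosen for illustration):

```python
import mteb

# Filter tasks down to the test split, per the eval_splits argument added in #1217.
tasks = mteb.get_tasks(tasks=["Banking77Classification"], eval_splits=["test"])
evaluation = mteb.MTEB(tasks=tasks)
# evaluation.run(model) then evaluates only the requested split
```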
* docs: Create benchmarks overview table (#1245) * fix get_benchmarks method * add create benchmark script * make lint * 1.14.22 Automatically generated by python-semantic-release * docs: Update affiliation (#1247) Update points.md * Added author-information * Add final author list * Update points table * docs: Added coordination point for Jimmy Lee (#1253) docs: Added coordination point for Jimmy Lee for his work on the coordination of Crystina and Nandan * Update points table * fix: Add multilingual Benchmark (#1252) * fix: Add multilingual bench * Update mteb/benchmarks/benchmarks.py Co-authored-by: Niklas Muennighoff <[email protected]> * format --------- Co-authored-by: Niklas Muennighoff <[email protected]> * 1.14.23 Automatically generated by python-semantic-release * docs: Small point changes & more contributors (#1254) * Update points.md * Fix format * Fix attribution * Update points table * fix: Downsample large retrieval datasets (#1236) * most tasks * lint * fix other issues * refactor * lint and docs * add polish * keep case sensitive mteb paths * add potential points * fix points * fix test about metadata * update tasks and stats * lint * Update points table * Update tasks table * 1.14.24 Automatically generated by python-semantic-release * fix: Get meta from CrossEncoder (#1255) * remove indent after return * handle cross encoders for model meta * make lint * update filename since we now have model name * 1.14.25 Automatically generated by python-semantic-release * fix: Add listing all available benchmarks CLI option (#1256) * add benchmarks.md in README * add cli option * add benchmark cli test case * correct typo * 1.14.26 Automatically generated by python-semantic-release * docs: Update affiliation (#1248) * Update points.md * Update points.md --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * docs: Update mteb(eng) calculation (#1258) * Update mteb(eng) calculation * Fixed citations * Update MTEB(eng) + MTEB(multilingual) * feat: leverage SentenceTransformers' query/passage specific prompts (#1221) * feat: leverage SentenceTransformer models' query/passage specific prompts * refactor: remove E5Wrapper fix: wrong e5 revisions * fix: default prompt_type to None * fix: e4ce987 revision no longer exists for multilingual-e5-small on the Hub * fix: keep `prompt_name` in kwargs when model doesn't have a `prompts` attr * feat: use Enum for `prompt_type` * docs: specify how to use prompts with Sentence Transformers * feat: readd arctic models due to metadata * 1.15.0 Automatically generated by python-semantic-release * fix: Add Touche2020v3 and JMTEB (#1262) * add datasets * fix metrics * add Touche2020v3 * fix metadata * Apply suggestions from code review Co-authored-by: Kenneth Enevoldsen <[email protected]> * upd name and suppress * add benchmark class --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * Update tasks table * 1.15.1 Automatically generated by python-semantic-release * fix: Select benchmarks CLI option (#1261) * add test case for a list of Benchmarks * add selecting benchmarks CLI option * typos * use a separate attribute for benchmarks * try fixing tests * should accept string as well * revert filename change * use Benchmark and avoid circular import * fix: derive `results_directory` path from `results_repo` name (#1275) fix: don't hardcode repo name when downloading results * 1.15.2 Automatically generated by python-semantic-release * fix: sorting benchmark tasks by MTEB, then alphabetical (#1271) * sorted * fixed formatting * efficiency changes * fix test * make lint --------- Co-authored-by: Isaac Chung <[email protected]> * 1.15.3 Automatically generated by python-semantic-release * ci: Removed 3.8 dependency (#1281) Changes include: - remove 3.8 from tests (added 3.11 and 3.12) - changed other CI to 3.9 - updated lint rules to use 3.8 * Update points table * fix: Allow Numpy >=2.0 (#1264) Allow Numpy >=2.0 * 1.15.4 Automatically generated by python-semantic-release * docs: points for paper writing (#1286) * Create 1004.jsonl * Create 1006.jsonl * Update docs/mmteb/points/1004.jsonl * Update docs/mmteb/points/1006.jsonl --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * Update points table * Update points table * Update points table * docs: Fix a link in the README (#1289) * Fix a link in the README And fix some typos. * Update README.md * Update points table * fix: Update benchmarks (#1288) * make benchmark var name uppercase * update touche to v3 * add MIRACLRetrievalHardNegatives to multilingual * add mteb(indic) * add eu benchmark * 1.15.5 Automatically generated by python-semantic-release * fix: Allow numpy<2.0.0 (#1291) * 1.15.6 Automatically generated by python-semantic-release * fix: Add metadata dict to QBQTC in C-MTEB (#1292) * fix QBQTC in C-MTEB * make lint --------- Co-authored-by: Isaac Chung <[email protected]> * 1.15.7 Automatically generated by python-semantic-release * fix: Remove non-existent eval split of CMNLI (#1294) fix eval_splits of CMNLI * 1.15.8 Automatically generated by python-semantic-release * Leaderboard (#1235) * Add leaderboard dev * Renamed MTEBResults to TaskResult * Moved model and model meta loading utilities into overview.py * Added get_model_metas to retrieve filtered metadata for models * Restructured results object and made it into a class instead of a dict * Added utilities for filtering models on BenchmarkResults objects * Added to_table utility function to BenchmarkResults * Added serialization utilities to BenchmarkResults * Attempted fixing tests * Added get_model_metas to __init__ * Added get_benchmarks to __init__ and made it return all benchmarks by default * Added get_benchmarks to __init__ * Made tasks hashable * Added task filtering based on task objects on BenchmarkResults * Added BenchmarkResults to __init__ * Added additional arguments to get_scores on two classes * Made get_scores smarter on BenchmarkResult * Added basic multilingual benchmark * Modified benchmark to be able to easily access results * Added useful properties and filtering functions to BenchmarkResults * Added minimal functioning example * Added smarter table, task-list updating and tried fixing dropdown scrolling * Made restrict_results into a private function Co-authored-by: Kenneth Enevoldsen <[email protected]> * Removed old leaderboard scripts * Hardcoded max and min model size * Removed redundant utils file * Ran linting * added leaderboard dependencies as optional * Fixed union type error on Python 3.9 * Removed references to Dict in task aggregation * Fixed name errors in _restrict_task_results * Fixed _restrict_task_results * Made hf_subsets={'default'} when the task is monolingual in _restric_task_results * Task dropdown now gets filtered based on the other criteria * Ran linting again * Introduced hotfix for reranking test * Added BenchmarkResults to __all__ in __init__ * Fixed validate_and_filter_scores method, and replaced _restric_task_results with it --------- Co-authored-by: Kenneth Enevoldsen <[email protected]>
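The prompt-related PRs here (#1221 above, #1278 just below) lean on Sentence Transformers' built-in prompt mechanism rather than custom encode_queries/encode_corpus methods. A sketch of that upstream API; the model and prompt strings follow the common E5 convention and are illustrative:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "intfloat/multilingual-e5-small",
    prompts={"query": "query: ", "passage": "passage: "},
)
# The named prompt is prepended to each input before encoding.
query_emb = model.encode(["how do embeddings work?"], prompt_name="query")
doc_emb = model.encode(["Embeddings map text to vectors."], prompt_name="passage")
```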
* feat: Use prompts instead of encode_corpus and encode_queries (#1278) * add prompt per task type * fix prompt * upd test * lint * fix test * fix DeprecatedSummarizationEvaluator * fix prompts * add test * lint * logger info * use task type only in model_encode * lint * update interface * add prompt types to docs * fix test * mock tasks * mock task registry * remove last task_type * fix tests * lint * fix test * fix * use wrapper and new prompts * fix tests * lint * fix test * remove conftest * validate task to prompt_name * override model prompts * task to prompt name optional * fix tests * fix models * remove task_to_prompt_name * remove from mteb __init__ * update docs * load existing model prompts if model_prompts is None * fix * lint * change wrapper loader * add wrapper class * lint * add wrapper file * update logging * upd logging * refactor reranking * lint * remove prints * 1.16.0 Automatically generated by python-semantic-release * fix: Add Retrieval SK Quad dataset for Slovak search evaluation (#1276) * Add Retrieval SK Quad dataset for Slovak search evaluation This commit introduces the Retrieval SK Quad dataset, designed to assess Slovak search performance. The dataset is derived from SK-QuAD and includes questions with their best answers categorized post-annotation. This addition provides a significant resource for advancing Slovak language search evaluation and supporting further research and development. * Add Retrieval SK Quad dataset for Slovak search evaluation 2 Added the requested changes on the SKQuadRetrieval.py file * add task to init * add missing task metadata --------- Co-authored-by: Isaac Chung <[email protected]> * Update tasks table * 1.16.1 Automatically generated by python-semantic-release * fix: Add Slovak Hate Speech and Offensive Language Dataset (#1274) * Add Slovak Hate Speech and Offensive Language Dataset This commit introduces the Slovak Hate Speech and Offensive Language Database to MTEB. The dataset includes posts from a social network, annotated by humans for hate speech and offensive content. Additionally, the corresponding task has been added to the tasks.md table to reflect this update. * Add Slovak Hate Speech and Offensive Language Dataset - Updated __init__.py to include the new SlovakHateSpeechClassification task. - Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability. * Did requested changes: - Updated __init__.py to include the new SlovakHateSpeechClassification task. - Modified SlovakHateSpeechClassification.py as per review suggestions to enhance functionality and readability.
* resolve linting issues by running `make lint` * Update tasks table * WIP: Leaderboard UI improvements (#1312) * Fixed typos in task_results * Fixed typos in task_results * Added Tailwind, reorganized layout and fixed scrolling * Ran linting * 1.16.2 Automatically generated by python-semantic-release * fix: remove duplicate multilingual * 1.16.3 Automatically generated by python-semantic-release * fix: Re-upload dataset to hub to avoid using script upload (#1322) * fix dataset upload * add linting * Update tasks table * 1.16.4 Automatically generated by python-semantic-release * fix: Add implementations of common reranker models (#1309) * init * revert * revert * add metadata * lint * add reqs * change to float16 * benchmark lint fix * 1.16.5 Automatically generated by python-semantic-release * Add multilingual mFollowIR dataset (#1308) * add mFollowIR * paper name * edit warning->info * convert to parquet * lint * Update tasks table * Cache the embeddings when requested (#1307) * add caching * update test to use close * change from json to pkl * fix for Windows * cleanup on Windows again * infer dimension * move cachewrapper * add wrapper * fix * updates * fix tests * fix lint * lint * add test * WIP: Leaderboard UI improvements (#1320) * Fixed typos in task_results * Fixed typos in task_results * Added Tailwind, reorganized layout and fixed scrolling * Ran linting * Removed faux benchmark * Updated layout * Changed table number format * Table highlights highest values by making them bold * Added rank to table, removed organization from model_name * Added mean rank to table * Ran linting * feat: Update metadata for all models (#1316) * Added model meta * format * fixed metadata * Metadata update for voyage models * Update mteb/models/cohere_models.py Co-authored-by: Roman Solomatin <[email protected]> * Update mteb/models/cohere_models.py Co-authored-by: Roman Solomatin <[email protected]> * Added corrections from review * fix spelling error --------- Co-authored-by: Roman Solomatin <[email protected]> * resolved bugs from pytest --collect-only * Avoid wrapping all models with the SentenceTransformerWrapper * Added normalize_embeddings_to_numpy to ensure standard embeddings during evaluations * fixed moved on correction from @Samoed * conditionally set .predict method on SentenceTransformerWrapper --------- Signed-off-by: mr.Shu <[email protected]> Co-authored-by: HSILA <[email protected]> Co-authored-by: Ali Shiraee <[email protected]> Co-authored-by: github-actions <[email protected]> Co-authored-by: Thomas van Dongen <[email protected]> Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com> Co-authored-by: Niklas Muennighoff <[email protected]> Co-authored-by: Orion Weller <[email protected]> Co-authored-by: John Yang <[email protected]> Co-authored-by: Imene Kerboua <[email protected]> Co-authored-by: Marek Šuppa <[email protected]> Co-authored-by: Isaac Chung <[email protected]> Co-authored-by: Xa9aX ツ <[email protected]> Co-authored-by: Roman Solomatin <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> Co-authored-by: Sathvik Nallamalli <[email protected]> Co-authored-by: Michael Graczyk <[email protected]> Co-authored-by: Mariya Hendriksen <[email protected]> Co-authored-by: Santiago Castro <[email protected]> Co-authored-by: Joey Xia <[email protected]> Co-authored-by: Márton Kardos <[email protected]> Co-authored-by: Oliver <[email protected]>
* [mieb] Add OpenCLIP models (#1335) * add open clip models * Update __init__.py * lint * fix model overview * update jina clip --------- Co-authored-by: chenghao xiao <[email protected]> Co-authored-by: gowitheflow-1998 <[email protected]> Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] new version with downsampled train split to 32 per class (#1327) * new version with downsampled train split to 32 per class * force load truncated image file * make lint * add open clip models * Update __init__.py * lint * fix model overview * fix ImageCLS undersample; run birdsnap * make lint * make lint --------- Co-authored-by: chenghao xiao <[email protected]> Co-authored-by: gowitheflow-1998 <[email protected]> Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] Fix Jina CLIP (#1349) fix jina clip v1 * fix: Add clevr license (#1356) * Add BLINK as multi-choice tasks (#1348) * add BLINK as multi choice tasks * fix: license metadata in wrong format --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] add Eva CLIP models (#1369) * add Eva CLIP models * make lint * [mieb] add siglip, cohere multimodal & some fixes for final run (#1357) * fix dataset type error * fix clustering metrics * add siglip & cohere * update mieb run script * cohere-v import * fix * api key name * [mieb] fixes for final run (#1374) * e5_v device arg * dataloader num_workers * vista doc * vista doc * run mieb * fix * Update run_vista.md * [mieb] Fix torch no grad (#1378) Fix torch no grad * [mieb] Fix vlm2vec (#1380) * fix vlm2vec return dtype * make lint
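The "Fix torch no grad" entry (#1378 above) is the standard evaluation hygiene of not building autograd graphs while encoding; the usual pattern looks like this (the wrapper class and its encode_image method are hypothetical stand-ins for the mieb model wrappers):

```python
import torch


class ImageEncoderWrapper:
    """Hypothetical wrapper; mirrors the no-grad pattern used at eval time."""

    def __init__(self, model):
        self.model = model.eval()  # also disables dropout/batch-norm updates

    @torch.no_grad()  # no autograd graph: far lower memory use during encoding
    def get_image_embeddings(self, images, batch_size: int = 32):
        return self.model.encode_image(images)  # encode_image is a stand-in
```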
* [mieb] Remove null entries from corpus of ROxford, RParis (#1371) * remove null examples from corpus of ROxford and RParis --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] fixes (#1390) * Fix torch no grad * simplify * make lint --------- Co-authored-by: Isaac Chung <[email protected]> * [MIEB] Remove non-existent method for blip (#1394) remove non-existent method for blip * [mieb] fix ALIGN; update Winoground revision id; update run script (#1391) * fix align & winoground * lint * Convert task category to i2i for tasks that only call image encode * update categories should include img cls, clustering, and multi label clf * no op * no op * make lint --------- Co-authored-by: Isaac Chung <[email protected]> * [mieb] Fix open clip for cv bench count (#1397) fix shape mismatch * [mieb] Update subtasks of BLINKIT2TMultiChoice and BLINKIT2IMultiChoice (#1403) * fix: add/remove subtasks from BLINKIT2IMultiChoice and BLINKIT2TMultiChoice * update blink metadata * add updated BLINK results --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] Fix EVA CLIP for CV Bench (#1414) * unsqueeze after preprocess * make lint * [mieb] Add calculate probs for vlm2vec (#1418) * add method * make lint * [mieb] Fix siglip bug & add retrieval datasets (#1424) * fix siglip * add edis&gld-v2 i2i * results * siglip updated results * fix siglip non-dataloader tasks * [mieb] use Logistic Regression classifier for AbsTaskImageMultilabelClassification (#1420) * use moc-lr classifier * set n_experiments=5 * run dinov2 and some laion models * add dinov2-giant results * [mieb] mieb scripts (siglip rerun & linear probing ablation & params count) (#1429) * mieb scripts * lint * [MIEB] Change Flickr30k to test split (#1449) * merge upstream mieb * change Flickr30k to test split * change flickr to test split --------- Co-authored-by: gowitheflow-1998 <[email protected]> * [mieb] Fix VLM2vec dtype (#1462) * propagate dtype * fix fuse embeddings using list of PIL images * [mieb] run script for missing results (#1472) * task type fix * scripts * [mieb] Fix Moco model on CIFAR10Clustering (#1487) Fix Moco model on CIFAR10Clustering * [mieb] Fix Flickr30k I2T and T2I (#1505) * remake flickr30k i2t and t2i * add openai clip vit-b32 b16 and jina-clip results * make lint * [MIEB] add missing siglip models (#1533) * add updates * lint errors * fix typo (#1535) * add updates * lint errors * fix small typo * [mieb] Fix numbers of CIRR, Fashion200k, FashionIQ, Flickr30k, MSCOCO data statistics (#1544) fix numbers * Discussing a standard for ImageEncoders * Add Voyage's multimodal embedding (#1555) * add voyage multimodal & ran 17 tasks * lint * typo * clean * [mieb] update script for final re-run (#1576) * mieb final runs * lint * fix: no longer using same query text for all of BLINKIT2TMultiChoice (#1572) * fix: no longer using same query text for all of BLINKIT2TMultiChoice * fix: remove blink subtask * fix: remove subtask from blink it2i * fix: align BLINK retrieval to multi choice * add ROxford and RParis I2I multi choice * add retrieval metrics to multi choice evaluator * fix: remove wrong negatives from revisiting multichoice datasets * fix revisiting datasets * add new results for revisiting multichoice * [MIEB] Make multimodal models compatible to `task_name` and `prompt_type` (#1583) * 1. Make `get_xxx_embeddings` follow `encode`. 2. `ImageDataset.transform` could be `None`.
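Two multi-label entries in this stretch — the logistic-regression classifier for AbsTaskImageMultilabelClassification (#1420 above) and the lRAP fix (#1834 below) — map onto standard scikit-learn pieces. A self-contained sketch under that assumption, with synthetic data in place of real image embeddings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import label_ranking_average_precision_score
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-ins: X = frozen image embeddings, Y = binary label matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))
Y = rng.integers(0, 2, size=(200, 5))

clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X[:150], Y[:150])
# predict_proba returns one (n_samples, 2) array per label; take P(label=1).
scores = np.stack([p[:, 1] for p in clf.predict_proba(X[150:])], axis=1)
lrap = label_ranking_average_precision_score(Y[150:], scores)
print(f"lRAP: {lrap:.3f}")
```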
* Apply suggestions from code review Co-authored-by: Kenneth Enevoldsen <[email protected]> * Fix arguments * Try to fix tests --------- Co-authored-by: Kenneth Enevoldsen <[email protected]> * fix image encoder (#1596) * format * fixed tests * lint * [mieb] voyage-v: add exponential backoff and other error handling (#1610) * add voyage multimodal & ran 17 tasks * lint * typo * clean * exponential backoff tmp * downsize large images for voyage api call * voyage error handling * lint * add more results * make tenacity optional * lint * log * [MIEB] Fix `get_fused_emebddings` (#1612) * Fix fused * fix vlm2vec * Fix lint * [MIEB] Add new multimodal retrieval tasks (#1611) * Add new tasks * Fix score type * [MIEB] Switch to ViDoRe BEIR version (#1607) * Fix ViDoRe corpus * fix lint * ViDoRe beir version * Extend MIEB test coverage (#1629) * add one task from each image AbsTask to test grid * add visual sts to test grid * [mieb] Task filtering by modality supported by models (#1633) * fix function signature for moco loader * filter out tasks by model modalities * correct conditions * add model meta to relevant models * use modalities instead and separate out constants * [MIEB] Fix VISTA model (#1638) Fix vista * Warn (#1639) * [mieb] model task modalities matching logic (#1640) fixing task & model modalities matching logic * [mieb] Use mock abstask classes (#1648) * rename to downsampled_dataset_transform * add mock tasks for mieb * wip getting to 57% * make lint * update mock classes to improve coverage * omit mock tasks from some tests * [MIEB] Add code for GME models (#1635) * Add GME * Fix infoseek prompts * Merge instructions * fix: add version check e5-v in mieb (#1723) * add version check for e5v model * Update e5_v.py * make lint * fix: change comparison to bigger than (#1743) change comparison to bigger than * docs: Rework MIEB docs (#1802) * combine mieb docs and move to main docs folder * make flow more coherent * tidy up * skip AfriSentiLID for now #1785 * fix typo: exclude MIEB mock tests * update vista doc * Apply suggestions from code review --------- Co-authored-by: Isaac Chung <[email protected]> * [mieb] Remove results-mieb folder (#1815) remove results-mieb folder * [mieb] fixing lrap computation for multi-label classification (#1834) multi-label cls lrap computation fix * [mieb] Merge from main (#1853) * Update tasks table * 1.19.0 Automatically generated by python-semantic-release * fix: Add the_ugly_duckling.txt for speedtask to Python wheel (#1402) Add the_ugly_duckling.txt for speedtask to Python wheel * 1.19.1 Automatically generated by python-semantic-release * fix: Added the necessary trust_remote_code (#1406) * 1.19.2 Automatically generated by python-semantic-release * docs: Update recommendation for pushing results (#1401) fix: Update recommendation for pushing results * docs: Fix a typo in README (#1430) Fix typo in readme * fix: add logging for RetrievalEvaluator NaN values for similarity scores (#1398) Fixes #1389 * 1.19.3 Automatically generated by python-semantic-release * fix: make samples_per_label a task attribute (#1419) make samples_per_label a task attr * fix: Add Korean AutoRAGRetrieval (#1388) * feat: add AutoRAG Korean embedding retrieval benchmark * fix: run --- 🧹 Running linters --- ruff format . # running ruff formatting 716 files left unchanged ruff check . --fix # running ruff linting All checks passed! 
* fix: add metadata for AutoRAGRetrieval * change link for markers_bm * add AutoRAGRetrieval to init.py and update metadata * add precise metadata * update metadata: description and license * delete descriptive_stats in AutoRAGRetrieval.py and run calculate_matadata_metrics.py * fix: Add missing benchmarks in benchmarks.py (#1431) Fixes #1423 * Update tasks table * 1.19.4 Automatically generated by python-semantic-release * Leaderboard 2.0: added performance x n_parameters plot + more benchmark info (#1437) * Added elementary speed/performance plot * Refactored table formatting code * Bumped Gradio version * Added more general info to benchmark description markdown block * Adjusted margin an range on plot * Made hover information easier to read on plot * Made range scaling dynamic in plot * Moved citation next to benchmark description * Made titles in benchmark info bold * Leaderboard: Fixed code benchmarks (#1441) * fixed code benchmarks * fix: Made n_parameters formatting smarter and more robust * fix: changed jina-embeddings-v3 number of parameters from 572K to 572M * fix: Fixed use_instuctions typo in model overview * fix: Fixed sentence-transformer compatibility switch * Ran linting * Added all languages, tasks, types and domains to options * Removed resetting options when a new benchmark is selected * All results now get displayed, but models that haven't been run on everything get nan values in the table * fix: Count unique texts, data leaks in calculate metrics (#1438) * add more stat * add more stat * update statistics * fix: update task metadata to allow for null (#1448) * Update tasks table * 1.19.5 Automatically generated by python-semantic-release * Fix: Made data parsing in the leaderboard figure more robust (#1450) Bugfixes with data parsing in main figure * Fixed task loading (#1451) * Fixed task result loading from disk * Fixed task result loading from disk * fix: publish (#1452) * 1.19.6 Automatically generated by python-semantic-release * fix: Fix load external results with `None` mteb_version (#1453) * fix * lint * 1.19.7 Automatically generated by python-semantic-release * WIP: Polishing up leaderboard UI (#1461) * fix: Removed column wrapping on the table, so that it remains readable * Added disclaimer to figure * fix: Added links to task info table, switched out license with metric * fix: loading pre 1.11.0 (#1460) * small fix * fix: fix * 1.19.8 Automatically generated by python-semantic-release * fix: swap touche2020 to maintain compatibility (#1469) swap touche2020 for parity * 1.19.9 Automatically generated by python-semantic-release * docs: Add sum per language for task counts (#1468) * add sum per lang * add sort by sum option * make lint * fix: pinned datasets to <3.0.0 (#1470) * 1.19.10 Automatically generated by python-semantic-release * feat: add CUREv1 retrieval dataset (#1459) * feat: add CUREv1 dataset --------- Co-authored-by: nadshe <[email protected]> Co-authored-by: olivierr42 <[email protected]> Co-authored-by: Daniel Buades Marcos <[email protected]> * feat: add missing domains to medical tasks * feat: modify benchmark tasks * chore: benchmark naming --------- Co-authored-by: nadshe <[email protected]> Co-authored-by: olivierr42 <[email protected]> * Update tasks table * 1.20.0 Automatically generated by python-semantic-release * fix: check if `model` attr of model exists (#1499) * check if model attr of model exists * lint * Fix retrieval evaluator * 1.20.1 Automatically generated by python-semantic-release * fix: Leaderboard demo data loading 
(#1507) * Made get_scores error tolerant * Added join_revisions, made get_scores failsafe * Fetching metadata fixed fr HF models * Added failsafe metadata fetching to leaderboard code * Added revision joining to leaderboard app * fix * Only show models that have metadata, when filter_models is called * Ran linting * 1.20.2 Automatically generated by python-semantic-release * fix: leaderboard only shows models that have ModelMeta (#1508) Filtering for models that have metadata * 1.20.3 Automatically generated by python-semantic-release * fix: align readme with current mteb (#1493) * align readme with current mteb * align with mieb branch * fix test * 1.20.4 Automatically generated by python-semantic-release * docs: Add lang family mapping and map to task table (#1486) * add lang family mapping and map to task table * make lint * add back some unclassified lang codes * Update tasks table * fix: Ensure that models match the names on embedding-benchmarks/results (#1519) * 1.20.5 Automatically generated by python-semantic-release * fix: Adding missing metadata on models and mathcing names up with the results repo (#1528) * Added Voyage 3 models * Added correct metadata to Cohere models and matched names with the results repo * 1.20.6 Automatically generated by python-semantic-release * feat: Evaluate missing splits (#1525) * fix: evaluate missing splits (#1268) * implement partial evaluation for missing splits * lint * requested changes done from scratch * test for missing split evaluation added * uncomment test * lint * avoid circular import * use TaskResult * skip tests for now --------- Co-authored-by: Isaac Chung <[email protected]> * got test_all_splits_evaluated passing * tests passing * address review comments * make lint * handle None cases for kg_co2_emissions * use new results info --------- Co-authored-by: Thivyanth <[email protected]> * 1.21.0 Automatically generated by python-semantic-release * fix: Correct typos superseeded -> superseded (#1532) fix typo -> superseded * 1.21.1 A…
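The `feat: Evaluate missing splits (#1525)` entry above is the change this PR tracks. As a rough illustration of the idea only (the helper names below are made up for this sketch and are not mteb's actual implementation): a saved result already covers some splits, only the difference is evaluated, and the two score sets are then merged.

```python
# Hypothetical sketch of missing-split bookkeeping; not mteb's actual code.

def missing_splits(requested: list[str], saved: dict[str, dict]) -> list[str]:
    """Requested splits that have no saved scores yet."""
    return [split for split in requested if split not in saved]

def merge_scores(saved: dict[str, dict], new: dict[str, dict]) -> dict[str, dict]:
    """Keep already-evaluated splits and add the newly evaluated ones."""
    merged = dict(saved)
    merged.update(new)  # new splits are added; existing ones are untouched
    return merged

saved = {"test": {"main_score": 0.61}}  # from an earlier run
requested = ["train", "test"]
print(missing_splits(requested, saved))  # ['train'] -> only this split is run
print(merge_scores(saved, {"train": {"main_score": 0.64}}))
```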
Fixes #1260

`overwrite=True` to work

Example Usage
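A minimal sketch of the intended usage, assuming the public `mteb` API (`mteb.get_model`, `mteb.get_tasks`, and `MTEB.run` with `eval_splits`); the model, task, and split names are illustrative only:

```python
import mteb

# Illustrative model and task; any supported pair works the same way.
model = mteb.get_model("sentence-transformers/all-MiniLM-L6-v2")
tasks = mteb.get_tasks(tasks=["STS12"])
evaluation = mteb.MTEB(tasks=tasks)

# First run: evaluates and saves scores for the "test" split only.
evaluation.run(model, eval_splits=["test"], output_folder="results")

# Second run requests an additional split. Instead of skipping the task or
# demanding a full re-run, only the missing "train" split is evaluated and
# its scores are merged into the saved TaskResult on disk.
evaluation.run(model, eval_splits=["train", "test"], output_folder="results")
```

Under this scheme a rerun that requests no new splits can skip the task entirely, while `overwrite_results=True` remains available to force a full re-evaluation.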
Checklist

* `make test`
* `make lint`