[Draft] Add NanoBEIREvaluator (#76)
* Initial working draft

* Add PyLate similarity function (maxsim) and use it as the default for ColBERT models

* Remove default score function in NanoBEIR evaluator (not needed anymore)

* Remove hardcoded similarity function in the model card template

* Rename files

* Fix circular import and remove duplicate code

* Ruff formatting

* Remove examples including cosine

* Add model_card_template to setup

* Renaming mapping dicts

* Fix docstrings and examples, and extend NanoBEIREvaluator

* Documentation

---------

Co-authored-by: Antoine Chaffin <[email protected]>
NohTow and Antoine Chaffin authored Jan 15, 2025
1 parent 8de1842 commit 7d9d05e
Showing 20 changed files with 919 additions and 82 deletions.
70 changes: 70 additions & 0 deletions docs/api/evaluation/NanoBEIREvaluator.md
@@ -0,0 +1,70 @@
# NanoBEIREvaluator

This class evaluates the performance of a PyLate model on the NanoBEIR collection of datasets. It is a direct extension of the NanoBEIREvaluator from the sentence-transformers library, leveraging the PyLateInformationRetrievalEvaluator class.

The collection is a set of datasets based on the BEIR collection, but with a significantly smaller size, so it can be used to quickly evaluate the retrieval performance of a model before committing to a full evaluation. The datasets are available on Hugging Face at https://huggingface.co/collections/zeta-alpha-ai/nanobeir-66e1a0af21dfd93e620cd9f6. The evaluator returns the same metrics as the InformationRetrievalEvaluator (i.e., MRR, nDCG, Recall@k) for each dataset and on average.

## Examples

```python
>>> from pylate import models, evaluation

>>> model = models.ColBERT(model_name_or_path="lightonai/colbertv2.0")
>>> datasets = ["SciFact"]
>>> evaluator = evaluation.NanoBEIREvaluator(dataset_names=datasets)
>>> results = evaluator(model)
>>> results
{'NanoSciFact_MaxSim_accuracy@1': 0.62,
 'NanoSciFact_MaxSim_accuracy@3': 0.74,
 'NanoSciFact_MaxSim_accuracy@5': 0.8,
 'NanoSciFact_MaxSim_accuracy@10': 0.86,
 'NanoSciFact_MaxSim_precision@1': np.float64(0.62),
 'NanoSciFact_MaxSim_precision@3': np.float64(0.26666666666666666),
 'NanoSciFact_MaxSim_precision@5': np.float64(0.18),
 'NanoSciFact_MaxSim_precision@10': np.float64(0.096),
 'NanoSciFact_MaxSim_recall@1': np.float64(0.595),
 'NanoSciFact_MaxSim_recall@3': np.float64(0.715),
 'NanoSciFact_MaxSim_recall@5': np.float64(0.79),
 'NanoSciFact_MaxSim_recall@10': np.float64(0.85),
 'NanoSciFact_MaxSim_ndcg@10': np.float64(0.7279903941189909),
 'NanoSciFact_MaxSim_mrr@10': 0.6912222222222222,
 'NanoSciFact_MaxSim_map@100': np.float64(0.6903374780806633),
 'NanoBEIR_mean_MaxSim_accuracy@1': np.float64(0.62),
 'NanoBEIR_mean_MaxSim_accuracy@3': np.float64(0.74),
 'NanoBEIR_mean_MaxSim_accuracy@5': np.float64(0.8),
 'NanoBEIR_mean_MaxSim_accuracy@10': np.float64(0.86),
 'NanoBEIR_mean_MaxSim_precision@1': np.float64(0.62),
 'NanoBEIR_mean_MaxSim_precision@3': np.float64(0.26666666666666666),
 'NanoBEIR_mean_MaxSim_precision@5': np.float64(0.18),
 'NanoBEIR_mean_MaxSim_precision@10': np.float64(0.096),
 'NanoBEIR_mean_MaxSim_recall@1': np.float64(0.595),
 'NanoBEIR_mean_MaxSim_recall@3': np.float64(0.715),
 'NanoBEIR_mean_MaxSim_recall@5': np.float64(0.79),
 'NanoBEIR_mean_MaxSim_recall@10': np.float64(0.85),
 'NanoBEIR_mean_MaxSim_ndcg@10': np.float64(0.7279903941189909),
 'NanoBEIR_mean_MaxSim_mrr@10': np.float64(0.6912222222222222),
 'NanoBEIR_mean_MaxSim_map@100': np.float64(0.6903374780806633)}
```

## Parameters

- **dataset_names** (*'list[DatasetNameType] | None'*) – defaults to `None`

- **mrr_at_k** (*'list[int]'*) – defaults to `[10]`

- **ndcg_at_k** (*'list[int]'*) – defaults to `[10]`

- **accuracy_at_k** (*'list[int]'*) – defaults to `[1, 3, 5, 10]`

- **precision_recall_at_k** (*'list[int]'*) – defaults to `[1, 3, 5, 10]`

- **map_at_k** (*'list[int]'*) – defaults to `[100]`

- **show_progress_bar** (*'bool'*) – defaults to `False`

- **batch_size** (*'int'*) – defaults to `32`

- **write_csv** (*'bool'*) – defaults to `True`

- **truncate_dim** (*'int | None'*) – defaults to `None`

- **score_functions** (*'dict[str, Callable[[Tensor, Tensor], Tensor]]'*) – defaults to `None`

- **main_score_function** (*'str | SimilarityFunction | None'*) – defaults to `None`

- **aggregate_fn** (*'Callable[[list[float]], float]'*) – defaults to `mean`

- **aggregate_key** (*'str'*) – defaults to `mean`

- **query_prompts** (*'str | dict[str, str] | None'*) – defaults to `None`

- **corpus_prompts** (*'str | dict[str, str] | None'*) – defaults to `None`
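
The parameters above can also be combined for a customized evaluation run. The following is a minimal sketch assuming only the parameters documented on this page; the dataset list, metric cut-offs, and batch size are illustrative choices, not prescribed values.

```python
from pylate import evaluation, models

# Illustrative configuration: two Nano datasets, custom metric cut-offs,
# and a larger batch size. All parameters used here are documented above.
evaluator = evaluation.NanoBEIREvaluator(
    dataset_names=["SciFact", "NFCorpus"],
    ndcg_at_k=[10, 100],
    accuracy_at_k=[1, 5, 10],
    precision_recall_at_k=[1, 5, 10],
    batch_size=64,
    show_progress_bar=True,
)

model = models.ColBERT(model_name_or_path="lightonai/colbertv2.0")
results = evaluator(model)

# Aggregated keys follow the NanoBEIR_mean_MaxSim_* naming shown in the example above.
print(results["NanoBEIR_mean_MaxSim_ndcg@10"])
```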


## Attributes

- **description**

Returns a human-readable description of the evaluator, e.g., BinaryClassificationEvaluator -> "Binary Classification":

1. Remove "Evaluator" from the class name.
2. Add a space before every capital letter.



## Methods

???- note "__call__"

This is called during training to evaluate the model. It returns a score for the evaluation with a higher score indicating a better result.

Args:

- model: the model to evaluate.
- output_path: path where predictions and metrics are written to.
- epoch: the epoch where the evaluation takes place. This is used for the file prefixes. If this is -1, then we assume evaluation on test data.
- steps: the steps in the current epoch at the time of the evaluation. This is used for the file prefixes. If this is -1, then we assume evaluation at the end of the epoch.

Returns: Either a score for the evaluation with a higher score indicating a better result, or a dictionary with scores. If the latter is chosen, then `evaluator.primary_metric` must be defined.

**Parameters**

- **model** (*'SentenceTransformer'*)
- **output_path** (*'str'*) – defaults to `None`
- **epoch** (*'int'*) – defaults to `-1`
- **steps** (*'int'*) – defaults to `-1`
- **args**
- **kwargs**
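
A short sketch of the training-time call described above; the output path and step counts are illustrative and the target directory is assumed to exist.

```python
from pylate import evaluation, models

model = models.ColBERT(model_name_or_path="lightonai/colbertv2.0")
evaluator = evaluation.NanoBEIREvaluator(dataset_names=["SciFact"])

# Illustrative values: output_path is where the metrics CSV is written when write_csv=True;
# epoch=-1 would indicate evaluation on test data, steps=-1 evaluation at the end of an epoch.
results = evaluator(model, output_path="evaluation/", epoch=1, steps=500)
```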

???- note "prefix_name_to_metrics"

???- note "store_metrics_in_model_card_data"

86 changes: 86 additions & 0 deletions docs/api/evaluation/PyLateInformationRetrievalEvaluator.md
@@ -0,0 +1,86 @@
# PyLateInformationRetrievalEvaluator

This class evaluates an Information Retrieval (IR) setting. It is a direct extension of the InformationRetrievalEvaluator from the sentence-transformers library, only overriding the compute_metrices method to be compatible with PyLate models (defining asymmetric encoding using the is_query parameter and adding padding).
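
A minimal sketch of instantiating the evaluator directly on a toy in-memory corpus, assuming the class is exposed under `pylate.evaluation` alongside the NanoBEIR evaluator; the IDs and texts below are illustrative.

```python
from pylate import evaluation, models

# Toy data: string IDs mapped to raw text, plus the set of relevant document IDs per query.
queries = {"q1": "What is late interaction retrieval?"}
corpus = {
    "d1": "ColBERT performs late interaction retrieval with per-token embeddings.",
    "d2": "BEIR is a heterogeneous benchmark for zero-shot retrieval.",
    "d3": "MaxSim aggregates token-level similarities between queries and documents.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = evaluation.PyLateInformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="toy-ir",
    batch_size=8,
)

model = models.ColBERT(model_name_or_path="lightonai/colbertv2.0")
results = evaluator(model)
```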



## Parameters

- **queries** (*'dict[str, str]'*)

- **corpus** (*'dict[str, str]'*)

- **relevant_docs** (*'dict[str, set[str]]'*)

- **corpus_chunk_size** (*'int'*) – defaults to `50000`

- **mrr_at_k** (*'list[int]'*) – defaults to `[10]`

- **ndcg_at_k** (*'list[int]'*) – defaults to `[10]`

- **accuracy_at_k** (*'list[int]'*) – defaults to `[1, 3, 5, 10]`

- **precision_recall_at_k** (*'list[int]'*) – defaults to `[1, 3, 5, 10]`

- **map_at_k** (*'list[int]'*) – defaults to `[100]`

- **show_progress_bar** (*'bool'*) – defaults to `False`

- **batch_size** (*'int'*) – defaults to `32`

- **name** (*'str'*) – defaults to ``

- **write_csv** (*'bool'*) – defaults to `True`

- **truncate_dim** (*'int | None'*) – defaults to `None`

- **score_functions** (*'dict[str, Callable[[Tensor, Tensor], Tensor]] | None'*) – defaults to `None`

- **main_score_function** (*'str | SimilarityFunction | None'*) – defaults to `None`

- **query_prompt** (*'str | None'*) – defaults to `None`

- **query_prompt_name** (*'str | None'*) – defaults to `None`

- **corpus_prompt** (*'str | None'*) – defaults to `None`

- **corpus_prompt_name** (*'str | None'*) – defaults to `None`


## Attributes

- **description**

Returns a human-readable description of the evaluator, e.g., BinaryClassificationEvaluator -> "Binary Classification":

1. Remove "Evaluator" from the class name.
2. Add a space before every capital letter.



## Methods

???- note "__call__"

This is called during training to evaluate the model. It returns a score for the evaluation with a higher score indicating a better result.

Args:

- model: the model to evaluate.
- output_path: path where predictions and metrics are written to.
- epoch: the epoch where the evaluation takes place. This is used for the file prefixes. If this is -1, then we assume evaluation on test data.
- steps: the steps in the current epoch at the time of the evaluation. This is used for the file prefixes. If this is -1, then we assume evaluation at the end of the epoch.

Returns: Either a score for the evaluation with a higher score indicating a better result, or a dictionary with scores. If the latter is chosen, then `evaluator.primary_metric` must be defined.

**Parameters**

- **model** (*'SentenceTransformer'*)
- **output_path** (*'str'*) – defaults to `None`
- **epoch** (*'int'*) – defaults to `-1`
- **steps** (*'int'*) – defaults to `-1`
- **args**
- **kwargs**

???- note "compute_dcg_at_k"

???- note "compute_metrices"

???- note "compute_metrics"

???- note "output_scores"

???- note "prefix_name_to_metrics"

???- note "store_metrics_in_model_card_data"

1 change: 1 addition & 0 deletions docs/api/hf-hub/.pages
@@ -0,0 +1 @@
title: hf_hub
155 changes: 155 additions & 0 deletions docs/api/hf-hub/PylateModelCardData.md
@@ -0,0 +1,155 @@
# PylateModelCardData

A dataclass for storing data used in the model card.
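
A minimal sketch of populating the card data, assuming the module path mirrors the documentation layout (`pylate.hf_hub`) and that the ColBERT constructor forwards a `model_card_data` argument the way sentence-transformers models do; all values below are illustrative.

```python
from pylate import models
from pylate.hf_hub import PylateModelCardData  # assumed import path, mirroring the docs layout

card_data = PylateModelCardData(
    language="en",
    license="apache-2.0",
    model_name="ColBERT fine-tuned for scientific fact retrieval",
)

# Assumption: ColBERT accepts model_card_data like SentenceTransformer does,
# so the card data is filled in during training and used when pushing to the Hub.
model = models.ColBERT(
    model_name_or_path="lightonai/colbertv2.0",
    model_card_data=card_data,
)
```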



## Parameters

- **language** (*'str | list[str] | None'*) – defaults to `<factory>`

The model language, either a string or a list of strings, e.g., "en" or ["en", "de", "nl"].

- **license** (*'str | None'*) – defaults to `None`

The license of the model, e.g., "apache-2.0", "mit", or "cc-by-nc-sa-4.0".

- **model_name** (*'str | None'*) – defaults to `None`

The pretty name of the model, e.g., "SentenceTransformer based on microsoft/mpnet-base".

- **model_id** (*'str | None'*) – defaults to `None`

The model ID for pushing the model to the Hub, e.g., "tomaarsen/sbert-mpnet-base-allnli".

- **train_datasets** (*'list[dict[str, str]]'*) – defaults to `<factory>`

A list of dictionaries containing names and/or Hugging Face dataset IDs for training datasets, e.g., [{"name": "SNLI", "id": "stanfordnlp/snli"}, {"name": "MultiNLI", "id": "nyu-mll/multi_nli"}, {"name": "STSB"}].

- **eval_datasets** (*'list[dict[str, str]]'*) – defaults to `<factory>`

A list of dictionaries containing names and/or Hugging Face dataset IDs for evaluation datasets, e.g., [{"name": "SNLI", "id": "stanfordnlp/snli"}, {"id": "mteb/stsbenchmark-sts"}].

- **task_name** (*'str'*) – defaults to `semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more`

The human-readable task the model is trained on, e.g., "semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more".

- **tags** (*'list[str] | None'*) – defaults to `<factory>`

A list of tags for the model, e.g., ["sentence-transformers", "sentence-similarity", "feature-extraction"].

- **generate_widget_examples** (*"Literal['deprecated']"*) – defaults to `deprecated`


## Attributes

- **base_model**

- **base_model_revision**

- **best_model_step**

- **code_carbon_callback**

- **license**

- **model**

- **model_id**

- **model_name**

- **predict_example**

- **trainer**



## Methods

???- note "add_tags"

???- note "compute_dataset_metrics"

Given a dataset, compute the following:

- Dataset Size
- Dataset Columns
- Dataset Stats
    - Strings: min, mean, max word count/token length
    - Integers: Counter() instance
    - Floats: min, mean, max range
    - List: number of elements or min, mean, max number of elements
- 3 Example samples
- Loss function name
    - Loss function config

**Parameters**

- **dataset** (*'Dataset | IterableDataset | None'*)
- **dataset_info** (*'dict[str, Any]'*)
- **loss** (*'dict[str, nn.Module] | nn.Module | None'*)

???- note "extract_dataset_metadata"

???- note "format_eval_metrics"

Format the evaluation metrics for the model card.

The following keys will be returned:

- eval_metrics: A list of dictionaries containing the class name, description, dataset name, and a markdown table. This is used to display the evaluation metrics in the model card.
- metrics: A list of all metric keys. This is used in the model card metadata.
- model-index: A list of dictionaries containing the task name, task type, dataset type, dataset name, metric name, metric type, and metric value. This is used to display the evaluation metrics in the model card metadata.


???- note "format_training_logs"

???- note "get"

Get value for a given metadata key.

**Parameters**

- **key** (*str*)
- **default** (*Any*) – defaults to `None`

???- note "get_codecarbon_data"

???- note "infer_datasets"

???- note "pop"

Pop value for a given metadata key.

**Parameters**

- **key** (*str*)
- **default** (*Any*) – defaults to `None`

???- note "register_model"

???- note "set_base_model"

???- note "set_best_model_step"

???- note "set_evaluation_metrics"

???- note "set_label_examples"

???- note "set_language"

???- note "set_license"

???- note "set_losses"

???- note "set_model_id"

???- note "set_widget_examples"

???- note "to_dict"

Converts CardData to a dict.

Returns: `dict`: CardData represented as a dictionary ready to be dumped to a YAML block for inclusion in a README.md file.


???- note "to_yaml"

Dumps CardData to a YAML block for inclusion in a README.md file.

Args:

- line_break (str, *optional*): The line break to use when dumping to YAML.

Returns: `str`: CardData represented as a YAML block.

**Parameters**

- **line_break** – defaults to `None`

???- note "try_to_set_base_model"

???- note "validate_datasets"

4 changes: 2 additions & 2 deletions docs/api/losses/Contrastive.md
@@ -10,7 +10,7 @@ Contrastive loss. Expects as input two texts and a label of either 0 or 1. If th

ColBERT model.

- **score_metric** – defaults to `<function colbert_scores at 0x7fc43e97f7e0>`
- **score_metric** – defaults to `<function colbert_scores at 0x7fe76c02f240>`

ColBERT scoring function. Defaults to colbert_scores.

@@ -228,7 +228,7 @@ Contrastive loss. Expects as input two texts and a label of either 0 or 1. If th

Copy parameters and buffers from :attr:`state_dict` into this module and its descendants.

If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing any keys that are expected by this module but missing from the provided ``state_dict``. * **unexpected_keys** is a list of str containing the keys that are not expected by this module but present in the provided ``state_dict``. Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``.
If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing the missing keys * **unexpected_keys** is a list of str containing the unexpected keys Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``.

**Parameters**
