[Draft] Add NanoBEIREvaluator (#76)
* Initial working draft

* Add PyLate similarity function (maxsim) and use it as the default for ColBERT models

* Remove default score function in NanoBEIR evaluator (not needed anymore)

* Remove hardcoded similarity function in the model card template

* Rename files

* Fix circular import and remove duplicate code

* Ruff formatting

* Remove examples including cosine

* Add model_card_template to setup

* Renaming mapping dicts

* Fix docstrings and examples, and extend NanoBEIREvaluator

* Documentation

---------

Co-authored-by: Antoine Chaffin <[email protected]>
NohTow and Antoine Chaffin authored Jan 15, 2025
1 parent 8de1842 commit 7d9d05e
Showing 20 changed files with 919 additions and 82 deletions.
70 changes: 70 additions & 0 deletions docs/api/evaluation/NanoBEIREvaluator.md
@@ -0,0 +1,70 @@
# NanoBEIREvaluator

This class evaluates the performance of a PyLate model on the NanoBEIR collection of datasets. It is a direct extension of the NanoBEIREvaluator from the sentence-transformers library, leveraging the PyLateInformationRetrievalEvaluator class.

The collection is a set of datasets based on the BEIR collection, but with a significantly smaller size, so it can be used to quickly evaluate the retrieval performance of a model before committing to a full evaluation. The datasets are available on Hugging Face at https://huggingface.co/collections/zeta-alpha-ai/nanobeir-66e1a0af21dfd93e620cd9f6. The evaluator returns the same metrics as the InformationRetrievalEvaluator (i.e., MRR, nDCG, Recall@k) for each dataset and on average.

## Examples

```python
>>> from pylate import models, evaluation

>>> model = models.ColBERT(model_name_or_path="lightonai/colbertv2.0")
>>> datasets = ["SciFact"]
>>> evaluator = evaluation.NanoBEIREvaluator(dataset_names=datasets)
>>> results = evaluator(model)
>>> results
{'NanoSciFact_MaxSim_accuracy@1': 0.62,
 'NanoSciFact_MaxSim_accuracy@3': 0.74,
 'NanoSciFact_MaxSim_accuracy@5': 0.8,
 'NanoSciFact_MaxSim_accuracy@10': 0.86,
 'NanoSciFact_MaxSim_precision@1': np.float64(0.62),
 'NanoSciFact_MaxSim_precision@3': np.float64(0.26666666666666666),
 'NanoSciFact_MaxSim_precision@5': np.float64(0.18),
 'NanoSciFact_MaxSim_precision@10': np.float64(0.096),
 'NanoSciFact_MaxSim_recall@1': np.float64(0.595),
 'NanoSciFact_MaxSim_recall@3': np.float64(0.715),
 'NanoSciFact_MaxSim_recall@5': np.float64(0.79),
 'NanoSciFact_MaxSim_recall@10': np.float64(0.85),
 'NanoSciFact_MaxSim_ndcg@10': np.float64(0.7279903941189909),
 'NanoSciFact_MaxSim_mrr@10': 0.6912222222222222,
 'NanoSciFact_MaxSim_map@100': np.float64(0.6903374780806633),
 'NanoBEIR_mean_MaxSim_accuracy@1': np.float64(0.62),
 'NanoBEIR_mean_MaxSim_accuracy@3': np.float64(0.74),
 'NanoBEIR_mean_MaxSim_accuracy@5': np.float64(0.8),
 'NanoBEIR_mean_MaxSim_accuracy@10': np.float64(0.86),
 'NanoBEIR_mean_MaxSim_precision@1': np.float64(0.62),
 'NanoBEIR_mean_MaxSim_precision@3': np.float64(0.26666666666666666),
 'NanoBEIR_mean_MaxSim_precision@5': np.float64(0.18),
 'NanoBEIR_mean_MaxSim_precision@10': np.float64(0.096),
 'NanoBEIR_mean_MaxSim_recall@1': np.float64(0.595),
 'NanoBEIR_mean_MaxSim_recall@3': np.float64(0.715),
 'NanoBEIR_mean_MaxSim_recall@5': np.float64(0.79),
 'NanoBEIR_mean_MaxSim_recall@10': np.float64(0.85),
 'NanoBEIR_mean_MaxSim_ndcg@10': np.float64(0.7279903941189909),
 'NanoBEIR_mean_MaxSim_mrr@10': np.float64(0.6912222222222222),
 'NanoBEIR_mean_MaxSim_map@100': np.float64(0.6903374780806633)}
```

## Parameters

- **dataset_names** (*'list[DatasetNameType] | None'*) – defaults to `None`

- **mrr_at_k** (*'list[int]'*) – defaults to `[10]`

- **ndcg_at_k** (*'list[int]'*) – defaults to `[10]`

- **accuracy_at_k** (*'list[int]'*) – defaults to `[1, 3, 5, 10]`

- **precision_recall_at_k** (*'list[int]'*) – defaults to `[1, 3, 5, 10]`

- **map_at_k** (*'list[int]'*) – defaults to `[100]`

- **show_progress_bar** (*'bool'*) – defaults to `False`

- **batch_size** (*'int'*) – defaults to `32`

- **write_csv** (*'bool'*) – defaults to `True`

- **truncate_dim** (*'int | None'*) – defaults to `None`

- **score_functions** (*'dict[str, Callable[[Tensor, Tensor], Tensor]]'*) – defaults to `None`

- **main_score_function** (*'str | SimilarityFunction | None'*) – defaults to `None`

- **aggregate_fn** (*'Callable[[list[float]], float]'*) – defaults to `mean`

- **aggregate_key** (*'str'*) – defaults to `mean`

- **query_prompts** (*'str | dict[str, str] | None'*) – defaults to `None`

- **corpus_prompts** (*'str | dict[str, str] | None'*) – defaults to `None`
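
The parameters above can also be combined for a customized evaluation run. The following is a minimal sketch assuming only the parameters documented on this page; the dataset list, metric cut-offs, and batch size are illustrative choices, not prescribed values.

```python
from pylate import evaluation, models

# Illustrative configuration: two Nano datasets, custom metric cut-offs,
# and a larger batch size. All parameters used here are documented above.
evaluator = evaluation.NanoBEIREvaluator(
    dataset_names=["SciFact", "NFCorpus"],
    ndcg_at_k=[10, 100],
    accuracy_at_k=[1, 5, 10],
    precision_recall_at_k=[1, 5, 10],
    batch_size=64,
    show_progress_bar=True,
)

model = models.ColBERT(model_name_or_path="lightonai/colbertv2.0")
results = evaluator(model)

# Aggregated keys follow the NanoBEIR_mean_MaxSim_* naming shown in the example above.
print(results["NanoBEIR_mean_MaxSim_ndcg@10"])
```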


## Attributes

- **description**

Returns a human-readable description of the evaluator, e.g., BinaryClassificationEvaluator -> "Binary Classification":

1. Remove "Evaluator" from the class name.
2. Add a space before every capital letter.



## Methods

???- note "__call__"

This is called during training to evaluate the model. It returns a score for the evaluation with a higher score indicating a better result.

Args:

- model: the model to evaluate.
- output_path: path where predictions and metrics are written to.
- epoch: the epoch where the evaluation takes place. This is used for the file prefixes. If this is -1, then we assume evaluation on test data.
- steps: the steps in the current epoch at the time of the evaluation. This is used for the file prefixes. If this is -1, then we assume evaluation at the end of the epoch.

Returns: Either a score for the evaluation with a higher score indicating a better result, or a dictionary with scores. If the latter is chosen, then `evaluator.primary_metric` must be defined.

**Parameters**

- **model** (*'SentenceTransformer'*)
- **output_path** (*'str'*) – defaults to `None`
- **epoch** (*'int'*) – defaults to `-1`
- **steps** (*'int'*) – defaults to `-1`
- **args**
- **kwargs**
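
A short sketch of the training-time call described above; the output path and step counts are illustrative and the target directory is assumed to exist.

```python
from pylate import evaluation, models

model = models.ColBERT(model_name_or_path="lightonai/colbertv2.0")
evaluator = evaluation.NanoBEIREvaluator(dataset_names=["SciFact"])

# Illustrative values: output_path is where the metrics CSV is written when write_csv=True;
# epoch=-1 would indicate evaluation on test data, steps=-1 evaluation at the end of an epoch.
results = evaluator(model, output_path="evaluation/", epoch=1, steps=500)
```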

???- note "prefix_name_to_metrics"

???- note "store_metrics_in_model_card_data"

86 changes: 86 additions & 0 deletions docs/api/evaluation/PyLateInformationRetrievalEvaluator.md
@@ -0,0 +1,86 @@
# PyLateInformationRetrievalEvaluator

This class evaluates an Information Retrieval (IR) setting. It is a direct extension of the InformationRetrievalEvaluator from the sentence-transformers library, only overriding the compute_metrices method to be compatible with PyLate models (defining asymmetric encoding using the is_query parameter and adding padding).
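
A minimal sketch of instantiating the evaluator directly on a toy in-memory corpus, assuming the class is exposed under `pylate.evaluation` alongside the NanoBEIR evaluator; the IDs and texts below are illustrative.

```python
from pylate import evaluation, models

# Toy data: string IDs mapped to raw text, plus the set of relevant document IDs per query.
queries = {"q1": "What is late interaction retrieval?"}
corpus = {
    "d1": "ColBERT performs late interaction retrieval with per-token embeddings.",
    "d2": "BEIR is a heterogeneous benchmark for zero-shot retrieval.",
    "d3": "MaxSim aggregates token-level similarities between queries and documents.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = evaluation.PyLateInformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="toy-ir",
    batch_size=8,
)

model = models.ColBERT(model_name_or_path="lightonai/colbertv2.0")
results = evaluator(model)
```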



## Parameters

- **queries** (*'dict[str, str]'*)

- **corpus** (*'dict[str, str]'*)

- **relevant_docs** (*'dict[str, set[str]]'*)

- **corpus_chunk_size** (*'int'*) – defaults to `50000`

- **mrr_at_k** (*'list[int]'*) – defaults to `[10]`

- **ndcg_at_k** (*'list[int]'*) – defaults to `[10]`

- **accuracy_at_k** (*'list[int]'*) – defaults to `[1, 3, 5, 10]`

- **precision_recall_at_k** (*'list[int]'*) – defaults to `[1, 3, 5, 10]`

- **map_at_k** (*'list[int]'*) – defaults to `[100]`

- **show_progress_bar** (*'bool'*) – defaults to `False`

- **batch_size** (*'int'*) – defaults to `32`

- **name** (*'str'*) – defaults to ``

- **write_csv** (*'bool'*) – defaults to `True`

- **truncate_dim** (*'int | None'*) – defaults to `None`

- **score_functions** (*'dict[str, Callable[[Tensor, Tensor], Tensor]] | None'*) – defaults to `None`

- **main_score_function** (*'str | SimilarityFunction | None'*) – defaults to `None`

- **query_prompt** (*'str | None'*) – defaults to `None`

- **query_prompt_name** (*'str | None'*) – defaults to `None`

- **corpus_prompt** (*'str | None'*) – defaults to `None`

- **corpus_prompt_name** (*'str | None'*) – defaults to `None`


## Attributes

- **description**

Returns a human-readable description of the evaluator, e.g., BinaryClassificationEvaluator -> "Binary Classification":

1. Remove "Evaluator" from the class name.
2. Add a space before every capital letter.



## Methods

???- note "__call__"

This is called during training to evaluate the model. It returns a score for the evaluation with a higher score indicating a better result.

Args:

- model: the model to evaluate.
- output_path: path where predictions and metrics are written to.
- epoch: the epoch where the evaluation takes place. This is used for the file prefixes. If this is -1, then we assume evaluation on test data.
- steps: the steps in the current epoch at the time of the evaluation. This is used for the file prefixes. If this is -1, then we assume evaluation at the end of the epoch.

Returns: Either a score for the evaluation with a higher score indicating a better result, or a dictionary with scores. If the latter is chosen, then `evaluator.primary_metric` must be defined.

**Parameters**

- **model** (*'SentenceTransformer'*)
- **output_path** (*'str'*) – defaults to `None`
- **epoch** (*'int'*) – defaults to `-1`
- **steps** (*'int'*) – defaults to `-1`
- **args**
- **kwargs**

???- note "compute_dcg_at_k"

???- note "compute_metrices"

???- note "compute_metrics"

???- note "output_scores"

???- note "prefix_name_to_metrics"

???- note "store_metrics_in_model_card_data"

1 change: 1 addition & 0 deletions docs/api/hf-hub/.pages
@@ -0,0 +1 @@
title: hf_hub
155 changes: 155 additions & 0 deletions docs/api/hf-hub/PylateModelCardData.md
@@ -0,0 +1,155 @@
# PylateModelCardData

A dataclass for storing data used in the model card.
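
A minimal sketch of populating the card data, assuming the module path mirrors the documentation layout (`pylate.hf_hub`) and that the ColBERT constructor forwards a `model_card_data` argument the way sentence-transformers models do; all values below are illustrative.

```python
from pylate import models
from pylate.hf_hub import PylateModelCardData  # assumed import path, mirroring the docs layout

card_data = PylateModelCardData(
    language="en",
    license="apache-2.0",
    model_name="ColBERT fine-tuned for scientific fact retrieval",
)

# Assumption: ColBERT accepts model_card_data like SentenceTransformer does,
# so the card data is filled in during training and used when pushing to the Hub.
model = models.ColBERT(
    model_name_or_path="lightonai/colbertv2.0",
    model_card_data=card_data,
)
```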



## Parameters

- **language** (*'str | list[str] | None'*) – defaults to `<factory>`

The model language, either a string or a list of strings, e.g., "en" or ["en", "de", "nl"].

- **license** (*'str | None'*) – defaults to `None`

The license of the model, e.g., "apache-2.0", "mit", or "cc-by-nc-sa-4.0".

- **model_name** (*'str | None'*) – defaults to `None`

The pretty name of the model, e.g., "SentenceTransformer based on microsoft/mpnet-base".

- **model_id** (*'str | None'*) – defaults to `None`

The model ID for pushing the model to the Hub, e.g., "tomaarsen/sbert-mpnet-base-allnli".

- **train_datasets** (*'list[dict[str, str]]'*) – defaults to `<factory>`

A list of dictionaries containing names and/or Hugging Face dataset IDs for training datasets, e.g., [{"name": "SNLI", "id": "stanfordnlp/snli"}, {"name": "MultiNLI", "id": "nyu-mll/multi_nli"}, {"name": "STSB"}].

- **eval_datasets** (*'list[dict[str, str]]'*) – defaults to `<factory>`

A list of dictionaries containing names and/or Hugging Face dataset IDs for evaluation datasets, e.g., [{"name": "SNLI", "id": "stanfordnlp/snli"}, {"id": "mteb/stsbenchmark-sts"}].

- **task_name** (*'str'*) – defaults to `semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more`

The human-readable task the model is trained on, e.g., "semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more".

- **tags** (*'list[str] | None'*) – defaults to `<factory>`

A list of tags for the model, e.g., ["sentence-transformers", "sentence-similarity", "feature-extraction"].

- **generate_widget_examples** (*"Literal['deprecated']"*) – defaults to `deprecated`


## Attributes

- **base_model**

- **base_model_revision**

- **best_model_step**

- **code_carbon_callback**

- **license**

- **model**

- **model_id**

- **model_name**

- **predict_example**

- **trainer**



## Methods

???- note "add_tags"

???- note "compute_dataset_metrics"

Given a dataset, compute the following:

- Dataset Size
- Dataset Columns
- Dataset Stats
    - Strings: min, mean, max word count/token length
    - Integers: Counter() instance
    - Floats: min, mean, max range
    - List: number of elements or min, mean, max number of elements
- 3 Example samples
- Loss function name
    - Loss function config

**Parameters**

- **dataset** (*'Dataset | IterableDataset | None'*)
- **dataset_info** (*'dict[str, Any]'*)
- **loss** (*'dict[str, nn.Module] | nn.Module | None'*)

???- note "extract_dataset_metadata"

???- note "format_eval_metrics"

Format the evaluation metrics for the model card.

The following keys will be returned:

- eval_metrics: A list of dictionaries containing the class name, description, dataset name, and a markdown table. This is used to display the evaluation metrics in the model card.
- metrics: A list of all metric keys. This is used in the model card metadata.
- model-index: A list of dictionaries containing the task name, task type, dataset type, dataset name, metric name, metric type, and metric value. This is used to display the evaluation metrics in the model card metadata.


???- note "format_training_logs"

???- note "get"

Get value for a given metadata key.

**Parameters**

- **key** (*str*)
- **default** (*Any*) – defaults to `None`

???- note "get_codecarbon_data"

???- note "infer_datasets"

???- note "pop"

Pop value for a given metadata key.

**Parameters**

- **key** (*str*)
- **default** (*Any*) – defaults to `None`

???- note "register_model"

???- note "set_base_model"

???- note "set_best_model_step"

???- note "set_evaluation_metrics"

???- note "set_label_examples"

???- note "set_language"

???- note "set_license"

???- note "set_losses"

???- note "set_model_id"

???- note "set_widget_examples"

???- note "to_dict"

Converts CardData to a dict.

Returns: `dict`: CardData represented as a dictionary ready to be dumped to a YAML block for inclusion in a README.md file.


???- note "to_yaml"

Dumps CardData to a YAML block for inclusion in a README.md file.

Args:

- line_break (str, *optional*): The line break to use when dumping to YAML.

Returns: `str`: CardData represented as a YAML block.

**Parameters**

- **line_break** – defaults to `None`

???- note "try_to_set_base_model"

???- note "validate_datasets"

4 changes: 2 additions & 2 deletions docs/api/losses/Contrastive.md
@@ -10,7 +10,7 @@ Contrastive loss. Expects as input two texts and a label of either 0 or 1. If th

ColBERT model.

- **score_metric** – defaults to `<function colbert_scores at 0x7fc43e97f7e0>`
- **score_metric** – defaults to `<function colbert_scores at 0x7fe76c02f240>`

ColBERT scoring function. Defaults to colbert_scores.

@@ -228,7 +228,7 @@ Contrastive loss. Expects as input two texts and a label of either 0 or 1. If th

Copy parameters and buffers from :attr:`state_dict` into this module and its descendants.

If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing any keys that are expected by this module but missing from the provided ``state_dict``. * **unexpected_keys** is a list of str containing the keys that are not expected by this module but present in the provided ``state_dict``. Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``.
If :attr:`strict` is ``True``, then the keys of :attr:`state_dict` must exactly match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. .. warning:: If :attr:`assign` is ``True`` the optimizer must be created after the call to :attr:`load_state_dict` unless :func:`~torch.__future__.get_swap_module_params_on_conversion` is ``True``. Args: state_dict (dict): a dict containing parameters and persistent buffers. strict (bool, optional): whether to strictly enforce that the keys in :attr:`state_dict` match the keys returned by this module's :meth:`~torch.nn.Module.state_dict` function. Default: ``True`` assign (bool, optional): When ``False``, the properties of the tensors in the current module are preserved while when ``True``, the properties of the Tensors in the state dict are preserved. The only exception is the ``requires_grad`` field of :class:`~torch.nn.Parameter`s for which the value from the module is preserved. Default: ``False`` Returns: ``NamedTuple`` with ``missing_keys`` and ``unexpected_keys`` fields: * **missing_keys** is a list of str containing the missing keys * **unexpected_keys** is a list of str containing the unexpected keys Note: If a parameter or buffer is registered as ``None`` and its corresponding key exists in :attr:`state_dict`, :meth:`load_state_dict` will raise a ``RuntimeError``.

**Parameters**
