Commit
* Initial working draft
* Adding PyLate similarity function (maxsim) and using it as the default for ColBERT models
* Remove default score function in NanoBEIR evaluator (not needed anymore)
* Remove hardcoded similarity function in the model card template
* Rename files
* Fix circular import and remove duplicate code
* Ruff formatting
* Remove examples including cosine
* Add model_card_template to setup
* Rename mapping dicts
* Fix docstrings and examples, and extend NanoBEIREvaluator
* Documentation

Co-authored-by: Antoine Chaffin <[email protected]>
Showing 20 changed files with 919 additions and 82 deletions.
# NanoBEIREvaluator

This class evaluates the performance of a PyLate model on the NanoBEIR collection of datasets. It is a direct extension of the NanoBEIREvaluator from the sentence-transformers library, leveraging the PyLateInformationRetrievalEvaluator class.

The collection is a set of datasets based on the BEIR collection, but with a significantly smaller size, so it can be used to quickly evaluate the retrieval performance of a model before committing to a full evaluation. The datasets are available on Hugging Face at https://huggingface.co/collections/zeta-alpha-ai/nanobeir-66e1a0af21dfd93e620cd9f6. The evaluator returns the same metrics as the InformationRetrievalEvaluator (i.e., MRR, nDCG, Recall@k), for each dataset and on average.

## Examples

```python
>>> from pylate import models, evaluation

>>> model = models.ColBERT(model_name_or_path="lightonai/colbertv2.0")

>>> datasets = ["SciFact"]

>>> evaluator = evaluation.NanoBEIREvaluator(dataset_names=datasets)

>>> results = evaluator(model)

>>> results
{'NanoSciFact_MaxSim_accuracy@1': 0.62,
 'NanoSciFact_MaxSim_accuracy@3': 0.74,
 'NanoSciFact_MaxSim_accuracy@5': 0.8,
 'NanoSciFact_MaxSim_accuracy@10': 0.86,
 'NanoSciFact_MaxSim_precision@1': np.float64(0.62),
 'NanoSciFact_MaxSim_precision@3': np.float64(0.26666666666666666),
 'NanoSciFact_MaxSim_precision@5': np.float64(0.18),
 'NanoSciFact_MaxSim_precision@10': np.float64(0.096),
 'NanoSciFact_MaxSim_recall@1': np.float64(0.595),
 'NanoSciFact_MaxSim_recall@3': np.float64(0.715),
 'NanoSciFact_MaxSim_recall@5': np.float64(0.79),
 'NanoSciFact_MaxSim_recall@10': np.float64(0.85),
 'NanoSciFact_MaxSim_ndcg@10': np.float64(0.7279903941189909),
 'NanoSciFact_MaxSim_mrr@10': 0.6912222222222222,
 'NanoSciFact_MaxSim_map@100': np.float64(0.6903374780806633),
 'NanoBEIR_mean_MaxSim_accuracy@1': np.float64(0.62),
 'NanoBEIR_mean_MaxSim_accuracy@3': np.float64(0.74),
 'NanoBEIR_mean_MaxSim_accuracy@5': np.float64(0.8),
 'NanoBEIR_mean_MaxSim_accuracy@10': np.float64(0.86),
 'NanoBEIR_mean_MaxSim_precision@1': np.float64(0.62),
 'NanoBEIR_mean_MaxSim_precision@3': np.float64(0.26666666666666666),
 'NanoBEIR_mean_MaxSim_precision@5': np.float64(0.18),
 'NanoBEIR_mean_MaxSim_precision@10': np.float64(0.096),
 'NanoBEIR_mean_MaxSim_recall@1': np.float64(0.595),
 'NanoBEIR_mean_MaxSim_recall@3': np.float64(0.715),
 'NanoBEIR_mean_MaxSim_recall@5': np.float64(0.79),
 'NanoBEIR_mean_MaxSim_recall@10': np.float64(0.85),
 'NanoBEIR_mean_MaxSim_ndcg@10': np.float64(0.7279903941189909),
 'NanoBEIR_mean_MaxSim_mrr@10': np.float64(0.6912222222222222),
 'NanoBEIR_mean_MaxSim_map@100': np.float64(0.6903374780806633)}
```
## Parameters

- **dataset_names** (*'list[DatasetNameType] | None'*) – defaults to `None`
- **mrr_at_k** (*'list[int]'*) – defaults to `[10]`
- **ndcg_at_k** (*'list[int]'*) – defaults to `[10]`
- **accuracy_at_k** (*'list[int]'*) – defaults to `[1, 3, 5, 10]`
- **precision_recall_at_k** (*'list[int]'*) – defaults to `[1, 3, 5, 10]`
- **map_at_k** (*'list[int]'*) – defaults to `[100]`
- **show_progress_bar** (*'bool'*) – defaults to `False`
- **batch_size** (*'int'*) – defaults to `32`
- **write_csv** (*'bool'*) – defaults to `True`
- **truncate_dim** (*'int | None'*) – defaults to `None`
- **score_functions** (*'dict[str, Callable[[Tensor, Tensor], Tensor]]'*) – defaults to `None`
- **main_score_function** (*'str | SimilarityFunction | None'*) – defaults to `None`
- **aggregate_fn** (*'Callable[[list[float]], float]'*) – defaults to `np.mean`
- **aggregate_key** (*'str'*) – defaults to `mean`
- **query_prompts** (*'str | dict[str, str] | None'*) – defaults to `None`
- **corpus_prompts** (*'str | dict[str, str] | None'*) – defaults to `None`
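All of these arguments are optional. As a quick, hypothetical sketch (not taken from the PyLate documentation), the snippet below overrides a few of the defaults above to evaluate on a subset of NanoBEIR; the dataset names, cutoffs, and batch size are illustrative.

```python
from pylate import evaluation, models

model = models.ColBERT(model_name_or_path="lightonai/colbertv2.0")

# Evaluate on two NanoBEIR datasets with custom cutoffs (illustrative values);
# leaving dataset_names=None would evaluate on every NanoBEIR dataset.
evaluator = evaluation.NanoBEIREvaluator(
    dataset_names=["SciFact", "NFCorpus"],
    ndcg_at_k=[10, 100],
    precision_recall_at_k=[1, 10],
    batch_size=16,
    show_progress_bar=True,
)

results = evaluator(model)
# Metric keys follow the pattern shown in the example above.
print(results["NanoBEIR_mean_MaxSim_ndcg@10"])
```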
## Attributes

- **description**

    Returns a human-readable description of the evaluator, e.g., BinaryClassificationEvaluator -> "Binary Classification":

    1. Remove "Evaluator" from the class name
    2. Add a space before every capital letter
## Methods

???- note "__call__"

    This is called during training to evaluate the model. It returns a score for the evaluation, with a higher score indicating a better result.

    Args:

    - model: the model to evaluate.
    - output_path: path where predictions and metrics are written to.
    - epoch: the epoch where the evaluation takes place. This is used for the file prefixes. If this is -1, then we assume evaluation on test data.
    - steps: the steps in the current epoch at the time of the evaluation. This is used for the file prefixes. If this is -1, then we assume evaluation at the end of the epoch.

    Returns: either a score for the evaluation, with a higher score indicating a better result, or a dictionary with scores. If the latter is chosen, then `evaluator.primary_metric` must be defined.

    **Parameters**

    - **model** (*'SentenceTransformer'*)
    - **output_path** (*'str'*) – defaults to `None`
    - **epoch** (*'int'*) – defaults to `-1`
    - **steps** (*'int'*) – defaults to `-1`
    - **args**
    - **kwargs**

???- note "prefix_name_to_metrics"

???- note "store_metrics_in_model_card_data"
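During training, the trainer calls the evaluator through `__call__` with the arguments listed above. Continuing the sketch from the Parameters section, a hypothetical direct invocation could look like this (the output path and counters are placeholders, and `primary_metric` is assumed to be populated as in the sentence-transformers base evaluator):

```python
# Mirrors what a training loop does at an evaluation step: with write_csv=True
# (the default) and an output_path, metrics are written to a CSV whose file
# prefix is derived from the epoch and steps values.
metrics = evaluator(
    model,
    output_path="eval_output/",  # placeholder directory
    epoch=1,
    steps=500,
)

# A higher value of the primary metric indicates a better checkpoint.
print(evaluator.primary_metric, metrics[evaluator.primary_metric])
```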
docs/api/evaluation/PyLateInformationRetrievalEvaluator.md (86 additions)
# PyLateInformationRetrievalEvaluator

This class evaluates an Information Retrieval (IR) setting. It is a direct extension of the InformationRetrievalEvaluator from the sentence-transformers library, only overriding the compute_metrices method to be compatible with PyLate models (defining asymmetric encoding using the is_query parameter and adding padding).
## Parameters

- **queries** (*'dict[str, str]'*)
- **corpus** (*'dict[str, str]'*)
- **relevant_docs** (*'dict[str, set[str]]'*)
- **corpus_chunk_size** (*'int'*) – defaults to `50000`
- **mrr_at_k** (*'list[int]'*) – defaults to `[10]`
- **ndcg_at_k** (*'list[int]'*) – defaults to `[10]`
- **accuracy_at_k** (*'list[int]'*) – defaults to `[1, 3, 5, 10]`
- **precision_recall_at_k** (*'list[int]'*) – defaults to `[1, 3, 5, 10]`
- **map_at_k** (*'list[int]'*) – defaults to `[100]`
- **show_progress_bar** (*'bool'*) – defaults to `False`
- **batch_size** (*'int'*) – defaults to `32`
- **name** (*'str'*) – defaults to `""`
- **write_csv** (*'bool'*) – defaults to `True`
- **truncate_dim** (*'int | None'*) – defaults to `None`
- **score_functions** (*'dict[str, Callable[[Tensor, Tensor], Tensor]] | None'*) – defaults to `None`
- **main_score_function** (*'str | SimilarityFunction | None'*) – defaults to `None`
- **query_prompt** (*'str | None'*) – defaults to `None`
- **query_prompt_name** (*'str | None'*) – defaults to `None`
- **corpus_prompt** (*'str | None'*) – defaults to `None`
- **corpus_prompt_name** (*'str | None'*) – defaults to `None`
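The queries, corpus, and relevant_docs arguments use the same shapes as in sentence-transformers: query ID to query text, document ID to document text, and query ID to the set of relevant document IDs. Below is a minimal, hypothetical sketch assuming the class is exposed under `pylate.evaluation` (as the NanoBEIR example above suggests); the IDs and texts are placeholders.

```python
from pylate import evaluation, models

model = models.ColBERT(model_name_or_path="lightonai/colbertv2.0")

# Toy data illustrating the expected shapes (placeholder IDs and texts).
queries = {"q1": "What is late interaction in ColBERT?"}
corpus = {
    "d1": "ColBERT scores queries against documents with a MaxSim late-interaction operator.",
    "d2": "BM25 is a classical lexical retrieval function.",
}
relevant_docs = {"q1": {"d1"}}

ir_evaluator = evaluation.PyLateInformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="toy-ir",  # prefixes the reported metric names and the CSV file
)

results = ir_evaluator(model)
```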
## Attributes

- **description**

    Returns a human-readable description of the evaluator, e.g., BinaryClassificationEvaluator -> "Binary Classification":

    1. Remove "Evaluator" from the class name
    2. Add a space before every capital letter
## Methods

???- note "__call__"

    This is called during training to evaluate the model. It returns a score for the evaluation, with a higher score indicating a better result.

    Args:

    - model: the model to evaluate.
    - output_path: path where predictions and metrics are written to.
    - epoch: the epoch where the evaluation takes place. This is used for the file prefixes. If this is -1, then we assume evaluation on test data.
    - steps: the steps in the current epoch at the time of the evaluation. This is used for the file prefixes. If this is -1, then we assume evaluation at the end of the epoch.

    Returns: either a score for the evaluation, with a higher score indicating a better result, or a dictionary with scores. If the latter is chosen, then `evaluator.primary_metric` must be defined.

    **Parameters**

    - **model** (*'SentenceTransformer'*)
    - **output_path** (*'str'*) – defaults to `None`
    - **epoch** (*'int'*) – defaults to `-1`
    - **steps** (*'int'*) – defaults to `-1`
    - **args**
    - **kwargs**

???- note "compute_dcg_at_k"

???- note "compute_metrices"

???- note "compute_metrics"

???- note "output_scores"

???- note "prefix_name_to_metrics"

???- note "store_metrics_in_model_card_data"
title: hf_hub
# PylateModelCardData

A dataclass for storing data used in the model card.
## Parameters

- **language** (*'str | list[str] | None'*) – defaults to `<factory>`

    The model language, either a string or a list of strings, e.g., "en" or ["en", "de", "nl"].

- **license** (*'str | None'*) – defaults to `None`

    The license of the model, e.g., "apache-2.0", "mit", or "cc-by-nc-sa-4.0".

- **model_name** (*'str | None'*) – defaults to `None`

    The pretty name of the model, e.g., "SentenceTransformer based on microsoft/mpnet-base".

- **model_id** (*'str | None'*) – defaults to `None`

    The model ID for pushing the model to the Hub, e.g., "tomaarsen/sbert-mpnet-base-allnli".

- **train_datasets** (*'list[dict[str, str]]'*) – defaults to `<factory>`

    A list of dictionaries containing names and/or Hugging Face dataset IDs for training datasets, e.g., [{"name": "SNLI", "id": "stanfordnlp/snli"}, {"name": "MultiNLI", "id": "nyu-mll/multi_nli"}, {"name": "STSB"}].

- **eval_datasets** (*'list[dict[str, str]]'*) – defaults to `<factory>`

    A list of dictionaries containing names and/or Hugging Face dataset IDs for evaluation datasets, e.g., [{"name": "SNLI", "id": "stanfordnlp/snli"}, {"id": "mteb/stsbenchmark-sts"}].

- **task_name** (*'str'*) – defaults to `semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more`

    The human-readable task the model is trained on, e.g., "semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more".

- **tags** (*'list[str] | None'*) – defaults to `<factory>`

    A list of tags for the model, e.g., ["sentence-transformers", "sentence-similarity", "feature-extraction"].

- **generate_widget_examples** (*"Literal['deprecated']"*) – defaults to `deprecated`
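As a hypothetical illustration of how this dataclass is filled in, the sketch below assumes PyLate mirrors the sentence-transformers pattern of passing `model_card_data` to the model constructor; the import path of PylateModelCardData is not shown in this diff, and every identifier below is a placeholder.

```python
from pylate import models
# Assumed location based on the hf_hub docs section above; adjust to wherever
# PyLate actually exposes the class.
from pylate.hf_hub import PylateModelCardData

card_data = PylateModelCardData(
    language="en",
    license="apache-2.0",
    model_name="ColBERT fine-tuned example",
    model_id="my-username/colbert-example",  # placeholder Hub ID
    train_datasets=[{"name": "MS MARCO", "id": "microsoft/ms_marco"}],
    tags=["ColBERT", "PyLate"],
)

# Assuming models.ColBERT forwards model_card_data to its SentenceTransformer
# base class, as sentence-transformers does.
model = models.ColBERT(
    model_name_or_path="lightonai/colbertv2.0",
    model_card_data=card_data,
)
```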
## Attributes

- **base_model**
- **base_model_revision**
- **best_model_step**
- **code_carbon_callback**
- **license**
- **model**
- **model_id**
- **model_name**
- **predict_example**
- **trainer**
## Methods

???- note "add_tags"

???- note "compute_dataset_metrics"

    Given a dataset, compute the following:

    * Dataset Size
    * Dataset Columns
    * Dataset Stats
        - Strings: min, mean, max word count/token length
        - Integers: Counter() instance
        - Floats: min, mean, max range
        - List: number of elements or min, mean, max number of elements
    * 3 Example samples
    * Loss function name
        - Loss function config

    **Parameters**

    - **dataset** (*'Dataset | IterableDataset | None'*)
    - **dataset_info** (*'dict[str, Any]'*)
    - **loss** (*'dict[str, nn.Module] | nn.Module | None'*)

???- note "extract_dataset_metadata"

???- note "format_eval_metrics"

    Format the evaluation metrics for the model card.

    The following keys will be returned:

    - eval_metrics: A list of dictionaries containing the class name, description, dataset name, and a markdown table. This is used to display the evaluation metrics in the model card.
    - metrics: A list of all metric keys. This is used in the model card metadata.
    - model-index: A list of dictionaries containing the task name, task type, dataset type, dataset name, metric name, metric type, and metric value. This is used to display the evaluation metrics in the model card metadata.

???- note "format_training_logs"

???- note "get"

    Get value for a given metadata key.

    **Parameters**

    - **key** (*str*)
    - **default** (*Any*) – defaults to `None`

???- note "get_codecarbon_data"

???- note "infer_datasets"

???- note "pop"

    Pop value for a given metadata key.

    **Parameters**

    - **key** (*str*)
    - **default** (*Any*) – defaults to `None`

???- note "register_model"

???- note "set_base_model"

???- note "set_best_model_step"

???- note "set_evaluation_metrics"

???- note "set_label_examples"

???- note "set_language"

???- note "set_license"

???- note "set_losses"

???- note "set_model_id"

???- note "set_widget_examples"

???- note "to_dict"

    Converts CardData to a dict.

    Returns: `dict`: CardData represented as a dictionary ready to be dumped to a YAML block for inclusion in a README.md file.

???- note "to_yaml"

    Dumps CardData to a YAML block for inclusion in a README.md file.

    Args:

    - line_break (str, *optional*): The line break to use when dumping to yaml.

    Returns: `str`: CardData represented as a YAML block.

    **Parameters**

    - **line_break** – defaults to `None`

???- note "try_to_set_base_model"

???- note "validate_datasets"
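For reference, a hypothetical round trip through the two serialization helpers documented above (continuing the `card_data` sketch from the Parameters section):

```python
# to_dict() returns the card data as a plain dictionary; to_yaml() renders the
# YAML metadata block that sits at the top of a README.md model card.
metadata = card_data.to_dict()
print(sorted(metadata.keys()))

yaml_block = card_data.to_yaml()
print(yaml_block)
```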