Merge branch 'master' into helena/openvino-support
helena-intel authored Jun 25, 2024
2 parents 5234be0 + 2dee8c2 commit c8c8906
Showing 35 changed files with 315 additions and 175 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -53,7 +53,7 @@ If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA

## Getting Started

See [Quickstart](https://www.sbert.net/docs/quickstart.html) in our documenation.
See [Quickstart](https://www.sbert.net/docs/quickstart.html) in our documentation.

First download a pretrained model.

2 changes: 1 addition & 1 deletion docs/cross_encoder/pretrained_models.md
@@ -70,7 +70,7 @@ These models have been trained on the [Quora duplicate questions dataset](https:
```

## NLI
Given two sentences, are these contradicting each other, entailing one the other or are these netural? The following models were trained on the [SNLI](https://huggingface.co/datasets/stanfordnlp/snli) and [MultiNLI](https://huggingface.co/datasets/nyu-mll/multi_nli) datasets.
Given two sentences, are these contradicting each other, entailing one the other or are these neutral? The following models were trained on the [SNLI](https://huggingface.co/datasets/stanfordnlp/snli) and [MultiNLI](https://huggingface.co/datasets/nyu-mll/multi_nli) datasets.
- [cross-encoder/nli-deberta-v3-base](https://huggingface.co/cross-encoder/nli-deberta-v3-base) - Accuracy on MNLI mismatched set: 90.04
- [cross-encoder/nli-deberta-base](https://huggingface.co/cross-encoder/nli-deberta-base) - Accuracy on MNLI mismatched set: 88.08
- [cross-encoder/nli-deberta-v3-xsmall](https://huggingface.co/cross-encoder/nli-deberta-v3-xsmall) - Accuracy on MNLI mismatched set: 87.77
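
For illustration, a minimal sketch of scoring premise/hypothesis pairs with one of the cross-encoders listed above; the label order (contradiction, entailment, neutral) is assumed from the `cross-encoder/nli-deberta-v3-base` model card and should be checked for other checkpoints:

```python
from sentence_transformers import CrossEncoder

# Load one of the NLI cross-encoders listed above
model = CrossEncoder("cross-encoder/nli-deberta-v3-base")

# predict() takes a list of (premise, hypothesis) pairs and returns one row of
# scores per pair; the columns are assumed to be [contradiction, entailment, neutral]
scores = model.predict([
    ("A man is eating pizza", "A man eats something"),
    ("A man is eating pizza", "The man is sleeping"),
])
print(scores.argmax(axis=1))  # index of the most likely label for each pair
```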
4 changes: 2 additions & 2 deletions docs/pretrained-models/msmarco-v1.md
@@ -1,11 +1,11 @@
# MSMARCO Models
[MS MARCO](https://microsoft.github.io/msmarco/) is a large scale information retrieval corpus that was created based on real user search queries using Bing search engine. The provided models can be used for semantic search, i.e., given keywords / a search phrase / a question, the model will find passages that are relevant for the search query.

The training data constist of over 500k examples, while the complete corpus consist of over 8.8 Million passages.
The training data consists of over 500k examples, while the complete corpus consist of over 8.8 Million passages.



## Version Histroy
## Version History

### v1
Version 1 models were trained on the training set of MS Marco Passage retrieval task. The models were trained using in-batch negative sampling via the MultipleNegativesRankingLoss with a scaling factor of 20 and a batch size of 128.
2 changes: 1 addition & 1 deletion docs/pretrained-models/msmarco-v2.md
@@ -33,6 +33,6 @@ As baseline we show the results for lexical search with BM25 using Elasticsearch



## Version Histroy
## Version History

- [Version 1](msmarco-v1.md)
2 changes: 1 addition & 1 deletion docs/pretrained-models/msmarco-v3.md
@@ -57,7 +57,7 @@ If they received a low score by the cross-encoder, we saved them as hard negativ

We then trained the v2 models with these new hard negatives.

## Version Histroy
## Version History

- [Version 2](msmarco-v2.md)
- [Version 1](msmarco-v1.md)
4 changes: 2 additions & 2 deletions docs/pretrained-models/msmarco-v5.md
@@ -1,7 +1,7 @@
# MSMARCO Models
[MS MARCO](https://microsoft.github.io/msmarco/) is a large scale information retrieval corpus that was created based on real user search queries using Bing search engine. The provided models can be used for semantic search, i.e., given keywords / a search phrase / a question, the model will find passages that are relevant for the search query.

The training data constist of over 500k examples, while the complete corpus consist of over 8.8 Million passages.
The training data consists of over 500k examples, while the complete corpus consist of over 8.8 Million passages.

## Usage
```python
@@ -12,7 +12,7 @@ model = SentenceTransformer("msmarco-distilbert-dot-v5")
query_embedding = model.encode("How big is London")
passage_embedding = model.encode([
"London has 9,787,426 inhabitants at the 2011 census",
"London is known for its finacial district",
"London is known for its financial district",
])

print("Similarity:", util.dot_score(query_embedding, passage_embedding))
2 changes: 1 addition & 1 deletion docs/pretrained-models/nq-v1.md
@@ -1,5 +1,5 @@
# Natural Questions Models
[Google's Natural Questions dataset](https://ai.google.com/research/NaturalQuestions) constists of about 100k real search queries from Google with the respective, relevant passage from Wikipedia. Models trained on this dataset work well for question-answer retrieval.
[Google's Natural Questions dataset](https://ai.google.com/research/NaturalQuestions) consists of about 100k real search queries from Google with the respective, relevant passage from Wikipedia. Models trained on this dataset work well for question-answer retrieval.

## Usage

4 changes: 2 additions & 2 deletions docs/publications.md
@@ -73,10 +73,10 @@ When you use the unsupervised learning example, please have a look at: [TSDAE: U
}
```

When you use the GenQ learning example, please have a look at: [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663):
When you use the GenQ learning example, please have a look at: [BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663):
```bibtex
@inproceedings{thakur-2021-BEIR,
title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
title = "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
author = {Thakur, Nandan and Reimers, Nils and R{\"{u}}ckl{\'{e}}, Andreas and Srivastava, Abhishek and Gurevych, Iryna},
booktitle={Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021) - Datasets and Benchmarks Track (Round 2)},
month = "4",
2 changes: 1 addition & 1 deletion docs/sentence_transformer/dataset_overview.md
@@ -10,7 +10,7 @@ It is important that your dataset format matches your loss function (or that you

In practice, most dataset configurations will take one of four forms:

- **Positive Pair**: A pair of related sentences. This can be used both for symmetric tasks (semantic textual similarity) or assymetric tasks (semantic search), with examples including pairs of paraphrases, pairs of full texts and their summaries, pairs of duplicate questions, pairs of (`query`, `response`), or pairs of (`source_language`, `target_language`). Natural Language Inference datasets can also be formatted this way by pairing entailing sentences.
- **Positive Pair**: A pair of related sentences. This can be used both for symmetric tasks (semantic textual similarity) or asymmetric tasks (semantic search), with examples including pairs of paraphrases, pairs of full texts and their summaries, pairs of duplicate questions, pairs of (`query`, `response`), or pairs of (`source_language`, `target_language`). Natural Language Inference datasets can also be formatted this way by pairing entailing sentences.
- **Examples:** [sentence-transformers/sentence-compression](https://huggingface.co/datasets/sentence-transformers/sentence-compression), [sentence-transformers/coco-captions](https://huggingface.co/datasets/sentence-transformers/coco-captions), [sentence-transformers/codesearchnet](https://huggingface.co/datasets/sentence-transformers/codesearchnet), [sentence-transformers/natural-questions](https://huggingface.co/datasets/sentence-transformers/natural-questions), [sentence-transformers/gooaq](https://huggingface.co/datasets/sentence-transformers/gooaq), [sentence-transformers/squad](https://huggingface.co/datasets/sentence-transformers/squad), [sentence-transformers/wikihow](https://huggingface.co/datasets/sentence-transformers/wikihow), [sentence-transformers/eli5](https://huggingface.co/datasets/sentence-transformers/eli5)
- **Triplets**: (anchor, positive, negative) text triplets. These datasets don't need labels.
- **Examples:** [sentence-transformers/quora-duplicates](https://huggingface.co/datasets/sentence-transformers/quora-duplicates), [nirantk/triplets](https://huggingface.co/datasets/nirantk/triplets), [sentence-transformers/all-nli](https://huggingface.co/datasets/sentence-transformers/all-nli)
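
To make the two most common shapes concrete, here is a small sketch using the `datasets` library; the column names (`anchor`, `positive`, `negative`) are illustrative and should match whatever the chosen loss expects:

```python
from datasets import Dataset

# Positive pairs: two related texts per row, no label column
positive_pairs = Dataset.from_dict({
    "anchor": ["How big is London?", "What is the capital of France?"],
    "positive": [
        "London has 9,787,426 inhabitants at the 2011 census",
        "Paris is the capital and most populous city of France",
    ],
})

# Triplets: an anchor with one related and one unrelated text, no label column
triplets = Dataset.from_dict({
    "anchor": ["How big is London?"],
    "positive": ["London has 9,787,426 inhabitants at the 2011 census"],
    "negative": ["The United Kingdom is the fourth largest exporter of goods"],
})
```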
2 changes: 1 addition & 1 deletion docs/sentence_transformer/pretrained_models.md
@@ -65,7 +65,7 @@ model = SentenceTransformer("multi-qa-mpnet-base-cos-v1")

query_embedding = model.encode("How big is London")
passage_embeddings = model.encode([
"London is known for its finacial district",
"London is known for its financial district",
"London has 9,787,426 inhabitants at the 2011 census",
"The United Kingdom is the fourth largest exporter of goods in the world",
])
2 changes: 1 addition & 1 deletion docs/sentence_transformer/training_overview.md
@@ -279,7 +279,7 @@ args = SentenceTransformerTrainingArguments(

You can provide the [`SentenceTransformerTrainer`](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer) with an `eval_dataset` to get the evaluation loss during training, but it may be useful to get more concrete metrics during training, too. For this, you can use evaluators to assess the model's performance with useful metrics before, during, or after training. You can use both an `eval_dataset` and an evaluator, one or the other, or neither. They evaluate based on the `eval_strategy` and `eval_steps` [Training Arguments](#training-arguments).

Here are the implemented Evaluators that come with Sentence Tranformers:
Here are the implemented Evaluators that come with Sentence Transformers:
```eval_rst
======================================================================== ===========================================================================================================================
Evaluator Required Data
64 changes: 32 additions & 32 deletions examples/applications/image-search/Image_Clustering.ipynb

Large diffs are not rendered by default.

84 changes: 42 additions & 42 deletions examples/applications/image-search/Image_Duplicates.ipynb

Large diffs are not rendered by default.

50 changes: 28 additions & 22 deletions examples/applications/image-search/Image_Search-multilingual.ipynb

Large diffs are not rendered by default.

46 changes: 23 additions & 23 deletions examples/applications/image-search/Image_Search.ipynb

Large diffs are not rendered by default.

@@ -128,7 +128,7 @@
"but",
"by",
"can",
"couldn",
"couldn", # codespell:ignore couldn
"couldn't",
"d",
"did",
4 changes: 2 additions & 2 deletions examples/training/sts/training_stsbenchmark.py
@@ -3,10 +3,10 @@
that can be compared using cosine-similarity to measure the similarity.
Usage:
python training_nli.py
python training_stsbenchmark.py
OR
python training_nli.py pretrained_transformer_model_name
python training_stsbenchmark.py pretrained_transformer_model_name
"""

import logging
2 changes: 1 addition & 1 deletion examples/unsupervised_learning/CT/train_stsb_ct.py
@@ -18,7 +18,7 @@
## Training parameters
model_name = "distilbert-base-uncased"
batch_size = 16
pos_neg_ratio = 8 # batch_size must be devisible by pos_neg_ratio
pos_neg_ratio = 8 # batch_size must be divisible by pos_neg_ratio
epochs = 1
max_seq_length = 75

2 changes: 1 addition & 1 deletion examples/unsupervised_learning/README.md
@@ -45,7 +45,7 @@ BERT showed that Masked Language Model (MLM) is a powerful pre-training approach

## GenQ

In our paper [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) we present a method to learn a semantic search method by generating queries for given passages. This method has been improved in [GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval](https://arxiv.org/abs/2112.07577).
In our paper [BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) we present a method to learn a semantic search method by generating queries for given passages. This method has been improved in [GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval](https://arxiv.org/abs/2112.07577).

We pass all passages in our collection through a trained T5 model, which generates potential queries from users. We then use these (query, passage) pairs to train a SentenceTransformer model.

2 changes: 1 addition & 1 deletion examples/unsupervised_learning/query_generation/README.md
@@ -1,6 +1,6 @@
# GenQ

In our paper [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) we presented a method to adapt a model for [asymmetric semantic search](../../applications/semantic-search/) without for a corpus without labeled training data.
In our paper [BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) we presented a method to adapt a model for [asymmetric semantic search](../../applications/semantic-search/) without for a corpus without labeled training data.

## Background
In [asymmetric semantic search](../../applications/semantic-search/), the user provides a (short) query like some keywords or a question. We then want to retrieve a longer text passage that provides the answer.
6 changes: 3 additions & 3 deletions requirements.txt
@@ -1,10 +1,10 @@
transformers>=4.34.0,<5.0.0
transformers>=4.38.0,<5.0.0
tqdm
torch>=1.11.0
numpy
numpy<2.0.0
scikit-learn
scipy
huggingface-hub>=0.15.1
huggingface-hub>=0.19.3
Pillow
datasets
accelerate>=0.20.3
86 changes: 78 additions & 8 deletions sentence_transformers/SentenceTransformer.py
@@ -354,20 +354,84 @@ def __init__(
# Pass the model to the model card data for later use in generating a model card upon saving this model
self.model_card_data.register_model(self)

@overload
def encode(
self,
sentences: str,
prompt_name: Optional[str] = ...,
prompt: Optional[str] = ...,
batch_size: int = ...,
show_progress_bar: Optional[bool] = ...,
output_value: Optional[Literal["sentence_embedding", "token_embeddings"]] = ...,
precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = ...,
convert_to_numpy: Literal[False] = ...,
convert_to_tensor: Literal[False] = ...,
device: str = ...,
normalize_embeddings: bool = ...,
) -> Tensor: ...

@overload
def encode(
self,
sentences: Union[str, List[str]],
prompt_name: Optional[str] = ...,
prompt: Optional[str] = ...,
batch_size: int = ...,
show_progress_bar: Optional[bool] = ...,
output_value: Optional[Literal["sentence_embedding", "token_embeddings"]] = ...,
precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = ...,
convert_to_numpy: Literal[True] = ...,
convert_to_tensor: Literal[False] = ...,
device: str = ...,
normalize_embeddings: bool = ...,
) -> np.ndarray: ...

@overload
def encode(
self,
sentences: Union[str, List[str]],
prompt_name: Optional[str] = ...,
prompt: Optional[str] = ...,
batch_size: int = ...,
show_progress_bar: Optional[bool] = ...,
output_value: Optional[Literal["sentence_embedding", "token_embeddings"]] = ...,
precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = ...,
convert_to_numpy: bool = ...,
convert_to_tensor: Literal[True] = ...,
device: str = ...,
normalize_embeddings: bool = ...,
) -> Tensor: ...

@overload
def encode(
self,
sentences: Union[List[str], np.ndarray],
prompt_name: Optional[str] = ...,
prompt: Optional[str] = ...,
batch_size: int = ...,
show_progress_bar: Optional[bool] = ...,
output_value: Optional[Literal["sentence_embedding", "token_embeddings"]] = ...,
precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = ...,
convert_to_numpy: Literal[False] = ...,
convert_to_tensor: Literal[False] = ...,
device: str = ...,
normalize_embeddings: bool = ...,
) -> List[Tensor]: ...

def encode(
self,
sentences: Union[str, List[str]],
prompt_name: Optional[str] = None,
prompt: Optional[str] = None,
batch_size: int = 32,
show_progress_bar: bool = None,
show_progress_bar: Optional[bool] = None,
output_value: Optional[Literal["sentence_embedding", "token_embeddings"]] = "sentence_embedding",
precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = "float32",
convert_to_numpy: bool = True,
convert_to_tensor: bool = False,
device: str = None,
normalize_embeddings: bool = False,
) -> Union[List[Tensor], ndarray, Tensor]:
) -> Union[List[Tensor], np.ndarray, Tensor]:
"""
Computes sentence embeddings.
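
The practical effect of these overloads is that static type checkers can now infer the return type of `encode` from the input type and the `convert_to_numpy` / `convert_to_tensor` flags. A minimal sketch (model name illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["The first sentence", "The second sentence"]

embeddings = model.encode(sentences)                           # inferred as np.ndarray (default convert_to_numpy=True)
tensor = model.encode(sentences, convert_to_tensor=True)       # inferred as torch.Tensor
tensors = model.encode(sentences, convert_to_numpy=False)      # inferred as List[torch.Tensor]
single = model.encode("One sentence", convert_to_numpy=False)  # inferred as torch.Tensor
```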
@@ -429,9 +493,7 @@ def encode(

self.eval()
if show_progress_bar is None:
show_progress_bar = (
logger.getEffectiveLevel() == logging.INFO or logger.getEffectiveLevel() == logging.DEBUG
)
show_progress_bar = logger.getEffectiveLevel() in (logging.INFO, logging.DEBUG)

if convert_to_tensor:
convert_to_numpy = False
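
Since `show_progress_bar` now defaults from the effective logger level, a minimal way to get the progress bar without passing the flag is to raise the library logger to `INFO`. This is a sketch assuming the module loggers live under the `sentence_transformers` namespace:

```python
import logging

# Child loggers such as sentence_transformers.SentenceTransformer inherit this level,
# so getEffectiveLevel() returns INFO and the progress bar is shown by default.
logging.getLogger("sentence_transformers").setLevel(logging.INFO)
```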
@@ -565,7 +627,7 @@ def encode(
all_embeddings = torch.Tensor()
elif convert_to_numpy:
if not isinstance(all_embeddings, np.ndarray):
if all_embeddings[0].dtype == torch.bfloat16:
if all_embeddings and all_embeddings[0].dtype == torch.bfloat16:
all_embeddings = np.asarray([emb.float().numpy() for emb in all_embeddings])
else:
all_embeddings = np.asarray([emb.numpy() for emb in all_embeddings])
@@ -771,6 +833,7 @@ def encode_multi_process(
prompt: Optional[str] = None,
batch_size: int = 32,
chunk_size: int = None,
show_progress_bar: Optional[bool] = None,
precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = "float32",
normalize_embeddings: bool = False,
) -> np.ndarray:
@@ -795,6 +858,7 @@
batch_size (int): Encode sentences with batch size. (default: 32)
chunk_size (int): Sentences are chunked and sent to the individual processes. If None, it determines a
sensible size. Defaults to None.
show_progress_bar (bool, optional): Whether to output a progress bar when encoding sentences. Defaults to None.
precision (Literal["float32", "int8", "uint8", "binary", "ubinary"]): The precision to use for the
embeddings. Can be "float32", "int8", "uint8", "binary", or "ubinary". All non-float32 precisions
are quantized embeddings. Quantized embeddings are smaller in size and faster to compute, but may
@@ -829,6 +893,9 @@ def main():
if chunk_size is None:
chunk_size = min(math.ceil(len(sentences) / len(pool["processes"]) / 10), 5000)

if show_progress_bar is None:
show_progress_bar = logger.getEffectiveLevel() in (logging.INFO, logging.DEBUG)

logger.debug(f"Chunk data into {math.ceil(len(sentences) / chunk_size)} packages of size {chunk_size}")

input_queue = pool["input"]
@@ -849,7 +916,10 @@
last_chunk_id += 1

output_queue = pool["output"]
results_list = sorted([output_queue.get() for _ in range(last_chunk_id)], key=lambda x: x[0])
results_list = sorted(
[output_queue.get() for _ in trange(last_chunk_id, desc="Chunks", disable=not show_progress_bar)],
key=lambda x: x[0],
)
embeddings = np.concatenate([result[1] for result in results_list])
return embeddings
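
A usage sketch of the multi-process API with the newly added `show_progress_bar` argument; the model name is illustrative, and pool setup and teardown use the existing `start_multi_process_pool` / `stop_multi_process_pool` helpers:

```python
from sentence_transformers import SentenceTransformer

if __name__ == "__main__":
    model = SentenceTransformer("all-MiniLM-L6-v2")
    sentences = [f"This is sentence number {i}" for i in range(100_000)]

    # Spawn one worker per available device (or CPU workers otherwise)
    pool = model.start_multi_process_pool()

    # Sentences are chunked and distributed to the workers; the new flag shows a per-chunk progress bar
    embeddings = model.encode_multi_process(sentences, pool, batch_size=32, show_progress_bar=True)
    print(embeddings.shape)

    model.stop_multi_process_pool(pool)
```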

@@ -942,7 +1012,7 @@ def get_sentence_embedding_dimension(self) -> Optional[int]:
break
if self.truncate_dim is not None:
# The user requested truncation. If they set it to a dim greater than output_dim,
# no truncation will actually happen. So return output_dim insead of self.truncate_dim
# no truncation will actually happen. So return output_dim instead of self.truncate_dim
return min(output_dim or np.inf, self.truncate_dim)
return output_dim

