Merge branch 'master' into helena/openvino-support
helena-intel authored Jun 25, 2024
2 parents 5234be0 + 2dee8c2 commit c8c8906
Showing 35 changed files with 315 additions and 175 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -53,7 +53,7 @@ If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA

## Getting Started

See [Quickstart](https://www.sbert.net/docs/quickstart.html) in our documenation.
See [Quickstart](https://www.sbert.net/docs/quickstart.html) in our documentation.

First download a pretrained model.

2 changes: 1 addition & 1 deletion docs/cross_encoder/pretrained_models.md
@@ -70,7 +70,7 @@ These models have been trained on the [Quora duplicate questions dataset](https:
```

## NLI
Given two sentences, are these contradicting each other, entailing one the other or are these netural? The following models were trained on the [SNLI](https://huggingface.co/datasets/stanfordnlp/snli) and [MultiNLI](https://huggingface.co/datasets/nyu-mll/multi_nli) datasets.
Given two sentences, are these contradicting each other, entailing one the other or are these neutral? The following models were trained on the [SNLI](https://huggingface.co/datasets/stanfordnlp/snli) and [MultiNLI](https://huggingface.co/datasets/nyu-mll/multi_nli) datasets.
- [cross-encoder/nli-deberta-v3-base](https://huggingface.co/cross-encoder/nli-deberta-v3-base) - Accuracy on MNLI mismatched set: 90.04
- [cross-encoder/nli-deberta-base](https://huggingface.co/cross-encoder/nli-deberta-base) - Accuracy on MNLI mismatched set: 88.08
- [cross-encoder/nli-deberta-v3-xsmall](https://huggingface.co/cross-encoder/nli-deberta-v3-xsmall) - Accuracy on MNLI mismatched set: 87.77
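
For illustration, a minimal sketch of scoring premise/hypothesis pairs with one of the cross-encoders listed above; the label order (contradiction, entailment, neutral) is assumed from the `cross-encoder/nli-deberta-v3-base` model card and should be checked for other checkpoints:

```python
from sentence_transformers import CrossEncoder

# Load one of the NLI cross-encoders listed above
model = CrossEncoder("cross-encoder/nli-deberta-v3-base")

# predict() takes a list of (premise, hypothesis) pairs and returns one row of
# scores per pair; the columns are assumed to be [contradiction, entailment, neutral]
scores = model.predict([
    ("A man is eating pizza", "A man eats something"),
    ("A man is eating pizza", "The man is sleeping"),
])
print(scores.argmax(axis=1))  # index of the most likely label for each pair
```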
4 changes: 2 additions & 2 deletions docs/pretrained-models/msmarco-v1.md
@@ -1,11 +1,11 @@
# MSMARCO Models
[MS MARCO](https://microsoft.github.io/msmarco/) is a large scale information retrieval corpus that was created based on real user search queries using Bing search engine. The provided models can be used for semantic search, i.e., given keywords / a search phrase / a question, the model will find passages that are relevant for the search query.

The training data constist of over 500k examples, while the complete corpus consist of over 8.8 Million passages.
The training data consists of over 500k examples, while the complete corpus consist of over 8.8 Million passages.



## Version Histroy
## Version History

### v1
Version 1 models were trained on the training set of MS Marco Passage retrieval task. The models were trained using in-batch negative sampling via the MultipleNegativesRankingLoss with a scaling factor of 20 and a batch size of 128.
2 changes: 1 addition & 1 deletion docs/pretrained-models/msmarco-v2.md
@@ -33,6 +33,6 @@ As baseline we show the results for lexical search with BM25 using Elasticsearch



## Version Histroy
## Version History

- [Version 1](msmarco-v1.md)
2 changes: 1 addition & 1 deletion docs/pretrained-models/msmarco-v3.md
@@ -57,7 +57,7 @@ If they received a low score by the cross-encoder, we saved them as hard negativ

We then trained the v2 models with these new hard negatives.

## Version Histroy
## Version History

- [Version 2](msmarco-v2.md)
- [Version 1](msmarco-v1.md)
4 changes: 2 additions & 2 deletions docs/pretrained-models/msmarco-v5.md
@@ -1,7 +1,7 @@
# MSMARCO Models
[MS MARCO](https://microsoft.github.io/msmarco/) is a large scale information retrieval corpus that was created based on real user search queries using Bing search engine. The provided models can be used for semantic search, i.e., given keywords / a search phrase / a question, the model will find passages that are relevant for the search query.

The training data constist of over 500k examples, while the complete corpus consist of over 8.8 Million passages.
The training data consists of over 500k examples, while the complete corpus consist of over 8.8 Million passages.

## Usage
```python
@@ -12,7 +12,7 @@ model = SentenceTransformer("msmarco-distilbert-dot-v5")
query_embedding = model.encode("How big is London")
passage_embedding = model.encode([
"London has 9,787,426 inhabitants at the 2011 census",
"London is known for its finacial district",
"London is known for its financial district",
])

print("Similarity:", util.dot_score(query_embedding, passage_embedding))
2 changes: 1 addition & 1 deletion docs/pretrained-models/nq-v1.md
@@ -1,5 +1,5 @@
# Natural Questions Models
[Google's Natural Questions dataset](https://ai.google.com/research/NaturalQuestions) constists of about 100k real search queries from Google with the respective, relevant passage from Wikipedia. Models trained on this dataset work well for question-answer retrieval.
[Google's Natural Questions dataset](https://ai.google.com/research/NaturalQuestions) consists of about 100k real search queries from Google with the respective, relevant passage from Wikipedia. Models trained on this dataset work well for question-answer retrieval.

## Usage

4 changes: 2 additions & 2 deletions docs/publications.md
@@ -73,10 +73,10 @@ When you use the unsupervised learning example, please have a look at: [TSDAE: U
}
```

When you use the GenQ learning example, please have a look at: [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663):
When you use the GenQ learning example, please have a look at: [BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663):
```bibtex
@inproceedings{thakur-2021-BEIR,
title = "BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
title = "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models",
author = {Thakur, Nandan and Reimers, Nils and R{\"{u}}ckl{\'{e}}, Andreas and Srivastava, Abhishek and Gurevych, Iryna},
booktitle={Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021) - Datasets and Benchmarks Track (Round 2)},
month = "4",
2 changes: 1 addition & 1 deletion docs/sentence_transformer/dataset_overview.md
@@ -10,7 +10,7 @@ It is important that your dataset format matches your loss function (or that you

In practice, most dataset configurations will take one of four forms:

- **Positive Pair**: A pair of related sentences. This can be used both for symmetric tasks (semantic textual similarity) or assymetric tasks (semantic search), with examples including pairs of paraphrases, pairs of full texts and their summaries, pairs of duplicate questions, pairs of (`query`, `response`), or pairs of (`source_language`, `target_language`). Natural Language Inference datasets can also be formatted this way by pairing entailing sentences.
- **Positive Pair**: A pair of related sentences. This can be used both for symmetric tasks (semantic textual similarity) or asymmetric tasks (semantic search), with examples including pairs of paraphrases, pairs of full texts and their summaries, pairs of duplicate questions, pairs of (`query`, `response`), or pairs of (`source_language`, `target_language`). Natural Language Inference datasets can also be formatted this way by pairing entailing sentences.
- **Examples:** [sentence-transformers/sentence-compression](https://huggingface.co/datasets/sentence-transformers/sentence-compression), [sentence-transformers/coco-captions](https://huggingface.co/datasets/sentence-transformers/coco-captions), [sentence-transformers/codesearchnet](https://huggingface.co/datasets/sentence-transformers/codesearchnet), [sentence-transformers/natural-questions](https://huggingface.co/datasets/sentence-transformers/natural-questions), [sentence-transformers/gooaq](https://huggingface.co/datasets/sentence-transformers/gooaq), [sentence-transformers/squad](https://huggingface.co/datasets/sentence-transformers/squad), [sentence-transformers/wikihow](https://huggingface.co/datasets/sentence-transformers/wikihow), [sentence-transformers/eli5](https://huggingface.co/datasets/sentence-transformers/eli5)
- **Triplets**: (anchor, positive, negative) text triplets. These datasets don't need labels.
- **Examples:** [sentence-transformers/quora-duplicates](https://huggingface.co/datasets/sentence-transformers/quora-duplicates), [nirantk/triplets](https://huggingface.co/datasets/nirantk/triplets), [sentence-transformers/all-nli](https://huggingface.co/datasets/sentence-transformers/all-nli)
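
To make the two most common shapes concrete, here is a small sketch using the `datasets` library; the column names (`anchor`, `positive`, `negative`) are illustrative and should match whatever the chosen loss expects:

```python
from datasets import Dataset

# Positive pairs: two related texts per row, no label column
positive_pairs = Dataset.from_dict({
    "anchor": ["How big is London?", "What is the capital of France?"],
    "positive": [
        "London has 9,787,426 inhabitants at the 2011 census",
        "Paris is the capital and most populous city of France",
    ],
})

# Triplets: an anchor with one related and one unrelated text, no label column
triplets = Dataset.from_dict({
    "anchor": ["How big is London?"],
    "positive": ["London has 9,787,426 inhabitants at the 2011 census"],
    "negative": ["The United Kingdom is the fourth largest exporter of goods"],
})
```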
2 changes: 1 addition & 1 deletion docs/sentence_transformer/pretrained_models.md
@@ -65,7 +65,7 @@ model = SentenceTransformer("multi-qa-mpnet-base-cos-v1")

query_embedding = model.encode("How big is London")
passage_embeddings = model.encode([
"London is known for its finacial district",
"London is known for its financial district",
"London has 9,787,426 inhabitants at the 2011 census",
"The United Kingdom is the fourth largest exporter of goods in the world",
])
2 changes: 1 addition & 1 deletion docs/sentence_transformer/training_overview.md
@@ -279,7 +279,7 @@ args = SentenceTransformerTrainingArguments(

You can provide the [`SentenceTransformerTrainer`](https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html#sentence_transformers.SentenceTransformer) with an `eval_dataset` to get the evaluation loss during training, but it may be useful to get more concrete metrics during training, too. For this, you can use evaluators to assess the model's performance with useful metrics before, during, or after training. You can use both an `eval_dataset` and an evaluator, one or the other, or neither. They evaluate based on the `eval_strategy` and `eval_steps` [Training Arguments](#training-arguments).

Here are the implemented Evaluators that come with Sentence Tranformers:
Here are the implemented Evaluators that come with Sentence Transformers:
```eval_rst
======================================================================== ===========================================================================================================================
Evaluator Required Data
64 changes: 32 additions & 32 deletions examples/applications/image-search/Image_Clustering.ipynb

Large diffs are not rendered by default.

84 changes: 42 additions & 42 deletions examples/applications/image-search/Image_Duplicates.ipynb

Large diffs are not rendered by default.

50 changes: 28 additions & 22 deletions examples/applications/image-search/Image_Search-multilingual.ipynb

Large diffs are not rendered by default.

46 changes: 23 additions & 23 deletions examples/applications/image-search/Image_Search.ipynb

Large diffs are not rendered by default.

@@ -128,7 +128,7 @@
"but",
"by",
"can",
"couldn",
"couldn", # codespell:ignore couldn
"couldn't",
"d",
"did",
4 changes: 2 additions & 2 deletions examples/training/sts/training_stsbenchmark.py
@@ -3,10 +3,10 @@
that can be compared using cosine-similarity to measure the similarity.
Usage:
python training_nli.py
python training_stsbenchmark.py
OR
python training_nli.py pretrained_transformer_model_name
python training_stsbenchmark.py pretrained_transformer_model_name
"""

import logging
2 changes: 1 addition & 1 deletion examples/unsupervised_learning/CT/train_stsb_ct.py
@@ -18,7 +18,7 @@
## Training parameters
model_name = "distilbert-base-uncased"
batch_size = 16
pos_neg_ratio = 8 # batch_size must be devisible by pos_neg_ratio
pos_neg_ratio = 8 # batch_size must be divisible by pos_neg_ratio
epochs = 1
max_seq_length = 75

2 changes: 1 addition & 1 deletion examples/unsupervised_learning/README.md
@@ -45,7 +45,7 @@ BERT showed that Masked Language Model (MLM) is a powerful pre-training approach

## GenQ

In our paper [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) we present a method to learn a semantic search method by generating queries for given passages. This method has been improved in [GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval](https://arxiv.org/abs/2112.07577).
In our paper [BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) we present a method to learn a semantic search method by generating queries for given passages. This method has been improved in [GPL: Generative Pseudo Labeling for Unsupervised Domain Adaptation of Dense Retrieval](https://arxiv.org/abs/2112.07577).

We pass all passages in our collection through a trained T5 model, which generates potential queries from users. We then use these (query, passage) pairs to train a SentenceTransformer model.

2 changes: 1 addition & 1 deletion examples/unsupervised_learning/query_generation/README.md
@@ -1,6 +1,6 @@
# GenQ

In our paper [BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) we presented a method to adapt a model for [asymmetric semantic search](../../applications/semantic-search/) without for a corpus without labeled training data.
In our paper [BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models](https://arxiv.org/abs/2104.08663) we presented a method to adapt a model for [asymmetric semantic search](../../applications/semantic-search/) without for a corpus without labeled training data.

## Background
In [asymmetric semantic search](../../applications/semantic-search/), the user provides a (short) query like some keywords or a question. We then want to retrieve a longer text passage that provides the answer.
6 changes: 3 additions & 3 deletions requirements.txt
@@ -1,10 +1,10 @@
transformers>=4.34.0,<5.0.0
transformers>=4.38.0,<5.0.0
tqdm
torch>=1.11.0
numpy
numpy<2.0.0
scikit-learn
scipy
huggingface-hub>=0.15.1
huggingface-hub>=0.19.3
Pillow
datasets
accelerate>=0.20.3
86 changes: 78 additions & 8 deletions sentence_transformers/SentenceTransformer.py
@@ -354,20 +354,84 @@ def __init__(
# Pass the model to the model card data for later use in generating a model card upon saving this model
self.model_card_data.register_model(self)

@overload
def encode(
self,
sentences: str,
prompt_name: Optional[str] = ...,
prompt: Optional[str] = ...,
batch_size: int = ...,
show_progress_bar: Optional[bool] = ...,
output_value: Optional[Literal["sentence_embedding", "token_embeddings"]] = ...,
precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = ...,
convert_to_numpy: Literal[False] = ...,
convert_to_tensor: Literal[False] = ...,
device: str = ...,
normalize_embeddings: bool = ...,
) -> Tensor: ...

@overload
def encode(
self,
sentences: Union[str, List[str]],
prompt_name: Optional[str] = ...,
prompt: Optional[str] = ...,
batch_size: int = ...,
show_progress_bar: Optional[bool] = ...,
output_value: Optional[Literal["sentence_embedding", "token_embeddings"]] = ...,
precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = ...,
convert_to_numpy: Literal[True] = ...,
convert_to_tensor: Literal[False] = ...,
device: str = ...,
normalize_embeddings: bool = ...,
) -> np.ndarray: ...

@overload
def encode(
self,
sentences: Union[str, List[str]],
prompt_name: Optional[str] = ...,
prompt: Optional[str] = ...,
batch_size: int = ...,
show_progress_bar: Optional[bool] = ...,
output_value: Optional[Literal["sentence_embedding", "token_embeddings"]] = ...,
precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = ...,
convert_to_numpy: bool = ...,
convert_to_tensor: Literal[True] = ...,
device: str = ...,
normalize_embeddings: bool = ...,
) -> Tensor: ...

@overload
def encode(
self,
sentences: Union[List[str], np.ndarray],
prompt_name: Optional[str] = ...,
prompt: Optional[str] = ...,
batch_size: int = ...,
show_progress_bar: Optional[bool] = ...,
output_value: Optional[Literal["sentence_embedding", "token_embeddings"]] = ...,
precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = ...,
convert_to_numpy: Literal[False] = ...,
convert_to_tensor: Literal[False] = ...,
device: str = ...,
normalize_embeddings: bool = ...,
) -> List[Tensor]: ...

def encode(
self,
sentences: Union[str, List[str]],
prompt_name: Optional[str] = None,
prompt: Optional[str] = None,
batch_size: int = 32,
show_progress_bar: bool = None,
show_progress_bar: Optional[bool] = None,
output_value: Optional[Literal["sentence_embedding", "token_embeddings"]] = "sentence_embedding",
precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = "float32",
convert_to_numpy: bool = True,
convert_to_tensor: bool = False,
device: str = None,
normalize_embeddings: bool = False,
) -> Union[List[Tensor], ndarray, Tensor]:
) -> Union[List[Tensor], np.ndarray, Tensor]:
"""
Computes sentence embeddings.
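
The practical effect of these overloads is that static type checkers can now infer the return type of `encode` from the input type and the `convert_to_numpy` / `convert_to_tensor` flags. A minimal sketch (model name illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = ["The first sentence", "The second sentence"]

embeddings = model.encode(sentences)                           # inferred as np.ndarray (default convert_to_numpy=True)
tensor = model.encode(sentences, convert_to_tensor=True)       # inferred as torch.Tensor
tensors = model.encode(sentences, convert_to_numpy=False)      # inferred as List[torch.Tensor]
single = model.encode("One sentence", convert_to_numpy=False)  # inferred as torch.Tensor
```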
@@ -429,9 +493,7 @@ def encode(

self.eval()
if show_progress_bar is None:
show_progress_bar = (
logger.getEffectiveLevel() == logging.INFO or logger.getEffectiveLevel() == logging.DEBUG
)
show_progress_bar = logger.getEffectiveLevel() in (logging.INFO, logging.DEBUG)

if convert_to_tensor:
convert_to_numpy = False
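
Since `show_progress_bar` now defaults from the effective logger level, a minimal way to get the progress bar without passing the flag is to raise the library logger to `INFO`. This is a sketch assuming the module loggers live under the `sentence_transformers` namespace:

```python
import logging

# Child loggers such as sentence_transformers.SentenceTransformer inherit this level,
# so getEffectiveLevel() returns INFO and the progress bar is shown by default.
logging.getLogger("sentence_transformers").setLevel(logging.INFO)
```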
@@ -565,7 +627,7 @@ def encode(
all_embeddings = torch.Tensor()
elif convert_to_numpy:
if not isinstance(all_embeddings, np.ndarray):
if all_embeddings[0].dtype == torch.bfloat16:
if all_embeddings and all_embeddings[0].dtype == torch.bfloat16:
all_embeddings = np.asarray([emb.float().numpy() for emb in all_embeddings])
else:
all_embeddings = np.asarray([emb.numpy() for emb in all_embeddings])
@@ -771,6 +833,7 @@ def encode_multi_process(
prompt: Optional[str] = None,
batch_size: int = 32,
chunk_size: int = None,
show_progress_bar: Optional[bool] = None,
precision: Literal["float32", "int8", "uint8", "binary", "ubinary"] = "float32",
normalize_embeddings: bool = False,
) -> np.ndarray:
@@ -795,6 +858,7 @@
batch_size (int): Encode sentences with batch size. (default: 32)
chunk_size (int): Sentences are chunked and sent to the individual processes. If None, it determines a
sensible size. Defaults to None.
show_progress_bar (bool, optional): Whether to output a progress bar when encoding sentences. Defaults to None.
precision (Literal["float32", "int8", "uint8", "binary", "ubinary"]): The precision to use for the
embeddings. Can be "float32", "int8", "uint8", "binary", or "ubinary". All non-float32 precisions
are quantized embeddings. Quantized embeddings are smaller in size and faster to compute, but may
@@ -829,6 +893,9 @@ def main():
if chunk_size is None:
chunk_size = min(math.ceil(len(sentences) / len(pool["processes"]) / 10), 5000)

if show_progress_bar is None:
show_progress_bar = logger.getEffectiveLevel() in (logging.INFO, logging.DEBUG)

logger.debug(f"Chunk data into {math.ceil(len(sentences) / chunk_size)} packages of size {chunk_size}")

input_queue = pool["input"]
@@ -849,7 +916,10 @@
last_chunk_id += 1

output_queue = pool["output"]
results_list = sorted([output_queue.get() for _ in range(last_chunk_id)], key=lambda x: x[0])
results_list = sorted(
[output_queue.get() for _ in trange(last_chunk_id, desc="Chunks", disable=not show_progress_bar)],
key=lambda x: x[0],
)
embeddings = np.concatenate([result[1] for result in results_list])
return embeddings
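
A usage sketch of the multi-process API with the newly added `show_progress_bar` argument; the model name is illustrative, and pool setup and teardown use the existing `start_multi_process_pool` / `stop_multi_process_pool` helpers:

```python
from sentence_transformers import SentenceTransformer

if __name__ == "__main__":
    model = SentenceTransformer("all-MiniLM-L6-v2")
    sentences = [f"This is sentence number {i}" for i in range(100_000)]

    # Spawn one worker per available device (or CPU workers otherwise)
    pool = model.start_multi_process_pool()

    # Sentences are chunked and distributed to the workers; the new flag shows a per-chunk progress bar
    embeddings = model.encode_multi_process(sentences, pool, batch_size=32, show_progress_bar=True)
    print(embeddings.shape)

    model.stop_multi_process_pool(pool)
```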

@@ -942,7 +1012,7 @@ def get_sentence_embedding_dimension(self) -> Optional[int]:
break
if self.truncate_dim is not None:
# The user requested truncation. If they set it to a dim greater than output_dim,
# no truncation will actually happen. So return output_dim insead of self.truncate_dim
# no truncation will actually happen. So return output_dim instead of self.truncate_dim
return min(output_dim or np.inf, self.truncate_dim)
return output_dim

