v3.4.0 - Resolved memory leak when deleting a model & trainer; add Matryoshka & Cached loss compatibility; small features & bug fixes
This release resolves a memory leak when deleting a model & trainer, adds compatibility between the Cached... losses and the Matryoshka loss modifier, resolves numerous bugs, and adds several small features.
Install this version with:

```bash
# Training + Inference
pip install sentence-transformers[train]==3.4.0

# Inference only, use one of:
pip install sentence-transformers==3.4.0
pip install sentence-transformers[onnx-gpu]==3.4.0
pip install sentence-transformers[onnx]==3.4.0
pip install sentence-transformers[openvino]==3.4.0
```
## Matryoshka & Cached loss compatibility (#3068, #3107)
It is now possible to combine the strong Cached losses (CachedMultipleNegativesRankingLoss, CachedGISTEmbedLoss, CachedMultipleNegativesSymmetricRankingLoss) with the Matryoshka loss modifier:
```python
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})
loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)
loss = losses.MatryoshkaLoss(model, loss, [768, 512, 256, 128, 64])

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```
See for example tomaarsen/mpnet-base-gooaq-cmnrl-mrl, which was trained with CachedMultipleNegativesRankingLoss (CMNRL) and the Matryoshka loss modifier (MRL).
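Because that model was trained with the Matryoshka modifier, its embeddings can be truncated at inference time with little quality loss. A minimal sketch, using the standard `truncate_dim` argument:

```python
from sentence_transformers import SentenceTransformer

# Load the Matryoshka-trained model, truncating all embeddings to 256 dimensions
model = SentenceTransformer("tomaarsen/mpnet-base-gooaq-cmnrl-mrl", truncate_dim=256)

embeddings = model.encode(["It's nice weather outside today."])
print(embeddings.shape)  # (1, 256)
```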
## Resolve memory leak when Model and Trainer are reinitialized (#3144)

Due to a circular dependency in `SentenceTransformerTrainer` -> `SentenceTransformer` -> `SentenceTransformerModelCardData` -> `SentenceTransformerTrainer`, deleting the trainer and model did not actually free them via garbage collection. I've moved a lot of components around, and now `SentenceTransformerModelCardData` no longer needs to store the `SentenceTransformerTrainer`, breaking the cycle.
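Reduced to a hypothetical sketch (the real classes carry far more state, but the shape of the leak is the same), the problem looks like this:

```python
import gc

class ModelCardData:
    def __init__(self):
        self.trainer = None  # before #3144: a strong reference back to the trainer

class Model:
    def __init__(self):
        self.model_card_data = ModelCardData()

class Trainer:
    def __init__(self, model):
        self.model = model
        model.model_card_data.trainer = self  # Trainer -> Model -> ModelCardData -> Trainer

model = Model()
trainer = Trainer(model)
del model, trainer

# Reference counting alone cannot reclaim the cycle, so everything the pair
# referenced (including GPU tensors) stays alive until the cyclic garbage
# collector happens to run. Removing the back-reference breaks the cycle.
print(gc.collect() > 0)  # True: only the cycle collector could free them
```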
We ran the seed optimization script (which frequently creates and deletes models and trainers):

- Before: approximate highest recorded VRAM: 16332MiB / 24576MiB
- After: approximate highest recorded VRAM: 8222MiB / 24576MiB
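To verify the behavior on your own hardware, a rough sketch along these lines should now show a flat memory profile instead of steady growth (assumes a CUDA device; the model choice is arbitrary):

```python
import gc
import torch
from sentence_transformers import SentenceTransformer

for i in range(5):
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cuda")
    model.encode(["warm-up sentence"])
    del model
    gc.collect()
    torch.cuda.empty_cache()
    # With the cycle broken, allocated memory returns to (near) zero each round
    print(f"iteration {i}: {torch.cuda.memory_allocated() / 1024**2:.0f} MiB still allocated")
```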
## Small Features
- Add Matthews Correlation Coefficient to the BinaryClassificationEvaluator in #3051.
- Add a triplet `margin` parameter to the TripletEvaluator in #2862 (see the sketch after this list).
- Put dataset information in the automatically generated model card in "expanding sections" blocks if there are many datasets in #3088.
- Add multi-GPU (and CPU multi-process) support for `mine_hard_negatives` in #2967.
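As an illustration of the new evaluator parameter, a sketch of how the `margin` might be used (the dataset values are made up; see the `TripletEvaluator` docs for the accepted `margin` types):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import TripletEvaluator

model = SentenceTransformer("all-MiniLM-L6-v2")
evaluator = TripletEvaluator(
    anchors=["A man is eating food."],
    positives=["A man is eating a meal."],
    negatives=["A man is riding a horse."],
    margin=0.1,  # new in v3.4.0: the positive must beat the negative by this much
)
print(evaluator(model))
```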
## Notable Bug Fixes
- Subsequent batches were identical when using the `no_duplicates` batch sampler (#3069). This has been resolved in #3073.
- The old-style `model.fit()` training with `write_csv` on an evaluator would crash (#3062). This has been resolved in #3066.
- The output types of some evaluators were `np.float` instead of `float` (#3075). This has been resolved in #3076 and #3096.
- It was not possible to specify a `revision` or `cache_dir` when loading a PEFT Adapter model (#3061). This has been resolved in #3079 and #3174 (see the sketch after this list).
- The CrossEncoder was lazily placed on the incorrect device and did not respond to `model.to` (#3078). This has been resolved in #3104.
- If a model used a custom module with custom kwargs, those `kwargs` keys were not saved in `modules.json` correctly, e.g. relevant for jina-embeddings-v3 (#3111). This has been resolved in #3112.
- `HfArgumentParser(SentenceTransformerTrainingArguments)` would crash due to the `prompts` typing (#3090). This has been resolved in #3178.
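For the PEFT fix specifically, both arguments now reach the adapter-resolution code path. A sketch (the repository name is a placeholder for any adapter-only repo):

```python
from sentence_transformers import SentenceTransformer

# Hypothetical adapter-only repository; before #3079/#3174, `revision` and
# the cache folder were not propagated when resolving the PEFT adapter.
model = SentenceTransformer(
    "your-username/your-peft-adapter",
    revision="main",
    cache_folder="/tmp/sentence_transformers_cache",
)
```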
## Example Updates
- Update the quantization script in #3070.
- Update the seed optimization script in #3092.
- Update the TSDAE scripts in #3137.
- Add PEFT Adapter script in #3180.
## Documentation Updates
- Add PEFT Adapter documentation in #3180.
- Add links to backend-export in Speeding up Inference.
## All Changes
- [`training`] Pass `steps`/`epoch`/`output_path` to Evaluator during training by @tomaarsen in #3066
- [`examples`] Update the quantization script by @tomaarsen in #3070
- [`fix`] Fix different batches per epoch in NoDuplicatesBatchSampler by @tomaarsen in #3073
- [`docs`] Add links to backend-export in Speeding up Inference by @tomaarsen in #3071
- add MCC to BinaryClassificationEvaluator by @JINO-ROHIT in #3051
- support cached losses in combination with matryoshka loss by @Marcel256 in #3068
- align model_card_templates.py with code by @amitport in #3081
- converting np float result to float in binary classification evaluator by @JINO-ROHIT in #3076
- Add triplet margin for distance functions in TripletEvaluator by @zivicmilos in #2862
- [`model_card`] Keep the model card readable even with many datasets by @tomaarsen in #3088
- [`docs`] Add NanoBEIR to the Training Overview evaluators by @tomaarsen in #3089
- [fix] revision of the adapter model can now be specified. by @pesuchin in #3079
- [`docs`] Update from Sphinx==3.5.4 to 8.1.3, recommonmark -> myst-parser by @tomaarsen in #3099
- normalize to float in NanoBEIREvaluator, InformationRetrievalEvaluator, MSEEvaluator by @JINO-ROHIT in #3096
- [`docs`] List 'prompts' as a key training argument by @tomaarsen in #3101
- revert float type cast manually in BinaryClassificationEvaluator by @JINO-ROHIT in #3102
- update train_sts_seed_optimization with SentenceTransformerTrainer by @JINO-ROHIT in #3092
- Fix cross encoder device issue by @susnato in #3104
- [`enhancement`] Make MultipleNegativesRankingLoss easier to understand by @tomaarsen in #3100
- [`fix`] Fix breaking change in PyLate when loading modules by @tomaarsen in #3110
- multi-GPU support for mine_hard_negatives by @alperctnkaya in #2967
- raises error when dataset is an empty list in NanoBEIREvaluator by @JINO-ROHIT in #3122
- Added a note to the documentation stating that the similarity method does not support embeddings other than non-quantized ones. by @pesuchin in #3131
- [`typo`] Add missing space between sentences in error message by @tomaarsen in #3125
- raises ValueError when num_label != 1 when using CrossEncoder.rank() by @JINO-ROHIT in #3126
- fix backward pass for cached losses by @Marcel256 in #3114
- Adding evaluation checks to prevent Transformer ValueError by @stsfaroz in #3105
- [typo] Fix incorrect spelling for "corpus" by @ignasgr in #3154
- [`fix`] Save custom module `kwargs` if specified by @tomaarsen in #3112
- [`memory`] Avoid storing trainer in ModelCardCallback and SentenceTransformerModelCardData by @tomaarsen in #3144
- Support for embedded representation by @Radu1999 in #3156
- [DRAFT] tests for nanobeir evaluator by @JINO-ROHIT in #3127
- Update TSDAE examples with SentenceTransformerTrainer by @JINO-ROHIT in #3137
- [`docs`] Update the Static Embedding example snippet by @tomaarsen in #3177
- fix: propagate cache dir to find adapter by @lauralehoczki11 in #3174
- [`fix`] Use HfArgumentParser-compatible typing for prompts by @tomaarsen in #3178
- testcases for community detection by @JINO-ROHIT in #3163
- [`docs`] Add PEFT documentation + training example by @tomaarsen in #3180
- [`tests`] Make TripletEvaluator test more consistent by @tomaarsen in #3183
- [`deprecation`] Clarify that datasets and readers are deprecated since v3 by @tomaarsen in #3184
- [docs] Update the documentation surrounding Matryoshka + Cached losses by @tomaarsen in #3190
## New Contributors
- @JINO-ROHIT made their first contribution in #3051
- @Marcel256 made their first contribution in #3068
- @amitport made their first contribution in #3081
- @zivicmilos made their first contribution in #2862
- @susnato made their first contribution in #3104
- @alperctnkaya made their first contribution in #2967
- @stsfaroz made their first contribution in #3105
- @ignasgr made their first contribution in #3154
- @Radu1999 made their first contribution in #3156
- @lauralehoczki11 made their first contribution in #3174
An explicit thanks to @JINO-ROHIT, who made a large number of contributions in this release.
Full Changelog: v3.3.1...v3.4.0