
v3.4.0 - Resolve memory leak when deleting a model & trainer; add Matryoshka & Cached loss compatibility; small features & bug fixes

Released by @tomaarsen on 23 Jan 15:21

This release resolves a memory leak when deleting a model & trainer, adds compatibility between the Cached... losses and the Matryoshka loss modifier, resolves numerous bugs, and adds several small features.

Install this version with

# Training + Inference
pip install sentence-transformers[train]==3.4.0

# Inference only, use one of:
pip install sentence-transformers==3.4.0
pip install sentence-transformers[onnx-gpu]==3.4.0
pip install sentence-transformers[onnx]==3.4.0
pip install sentence-transformers[openvino]==3.4.0

Matryoshka & Cached loss compatibility (#3068, #3107)

It is now possible to combine the strong Cached losses (CachedMultipleNegativesRankingLoss, CachedGISTEmbedLoss, CachedMultipleNegativesSymmetricRankingLoss) with the Matryoshka loss modifier:

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset

model = SentenceTransformer("microsoft/mpnet-base")
train_dataset = Dataset.from_dict({
    "anchor": ["It's nice weather outside today.", "He drove to work."],
    "positive": ["It's so sunny.", "He took the car to the office."],
})
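# A Cached loss can now be wrapped with the Matryoshka loss modifier, so the embeddings
# are also trained to stay useful when truncated to smaller dimensionalities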
loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)
loss = losses.MatryoshkaLoss(model, loss, [768, 512, 256, 128, 64])

trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

See for example tomaarsen/mpnet-base-gooaq-cmnrl-mrl, which was trained with CachedMultipleNegativesRankingLoss (CMNRL) and the Matryoshka loss modifier (MRL).

Resolve memory leak when Model and Trainer are reinitialized (#3144)

Due to a circular dependency (SentenceTransformerTrainer -> SentenceTransformer -> SentenceTransformerModelCardData -> SentenceTransformerTrainer), deleting the trainer and model did not actually free them during garbage collection. I've moved several components around so that SentenceTransformerModelCardData no longer needs to store the SentenceTransformerTrainer, breaking the cycle.

We ran the seed optimization script (which frequently creates and deletes models and trainers):

  • Before: approximate highest recorded VRAM: 16332MiB / 24576MiB
  • After: approximate highest recorded VRAM: 8222MiB / 24576MiB

Small Features

  • Add Matthews Correlation Coefficient to the BinaryClassificationEvaluator in #3051.
  • Add a triplet margin parameter to the TripletEvaluator in #2862.
  • When there are many datasets, put the dataset information in the automatically generated model card inside "expanding sections" blocks, in #3088.
  • Add multi-GPU (and CPU multi-process) support for mine_hard_negatives in #2967; a short usage sketch follows this list.
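As an illustration of the last item, hard negative mining can now spread the embedding work over multiple GPUs or CPU processes. This is a minimal sketch with a deliberately tiny placeholder dataset; the use_multi_process argument and device list reflect the multi-GPU support added in #2967, but treat the exact keyword and accepted values as assumptions to verify against the mine_hard_negatives documentation.

from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# The first column is treated as the anchor/query, the second as the positive
dataset = Dataset.from_dict({
    "query": ["what is the capital of france", "how many wheels does a car have"],
    "answer": ["Paris is the capital of France.", "A typical car has four wheels."],
})

# Mine hard negatives with the embedding work distributed over the listed devices;
# the use_multi_process keyword is assumed from the multi-GPU support described above
hard_dataset = mine_hard_negatives(
    dataset,
    model,
    num_negatives=1,
    use_multi_process=["cuda:0", "cuda:1"],
)
print(hard_dataset)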

Notable Bug Fixes

  • Subsequent batches were identical when using the no_duplicates batch sampler (#3069). This has been resolved in #3073.
  • The old-style model.fit() training with write_csv on an evaluator would crash (#3062). This has been resolved in #3066.
  • The output types of some evaluators were np.float instead of float (#3075). This has been resolved in #3076 and #3096.
  • It was not possible to specify a revision or cache_dir when loading a PEFT Adapter model (#3061). This has been resolved in #3079 and #3174; a loading sketch follows this list.
  • The CrossEncoder was lazily placed on the incorrect device and did not respond to model.to (#3078). This has been resolved in #3104.
  • If a model used a custom module with custom kwargs, those kwargs keys were not saved in modules.json correctly, e.g. relevant for jina-embeddings-v3 (#3111). This has been resolved in #3112.
  • HfArgumentParser(SentenceTransformerTrainingArguments) would crash due to the typing of the prompts argument (#3090). This has been resolved in #3178.
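With the PEFT-related fixes above in place, an adapter model can be loaded while pinning a specific revision and cache location. This is a minimal sketch: the repository name is a placeholder, and cache_folder is the standard SentenceTransformer constructor argument used here to stand in for the cache directory behaviour that was fixed.

from sentence_transformers import SentenceTransformer

# Load a PEFT adapter model while pinning a revision and cache location;
# "my-user/my-peft-adapter-model" is a placeholder repository name
model = SentenceTransformer(
    "my-user/my-peft-adapter-model",
    revision="main",
    cache_folder="/tmp/sentence_transformers_cache",
)
embeddings = model.encode(["This adapter model now respects the revision and cache settings."])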

Example Updates

  • Update the quantization script in #3070.
  • Update the seed optimization script in #3092.
  • Update the TSDAE scripts in #3137.
  • Add PEFT Adapter script in #3180.

Documentation Updates

All Changes

  • [training] Pass steps/epoch/output_path to Evaluator during training by @tomaarsen in #3066
  • [examples] Update the quantization script by @tomaarsen in #3070
  • [fix] Fix different batches per epoch in NoDuplicatesBatchSampler by @tomaarsen in #3073
  • [docs] Add links to backend-export in Speeding up Inference by @tomaarsen in #3071
  • add MCC to BinaryClassificationEvaluator by @JINO-ROHIT in #3051
  • support cached losses in combination with matryoshka loss by @Marcel256 in #3068
  • align model_card_templates.py with code by @amitport in #3081
  • converting np float result to float in binary classification evaluator by @JINO-ROHIT in #3076
  • Add triplet margin for distance functions in TripletEvaluator by @zivicmilos in #2862
  • [model_card] Keep the model card readable even with many datasets by @tomaarsen in #3088
  • [docs] Add NanoBEIR to the Training Overview evaluators by @tomaarsen in #3089
  • [fix] revision of the adapter model can now be specified. by @pesuchin in #3079
  • [docs] Update from Sphinx==3.5.4 to 8.1.3, recommonmark -> myst-parser by @tomaarsen in #3099
  • normalize to float in NanoBEIREvaluator, InformationRetrievalEvaluator, MSEEvaluator by @JINO-ROHIT in #3096
  • [docs] List 'prompts' as a key training argument by @tomaarsen in #3101
  • revert float type cast manually in BinaryClassificationEvaluator by @JINO-ROHIT in #3102
  • update train_sts_seed_optimization with SentenceTransformerTrainer by @JINO-ROHIT in #3092
  • Fix cross encoder device issue by @susnato in #3104
  • [enhancement] Make MultipleNegativesRankingLoss easier to understand by @tomaarsen in #3100
  • [fix] Fix breaking change in PyLate when loading modules by @tomaarsen in #3110
  • multi-GPU support for mine_hard_negatives by @alperctnkaya in #2967
  • raises error when dataset is an empty list in NanoBEIREvaluator by @JINO-ROHIT in #3122
  • Added a note to the documentation stating that the similarity method does not support embeddings other than non-quantized ones. by @pesuchin in #3131
  • [typo] Add missing space between sentences in error message by @tomaarsen in #3125
  • raises ValueError when num_label !=1 when using Crossencoder.rank() by @JINO-ROHIT in #3126
  • fix backward pass for cached losses by @Marcel256 in #3114
  • Adding evaluation checks to prevent Transformer ValueError by @stsfaroz in #3105
  • [typo] Fix incorrect spelling for "corpus" by @ignasgr in #3154
  • [fix] Save custom module kwargs if specified by @tomaarsen in #3112
  • [memory] Avoid storing trainer in ModelCardCallback and SentenceTransformerModelCardData by @tomaarsen in #3144
  • Suport for embedded representation by @Radu1999 in #3156
  • [DRAFT] tests for nanobeir evaluator by @JINO-ROHIT in #3127
  • Update TSDAE examples with SentenceTransformerTrainer by @JINO-ROHIT in #3137
  • [docs] Update the Static Embedding example snippet by @tomaarsen in #3177
  • fix: propagate cache dir to find adapter by @lauralehoczki11 in #3174
  • [fix] Use HfArgumentParser-compatible typing for prompts by @tomaarsen in #3178
  • testcases for community detection by @JINO-ROHIT in #3163
  • [docs] Add PEFT documentation + training example by @tomaarsen in #3180
  • [tests] Make TripletEvaluator test more consistent by @tomaarsen in #3183
  • [deprecation] Clarify that datasets and readers are deprecated since v3 by @tomaarsen in #3184
  • [docs] Update the documentation surrounding Matryoshka + Cached losses by @tomaarsen in #3190

New Contributors

An explicit thanks to @JINO-ROHIT, who made a large number of contributions to this release.

Full Changelog: v3.3.1...v3.4.0