
Unstable Training Loss and Model Evaluation #3041

Open · nguyenvannghiem0312 opened this issue Nov 7, 2024 · 7 comments

Comments

@nguyenvannghiem0312

[screenshots: training loss curves showing the instability]

While training the multilingual-E5-base model, I encountered an unstable loss pattern. In previous runs the training loss was stable. I tried changing the model but ran into the same issue. Could you help me understand what might be causing this?

@pesuchin
Contributor

@nguyenvannghiem0312
Thanks for posting the issue!
For reference, could you let me know which loss class caused the instability?

@nguyenvannghiem0312
Author

Hello @pesuchin, sorry I missed the notification. I used the CachedMNRL and MNRL loss functions and encountered issues with both. I found that when there are only anchor and positive samples, training is stable; however, when I add negatives with the following code, it becomes unstable, as shown in the images above:

# Build the dataset column-wise: one list per column.
train_datasets = read_json_or_dataset(config["train_path"])
train_datasets = process_data(train=train_datasets, number_negatives=config['number_negatives'])
datasets = {}
anchor = [config["query_prompt"] + item["anchor"] for item in train_datasets]
positive = [config["corpus_prompt"] + item["positive"] for item in train_datasets]

datasets["anchor"] = anchor
datasets["positive"] = positive
# Note: only a single "negative" column is built here, even when number_negatives > 1.
if "negative" in train_datasets[0] and config["is_triplet"]:
    negative = [config["corpus_prompt"] + item["negative"] for item in train_datasets]
    datasets["negative"] = negative
return Dataset.from_dict(datasets)

When I adjusted the code to add negatives as shown below, the training process stabilized:

# Build the dataset row-wise: one dict per example, with a separate
# negative_{idx} column for each mined negative.
datasets = []
for item in train_datasets:
    sample = {
        'anchor': config["query_prompt"] + item["anchor"],
        'positive': config["corpus_prompt"] + item["positive"]
    }
    if config["is_triplet"]:
        for idx in range(config['number_negatives']):
            sample[f'negative_{idx}'] = config["corpus_prompt"] + item[f'negative_{idx}']

    datasets.append(sample)
datasets = Dataset.from_list(datasets)
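
For context, a minimal sketch of how a dataset built this way is passed to the trainer (the model name, output directory, and epoch count here are placeholders, not the exact values from my script); as I understand it, with (Cached)MNRL every column after "anchor" and "positive" is treated as an additional hard negative:

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("intfloat/multilingual-e5-base")  # placeholder checkpoint
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # placeholder
    per_device_train_batch_size=512,
    num_train_epochs=10,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=datasets,  # the Dataset built above
    loss=loss,
)
trainer.train()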

I'm not sure if there was an error in my initial code, but I've resolved the issue with the updated code.

@nguyenvannghiem0312
Author

Oh @pesuchin, I just realized that the issue is not only from the negative samples, but also from the batch size of CachedMNRL. When I train with a batch size of 128 (with 3 negative samples per query), the loss is stable.
[screenshot: stable loss curve, batch size 128]

But when using a batch size of 512 (with 3 negative samples per query), it becomes unstable, as shown in the image below.
[screenshot: unstable loss curve, batch size 512]

However, when I train with a batch size of 4096 (without using negatives), the process is mostly stable, though I did notice a slight instability here:
[screenshot: mostly stable loss curve, batch size 4096, no negatives]

So, the issue comes from both the negative samples and the batch size.
I also noticed one more thing: each time training becomes unstable, the "train/epoch" graph also becomes unstable (it’s not a straight line).

@nguyenvannghiem0312
Author

Although there is some instability here, the results at the best checkpoint are still quite good. However, this loss graph is not ideal for including in the report.

@tomaarsen
Collaborator

Wow, those are very fascinating results.
Could you perhaps use batch_sampler="no_duplicates"? This prevents identical texts within a batch, which helps remove false negatives among the in-batch negatives. That might be a cause: otherwise a specific text can be both a positive and a negative for the same anchor, which is confusing for the model.
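
A minimal sketch of enabling this (the output directory and batch size below are placeholders):

from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="output",                        # placeholder
    per_device_train_batch_size=512,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # no identical texts within a batch
)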

Additionally, would you describe your negatives as very hard? I can imagine that very hard negatives might result in odd behaviour, as the model has trouble distinguishing between the positive and the very hard negatives.
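
If they are very hard, one option is to drop any negative that scores almost as high as the positive. A rough, hypothetical sketch (the model name, helper, and margin are illustrative, not from this thread):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-base")  # illustrative

def keep_negative(query: str, positive: str, negative: str, margin: float = 0.05) -> bool:
    # Discard negatives whose similarity to the query is within `margin` of the positive's,
    # since these are often false negatives or too hard to be useful.
    emb = model.encode([query, positive, negative], normalize_embeddings=True)
    pos_score = float(emb[0] @ emb[1])
    neg_score = float(emb[0] @ emb[2])
    return neg_score < pos_score - margin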


Oh! I just noticed the issue, I believe: the epoch figure is odd.
[screenshots: train/epoch curves, which are not straight lines]

It looks like, in the 512 case (with 3 negative samples per query), you're training for about 900 steps in the first epoch, but the trainer thinks that should correspond to ~3.5 epochs. Once the epoch is actually completed, the trainer sets the epoch to 1 and keeps going. This results in a very problematic epoch figure, which presumably affects the learning rate. Could you verify whether the learning rate curve looks odd as well?

I think the 0 loss that you're seeing sometimes is because the learning rate has been reduced to 0, and then the loss spikes a ton when the learning rate is suddenly a normal number again.
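
For context, a small sketch of the mechanism I suspect, assuming the default linear warmup/decay schedule: the trainer derives both the epoch figure and the scheduler's total step count from the dataloader length, so if that length is underestimated, the learning rate decays to 0 and is then clamped there while training keeps running. The step counts below are illustrative only:

import torch
from transformers import get_linear_schedule_with_warmup

optimizer = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=5e-5)
# Suppose the trainer computed 900 total steps (with 5% warmup) but actually runs longer.
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=45, num_training_steps=900)

for step in range(1, 1201):
    optimizer.step()
    scheduler.step()
    if step in (45, 450, 900, 1100):
        print(step, scheduler.get_last_lr())  # LR peaks after warmup, decays to 0 by step 900, then stays at 0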

Could you share a bit more about your hyperparameters/training arguments?

  • Tom Aarsen

@nguyenvannghiem0312
Author

nguyenvannghiem0312 commented Nov 14, 2024

Thanks @tomaarsen for your response.
Here are the hyperparameters I used:

{
    "model": model,
    "guide_model": guide_model,
    "max_length": 1022,
    "query_prompt": "query: ",
    "corpus_prompt": "passage: ",
    "is_triplet": true,
    "number_negatives": 3,
    "loss": "CachedMultipleNegativesRankingLoss",
    "batch_size": 512,
    "mini_batch_size": 16,
    "num_train_epochs": 10,
    "warmup_ratio": 0.05,
    "fp16": false,
    "bf16": true,
    "batch_sampler": "NO_DUPLICATES",
    "eval_strategy": "steps",
    "eval_steps": 10,
    "save_strategy": "steps",
    "save_steps": 10,
    "save_total_limit": 2,
    "logging_steps": 1,
    "load_best_model_at_end": true,
    "metric_for_best_model": "eval_cosine_mrr@10",
    "learning_rate": 5e-5,
}
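
Roughly, these settings map onto the sentence-transformers v3 trainer as below (the output directory and model name are placeholders; max_length and the prompts are handled separately, as in the dataset code above):

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("intfloat/multilingual-e5-base")  # placeholder
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=16)

args = SentenceTransformerTrainingArguments(
    output_dir="output",  # placeholder
    num_train_epochs=10,
    per_device_train_batch_size=512,
    learning_rate=5e-5,
    warmup_ratio=0.05,
    fp16=False,
    bf16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
    eval_strategy="steps",
    eval_steps=10,
    save_strategy="steps",
    save_steps=10,
    save_total_limit=2,
    logging_steps=1,
    load_best_model_at_end=True,
    metric_for_best_model="eval_cosine_mrr@10",
)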

I also considered the possibility of false negatives (I used BM25 to mine the negative samples) and used the GIST method to filter them out (v7 is the run where I used GIST with the same parameters), which seemed to have a slight impact. Although the loss is still not fully stable, the instability has lessened somewhat.
[screenshot: loss curve for v7 (GIST), less unstable]
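
Concretely, the GIST setup amounts to swapping the loss for the guided cached variant; a minimal sketch (both model names are placeholders, not the ones I actually used):

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CachedGISTEmbedLoss

model = SentenceTransformer("intfloat/multilingual-e5-base")            # placeholder
guide = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")   # placeholder guide used to flag likely false negatives

loss = CachedGISTEmbedLoss(model, guide=guide, mini_batch_size=16)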

And the learning rate curve is very stable:
[screenshot: learning rate curve]

@tomaarsen
Collaborator

Hmm, that learning rate looks correct indeed.
