Unstable Training Loss and Model Evaluation #3041
@nguyenvannghiem0312:
Hello @pesuchin, sorry I missed the notification. I used the CachedMNRL and MNRL loss functions and encountered issues with both. I found that training is stable when there are only anchor and positive samples; however, when I add negatives with the following code, it becomes unstable, as shown in the image:

```python
from datasets import Dataset

train_datasets = read_json_or_dataset(config["train_path"])
train_datasets = process_data(train=train_datasets, number_negatives=config["number_negatives"])

datasets = {}
anchor = [config["query_prompt"] + item["anchor"] for item in train_datasets]
positive = [config["corpus_prompt"] + item["positive"] for item in train_datasets]
datasets["anchor"] = anchor
datasets["positive"] = positive

# Add a single "negative" column when triplets are enabled
if "negative" in train_datasets[0] and config["is_triplet"]:
    negative = [config["corpus_prompt"] + item["negative"] for item in train_datasets]
    datasets["negative"] = negative

return Dataset.from_dict(datasets)
```

When I adjusted the code to add the negatives as shown below, training stabilized:

```python
from datasets import Dataset

datasets = []
for item in train_datasets:
    sample = {
        "anchor": config["query_prompt"] + item["anchor"],
        "positive": config["corpus_prompt"] + item["positive"],
    }
    # Add one column per mined negative: negative_0, negative_1, ...
    if config["is_triplet"]:
        for idx in range(config["number_negatives"]):
            sample[f"negative_{idx}"] = config["corpus_prompt"] + item[f"negative_{idx}"]
    datasets.append(sample)

datasets = Dataset.from_list(datasets)
```

I'm not sure whether there was an error in my initial code, but the updated code resolved the issue.
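For context, a minimal sketch of how such a dataset is typically consumed, assuming the sentence-transformers v3 trainer API (the model name and toy rows below are placeholders, not from the thread): with (Cached)MNRL, every column after anchor and positive is treated as an additional hard negative, so the single-"negative" layout and the negative_0…negative_k layout give the loss a different number of negatives per example.

```python
from datasets import Dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Toy rows for illustration only; prompts and texts are placeholders.
train_dataset = Dataset.from_list([
    {
        "anchor": "query: what is the capital of France?",
        "positive": "passage: Paris is the capital of France.",
        "negative_0": "passage: Berlin is the capital of Germany.",
        "negative_1": "passage: The Seine flows through Paris.",
    },
])

model = SentenceTransformer("intfloat/multilingual-e5-base")
loss = MultipleNegativesRankingLoss(model)

# Column order matters: anchor first, then positive, then any number of negatives.
trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
```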
Oh @pesuchin, I just realized that the issue comes not only from the negative samples but also from the batch size with CachedMNRL. When I train with a batch size of 128 (with 3 negative samples per query), the loss is stable. But with a batch size of 512 (with 3 negative samples per query), it becomes unstable, as shown in the image below. However, when I train with a batch size of 4096 (without negatives), the process is mostly stable, though I did notice a slight instability there. So the issue comes from both the negative samples and the batch size.
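A sketch of the relevant knobs, assuming CachedMultipleNegativesRankingLoss from sentence-transformers (the output path, learning rate, and batch sizes below are placeholders): per_device_train_batch_size determines how many in-batch candidates each anchor is scored against, which is what changes with 128 vs. 512 vs. 4096, while mini_batch_size only chunks the forward pass to save memory and should not change the result.

```python
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("intfloat/multilingual-e5-base")

# mini_batch_size only controls memory chunking; the effective contrastive batch
# is still per_device_train_batch_size, so each anchor is compared against
# roughly batch_size * (1 + number_of_negative_columns) candidates.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)

args = SentenceTransformerTrainingArguments(
    output_dir="output/e5-cmnrl",     # placeholder path
    per_device_train_batch_size=512,  # the batch size discussed above
    learning_rate=2e-5,               # placeholder value
)

# train_dataset: the anchor/positive/negative_* Dataset built earlier
trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss)
trainer.train()
```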
Although there is some instability here, the results at the best checkpoint are still quite good. However, this loss curve is not ideal for including in the report.
Thanks @tomaarsen for your response.
I also considered cases of misclassification among the negative samples (I used BM25 to mine the negatives) and used the GIST method to eliminate them (v7 is the model in which I used GIST with the same parameters), which seemed to have a slight impact. Although the loss is still not fully stable, the instability has lessened somewhat.
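For reference, a minimal sketch of that kind of GIST setup, assuming sentence-transformers' GISTEmbedLoss (the guide model name is just an example, not the one used in the thread): a guide model filters out in-batch candidates that look like false negatives, i.e. candidates it scores higher than the annotated positive, which is why it can soften instability caused by noisy BM25-mined negatives.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import GISTEmbedLoss

model = SentenceTransformer("intfloat/multilingual-e5-base")

# Guide model used only to flag likely false negatives; any reasonably strong
# embedding model can serve here (this choice is illustrative).
guide = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# In-batch candidates that the guide rates above the true positive are ignored
# by the loss instead of being pushed away as negatives.
loss = GISTEmbedLoss(model, guide)
```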
Hmm, that learning rate looks correct indeed. |
During the training of the multilingual-E5-base model, I encountered an unstable loss pattern. Previously, my training runs had a stable loss curve. I tried changing the model but ran into the same issue. Could you help me understand what this problem might be?