add learning rate visualisation and manual parameter #161
Conversation
Hi Luca! I realize that the default learning rate should depend on the architecture, and that was not well done: low learning rates like 0.0001 or lower are typical for BERT, but RNNs need a much higher value. So the older default (0.001) was too high for BERT I think, but the new default in this PR (0.0001) is now too low for RNN models. It's definitely useful to add it as a command line parameter, but I think we should set the default learning rate in the …
OK, I double checked: for both sequence labeling and text classification, the learning rate for all transformer architectures is hard-coded at 2e-5. Only RNN models were using the config learning rate value, and the default (0.001) was set for this.
So that was my assumption when I added the decay optimizers.
Thanks for the clarification.
We can set the value also in the application, but at least we don't risk running it with the wrong default value. Let me know if this makes sense.
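A minimal sketch of the kind of architecture-dependent default discussed here, using the values quoted in this thread (2e-5 for transformer fine-tuning, 0.001 as the former RNN default); the names are hypothetical, this is not the code from the PR:

```python
# Hypothetical sketch, not the actual code from this PR: pick a default
# learning rate from the model architecture when none is given explicitly.
DEFAULT_LEARNING_RATES = {
    "transformer": 2e-5,  # typical for BERT-like fine-tuning
    "rnn": 0.001,         # former config default used by the RNN models
}

def resolve_learning_rate(transformer_name=None, learning_rate=None):
    """Return the explicitly requested learning rate, or an
    architecture-dependent default when none is provided."""
    if learning_rate is not None:
        return learning_rate
    if transformer_name is not None:
        return DEFAULT_LEARNING_RATES["transformer"]
    return DEFAULT_LEARNING_RATES["rnn"]
```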
…different default for transformers-based and non-transformers-based models
I've fixed the default values (also in the classification trainer). I've added a callback that prints the decayed learning rate at each epoch; however, I have the following question:
The initial learning rate is …
or, for non-transformers, with the Adam optimizer: …
For this case, should I assume …
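For reference, a minimal sketch of a Keras callback that prints the optimizer's effective learning rate at the end of each epoch, similar in spirit to the one mentioned above (a TensorFlow/Keras backend is assumed; the class name and details are illustrative, not the code from this PR):

```python
import tensorflow as tf
from tensorflow import keras

class LearningRateLogger(keras.callbacks.Callback):
    """Print the optimizer's effective learning rate after each epoch."""

    def on_epoch_end(self, epoch, logs=None):
        lr = self.model.optimizer.learning_rate
        # If a schedule (decay, warm-up, ...) is attached, evaluate it at the
        # current iteration count to get the value actually in use.
        if isinstance(lr, keras.optimizers.schedules.LearningRateSchedule):
            lr = lr(self.model.optimizer.iterations)
        print("epoch %d: learning rate = %.2e"
              % (epoch + 1, float(tf.keras.backend.get_value(lr))))
```

Passing an instance in callbacks=[LearningRateLogger()] to model.fit(...) is enough to get one line per epoch.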
By removing the warmup steps, the learning rate does not float around.
Yes, normally warm-up is important when fine-tuning with transformers and ELMo (if I remember well, warm-up is more important than the decay in learning rate!). The …
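For context, a sketch of the kind of linear warm-up followed by linear decay commonly used when fine-tuning transformers (TensorFlow/Keras assumed; the class and argument names are illustrative, not taken from this repository):

```python
import tensorflow as tf

class WarmupLinearDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    """Ramp the learning rate up linearly for `warmup_steps`, then decay it
    linearly to zero at `total_steps`."""

    def __init__(self, peak_lr, warmup_steps, total_steps):
        self.peak_lr = peak_lr
        self.warmup_steps = float(warmup_steps)
        self.total_steps = float(total_steps)

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        warmup = self.peak_lr * step / tf.maximum(1.0, self.warmup_steps)
        decay = self.peak_lr * tf.maximum(
            0.0,
            (self.total_steps - step)
            / tf.maximum(1.0, self.total_steps - self.warmup_steps),
        )
        return tf.where(step < self.warmup_steps, warmup, decay)

# e.g. optimizer = tf.keras.optimizers.Adam(
#          learning_rate=WarmupLinearDecay(2e-5, 500, 10000))
```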
In line with the use of incremental training, knowing the final learning rate of the previous training and being able to set it manually can be helpful.
This PR (updated list):
- adds the --learning-rate parameter to the *Tagging applications, to override the default learning rate value
- sets a different default learning rate for transformers-based models (2e-5) and non-transformers-based models (0.0001)
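To illustrate the kind of override added here, a small self-contained sketch of a --learning-rate command-line flag (only the flag name comes from this PR; the other arguments, defaults, and messages are illustrative):

```python
import argparse

parser = argparse.ArgumentParser(description="train a sequence labelling model")
parser.add_argument("--architecture", default="BidLSTM_CRF",
                    help="model architecture to train")
parser.add_argument("--learning-rate", type=float, default=None,
                    help="override the architecture-dependent default learning rate")
args = parser.parse_args()

# argparse exposes --learning-rate as args.learning_rate; None means
# "keep the architecture-dependent default" rather than forcing a value.
if args.learning_rate is not None:
    print("using user-provided learning rate:", args.learning_rate)
else:
    print("using the default learning rate for architecture", args.architecture)
```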