Line 25 of lr_decay.py reads:

layer_scales = list(layer_decay ** (num_layers - i) for i in range(num_layers + 1))

The elements of layer_scales increase with the index, so the per-layer learning rates also follow "the deeper the layer, the larger the learning rate". I printed the learning rates after executing lr_sched.adjust_learning_rate, and the deeper layers do indeed get larger learning rates. But shouldn't deeper layers get a smaller learning rate? I'm confused. Please answer my question. Thanks.
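For concreteness, evaluating that line with assumed values (layer_decay = 0.75 and num_layers = 12 here are illustrative, not taken from the issue) shows the scales growing with the index:

```python
layer_decay = 0.75   # assumed decay factor, for illustration only
num_layers = 12      # assumed number of transformer blocks

layer_scales = list(layer_decay ** (num_layers - i) for i in range(num_layers + 1))
print([round(s, 4) for s in layer_scales])
# [0.0317, 0.0422, 0.0563, 0.0751, 0.1001, 0.1335, 0.178, 0.2373, 0.3164, 0.4219, 0.5625, 0.75, 1.0]
```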
The layers are indexed so that the first block (the one that is closest to the raw input) has index 0, and the last block (the one closest to predicting the logits) has index L - 1. So the later layers do correctly get a larger learning rate.
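Here is a minimal, self-contained sketch of that indexing convention (this is not MAE's actual param_groups_lrd implementation; the toy model, base_lr, layer_decay, and num_layers values are all assumptions chosen for illustration):

```python
import torch
import torch.nn as nn

num_layers = 4        # assumed small value for readability
layer_decay = 0.75    # assumed decay factor
base_lr = 1e-3        # assumed base learning rate

# A toy stack of "blocks" plus a head, just to illustrate the grouping.
blocks = nn.ModuleList([nn.Linear(8, 8) for _ in range(num_layers)])
head = nn.Linear(8, 2)

# Same formula as the quoted line: scale for layer id i is layer_decay ** (num_layers - i).
layer_scales = [layer_decay ** (num_layers - i) for i in range(num_layers + 1)]

param_groups = []
for i, block in enumerate(blocks):
    # Block i (i = 0 is closest to the raw input) gets layer id i,
    # hence a smaller scale than later blocks.
    param_groups.append({"params": block.parameters(),
                         "lr": base_lr * layer_scales[i]})
# The head, closest to the logits, gets the last layer id and the full base lr.
param_groups.append({"params": head.parameters(),
                     "lr": base_lr * layer_scales[num_layers]})

optimizer = torch.optim.AdamW(param_groups)
for g in optimizer.param_groups:
    print(g["lr"])
# 0.000316..., 0.000422..., 0.000563..., 0.00075, 0.001
# -> block 0 (nearest the input) gets the smallest lr, the head gets the largest.
```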