use warmup steps for lr scheduler, ban steps == -1 #99
Conversation
train_configs/debug_model.toml (Outdated)
```diff
@@ -26,7 +26,7 @@ lr = 8e-4
 [training]
 batch_size = 8
 seq_len = 2048
-warmup_pct = 0.20  # lr scheduler warm up
+warmup_steps = 5  # lr scheduler warm up
```
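For context, a minimal sketch of how a `warmup_steps` value like the one above typically feeds a linear warmup schedule. The function name and step counts below are illustrative assumptions, not torchtitan's actual scheduler code:

```python
import torch

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=8e-4)

warmup_steps = 5      # from [training] warmup_steps in the TOML above
training_steps = 10   # debug-model default discussed in this thread

def warmup_then_linear_decay(step: int) -> float:
    # Ramp the lr multiplier from ~0 to 1 over warmup_steps,
    # then decay linearly to 0 over the remaining steps.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return max(0.0, (training_steps - step) / (training_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=warmup_then_linear_decay
)
```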
nit - don't we want 2 here if the default steps is 10 (i.e. 20% rather than 50%)?
Shall we add a comment here suggesting the warmup should be around 20% of total steps? A user might have no idea how many steps to warm up when changing the total number of training steps.
Sounds good, I'll update. Is 20% already the rule of thumb here? I thought this was more of a tuning thing?
It's definitely a tuning thing - I did 20% to try to be safe, because we are doing such short runs by default (10 steps!) while still showing some reasonable learning gains.
A more rigorous approach would be 2 / (1 − β₂) training iterations of warmup with the current linear scheduling, which is 2000 iterations assuming the default AdamW β₂ (see https://arxiv.org/abs/1910.04209).
The issue here is that we want to show a reasonable curve in the short term, and doing that in 10 steps or even 100 is very different from the more rigorous rule above (which simplifies to 2k warmup iterations), or even from plugging it into the lr directly if we wanted to get fancy and be more exact.
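To make that comparison concrete, a quick back-of-the-envelope check of the rule of thumb against the 20% heuristic (β₂ = 0.999 is the AdamW default; the numbers are just an illustration):

```python
beta2 = 0.999                        # AdamW default
rule_of_thumb = 2 / (1 - beta2)      # ≈ 2000 warmup iterations
print(round(rule_of_thumb))          # 2000

# The 20% heuristic at different run lengths:
for total_steps in (10, 100, 10_000):
    print(total_steps, max(1, int(0.20 * total_steps)))  # 2, 20, 2000
```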
Please fix the CI errors:
- apply pre-commit
- modify the default training_steps in the test file
- the GPU test failed because HF is down and the dataset cannot be downloaded ... I guess we don't expect this to happen a lot? But for tests it may make sense to use a fake dataset (rough sketch below).
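One possible shape for that fake dataset, purely as a sketch and not an actual torchtitan test fixture: an `IterableDataset` that yields random token ids locally, so CI never touches the HF hub.

```python
import torch
from torch.utils.data import IterableDataset

class RandomTokenDataset(IterableDataset):
    """Yields (input, label) pairs of random token ids; no network needed."""

    def __init__(self, vocab_size: int = 32000, seq_len: int = 2048, num_samples: int = 64):
        self.vocab_size = vocab_size
        self.seq_len = seq_len
        self.num_samples = num_samples

    def __iter__(self):
        for _ in range(self.num_samples):
            tokens = torch.randint(0, self.vocab_size, (self.seq_len + 1,))
            yield tokens[:-1], tokens[1:]  # next-token prediction targets
```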
As titled, we don't want to allow the steps == -1 case, as it would blow up the lr scheduler.
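A minimal sketch of the kind of guard this describes (the function and field names are assumptions, not the repo's actual config code):

```python
def validate_training_config(steps: int, warmup_steps: int) -> None:
    # steps == -1 ("train forever") leaves the scheduler with no finite
    # horizon to compute its warmup/decay fractions, so reject it up front.
    if steps == -1:
        raise ValueError(
            "training.steps == -1 is not supported; "
            "the lr scheduler needs a finite step count."
        )
    if warmup_steps > steps:
        raise ValueError(f"warmup_steps ({warmup_steps}) cannot exceed steps ({steps})")
```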