use warmup steps for lr scheduler, ban steps == -1 #99

Merged
merged 1 commit into main on Feb 29, 2024
Conversation

wanchaol (Collaborator)
As titled: we don't want to allow the steps == -1 case, since it would blow up the lr scheduler.
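For context, a minimal sketch of why a non-positive step count breaks a linear warmup schedule, assuming a LambdaLR-style warmup (an illustration only, not necessarily torchtitan's actual scheduler code):

```python
# Minimal sketch, assuming a LambdaLR-based linear warmup; not torchtitan's actual code.
import torch
from torch.optim.lr_scheduler import LambdaLR

def linear_warmup_lambda(warmup_steps: int):
    # A steps value of -1 propagated here would make the warmup fraction negative,
    # so non-positive values are rejected up front.
    if warmup_steps <= 0:
        raise ValueError(f"warmup_steps must be positive, got {warmup_steps}")

    def lr_lambda(current_step: int) -> float:
        # Ramp linearly from 1/warmup_steps up to 1.0, then hold.
        return min(1.0, float(current_step + 1) / float(warmup_steps))

    return lr_lambda

model = torch.nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=8e-4)
scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup_lambda(warmup_steps=5))
```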

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Feb 28, 2024
@@ -26,7 +26,7 @@ lr = 8e-4
[training]
batch_size = 8
seq_len = 2048
-warmup_pct = 0.20 # lr scheduler warm up
+warmup_steps = 5 # lr scheduler warm up
Contributor:
nit - don't we want 2 here if the default number of steps is 10 (i.e., 20% warmup rather than 50%)?

Contributor:
Shall we add a comment here suggesting the warmup be around 20% of total steps? A user might have no idea how many steps to warm up for when changing the total number of training steps.

Collaborator (Author):
Sounds good, I'll update. Is 20% already the rule of thumb here? I thought this was more of a tuning thing.

Contributor:
It's definitely a tuning thing - I picked 20% to be safe, because we are doing such short runs by default (10 steps!) while still showing some reasonable learning gains.
A more rigorous approach would be to warm up for 2/(1−β₂) training iterations with the current linear scheduling, which is 2000 iterations assuming default AdamW (see https://arxiv.org/abs/1910.04209).

The issue here is that we want to show a reasonable curve in the short term, and doing that in 10 or even 100 steps is very different from the more rigorous rule above (which simplifies to ~2k warmup iterations), or from plugging it into the lr schedule directly if we wanted to get fancy and be more exact.
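For reference, the arithmetic behind that figure, using PyTorch AdamW's default β₂ = 0.999:

```python
# 2 / (1 - beta2) warmup rule of thumb from arXiv:1910.04209,
# evaluated with PyTorch AdamW's default beta2 = 0.999.
beta2 = 0.999
warmup_iters = 2 / (1 - beta2)
print(round(warmup_iters))  # 2000 -> roughly 2k warmup iterations
```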

@tianyu-l (Contributor) left a comment:
Please fix the CI errors:

  1. apply pre-commit
  2. modify the default training_steps in the test file
  3. the GPU test failed because HF is down and the dataset cannot be downloaded ... I guess we don't expect this to happen often? But for the test it may make sense to use a fake dataset (see the sketch after this list).
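A rough sketch of the fake-dataset idea (class name and shapes are hypothetical, not torchtitan's actual test utilities) that would keep the GPU test independent of HF availability:

```python
# Hypothetical sketch of a fake dataset for tests; not torchtitan's actual test code.
import torch
from torch.utils.data import Dataset

class FakeTokenDataset(Dataset):
    """Serves random token sequences so tests never depend on downloading a real dataset."""

    def __init__(self, vocab_size: int = 32000, seq_len: int = 2048, num_samples: int = 64):
        self.vocab_size = vocab_size
        self.seq_len = seq_len
        self.num_samples = num_samples

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, idx: int):
        tokens = torch.randint(0, self.vocab_size, (self.seq_len + 1,))
        return tokens[:-1], tokens[1:]  # input ids and next-token labels
```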


@wanchaol merged commit 6d5a71e into main on Feb 29, 2024 (4 checks passed)
lessw2020 pushed a commit that referenced this pull request on Apr 18, 2024
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request on Aug 17, 2024