Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix wrong initialization of lr scheduler #256

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

kylematoba
Copy link

The initialization of the learning rate scheduler does not correctly resume. This is because the optimizer state is loaded first, and the lambda in the LR scheduler is partialied to the LR this implies. In this PR I fix(?) it by first initializing both the optimizer and the LR scheduler, then load the state dict for each.

So, for example, when you save the state dict at a step you have

lr_scheduler.lr_lambdas[0]
functools.partial(<function lr_scheduler_builder.<locals>.lr_lambda at 0x7f6e1c539fc0>, initial_lr=0.0003)

but when you reload it to resume you have

lr_scheduler.lr_lambdas[0]
functools.partial(<function lr_scheduler_builder.<locals>.lr_lambda at 0x7f00c053b760>, initial_lr=1e-05)

Obviously, using a totally wrong LR schedule makes resumption useless.

It seems to work, except for the 15 step example examples/config_tiny_llama.yaml, where it's close to working, but not exact. I'm assuming this is due to some sort of warm up or something. If it's not clear to you I can look into it further.

Plus a small fix for the code just not running in serialize/main.py.

Copy link
Member

@NouamaneTazi NouamaneTazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! please rebase main and remove lr_scheduler._initial_step() as shown here #243 (comment)

@kylematoba
Copy link
Author

@NouamaneTazi that's done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants