Bug when loading a checkpoint under the static-graph AMP O2 strategy #39050

Closed
sneaxiy opened this issue Jan 19, 2022 · 3 comments
sneaxiy commented Jan 19, 2022

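  • Version and environment information:
       1) PaddlePaddle version: develop
       2) CPU: N/A
       3) GPU: N/A
       4) System environment: N/A
    Note: you can obtain the information above by running summary_env.py.
  • Training information
       1) Single machine/multiple machines, single card/multiple cards
       2) GPU memory information
       3) Operator information
  • Reproduction information: N/A
  • Problem description: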

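Under the static-graph AMP O2 strategy, a cast op is inserted into the startup program to cast each param into an FP32 master param. Suppose I have a checkpoint that I want to load into the model. Whether I load it before or after amp_init, the result is wrong:

  • If load checkpoint comes before amp_init:
exe.run(startup_program) # at this point, param and master param are both FP32
load_checkpoint(...) # param is overwritten with the checkpoint value, but master param is unchanged and no longer matches param
optimizer.amp_init() # param is cast to FP16; master param and param are still out of sync
  • If load checkpoint comes after amp_init:
exe.run(startup_program) # at this point, param and master param are both FP32
optimizer.amp_init() # param is cast to FP16; master param and param are in sync
load_checkpoint(...) # param is overwritten with the checkpoint value, but master param is unchanged and no longer matches param

So wherever load checkpoint is placed, the result is wrong and the checkpoint cannot be loaded correctly.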
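The two orderings can be illustrated with a toy model. This is only a sketch and not Paddle code: `ToyScope`, `run_startup`, `amp_init`, `load_checkpoint` and `in_sync` are hypothetical stand-ins. It assumes, as described above, that loading a checkpoint overwrites only the param and never touches the master param.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


class ToyScope:
    """Hypothetical stand-in for a scope holding one param and its master copy."""

    def __init__(self):
        self.param = None
        self.master = None


def run_startup(scope, init_value):
    # The startup program initializes param; the inserted cast op
    # then copies it into the FP32 master param.
    scope.param = init_value
    scope.master = init_value


def amp_init(scope):
    # Only the param is cast to FP16; the master param is left alone.
    scope.param = to_fp16(scope.param)


def load_checkpoint(scope, value):
    # Assumption: the checkpoint holds the param only, not the master param.
    scope.param = value


def in_sync(scope):
    # The two copies agree if they are equal after rounding to FP16.
    return to_fp16(scope.master) == to_fp16(scope.param)


# Ordering 1: load the checkpoint, then call amp_init.
a = ToyScope()
run_startup(a, 0.5)
load_checkpoint(a, 0.25)
amp_init(a)
sync_load_first = in_sync(a)

# Ordering 2: call amp_init, then load the checkpoint.
b = ToyScope()
run_startup(b, 0.5)
amp_init(b)
load_checkpoint(b, 0.25)
sync_amp_first = in_sync(b)

print(sync_load_first, sync_amp_first)
```

In both orderings the master param keeps the startup value while the param holds the checkpoint value, which is the mismatch reported here.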


zhiqiu commented Jan 21, 2022

Checkpoints produced by Paddle static-graph AMP O2 training now support saving the master weights. You can use the following approach:

loss = net()
optimizer = paddle.static.amp.decorate(opt)
optimizer.minimize(loss)
exe.run(startup_program)
optimizer.amp_init()
for i in range(10):
    exe.run(main_program)
# the saved state dict includes the master weights
paddle.save(main_program.state_dict(), path)
state_dict_load = paddle.load(path)
main_program.set_state_dict(state_dict_load)
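A toy illustration of why saving the master weights alongside the params avoids the mismatch. This is a sketch under stated assumptions, not Paddle's implementation: plain dicts stand in for the program state, and the variable names `w` and `w_fp32_master` are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


# Hypothetical state after training: the FP32 master weight carries more
# precision than the FP16 param derived from it.
trained = {"w_fp32_master": 0.2501}
trained["w"] = to_fp16(trained["w_fp32_master"])

# Saving: the checkpoint keeps both the param and its master weight.
checkpoint = dict(trained)

# A freshly initialized state, as after the startup program and amp_init.
restored = {"w_fp32_master": 0.5, "w": to_fp16(0.5)}

# Loading: every entry is overwritten, so both copies come from the checkpoint.
restored.update(checkpoint)

consistent = to_fp16(restored["w_fp32_master"]) == restored["w"]
print(restored, consistent)
```

Because both values are restored together, the param and the master weight stay consistent regardless of when the load happens relative to the cast.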

Dynamic graph mode does not yet support saving master weights; support is planned. #39121

paddle-bot bot commented Jan 31, 2023

Since you haven't replied for more than a year, we have closed this issue/pr.
If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up.
