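The `checkpoint` method (activation checkpointing, available in PyTorch as `torch.utils.checkpoint`) is generally used to reduce runtime memory usage.
It trades redundant computation for lower memory consumption: activations inside a checkpointed region are dropped after the forward pass and recomputed during the backward pass. This trade-off lets each application balance memory against compute.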
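To make the trade-off concrete, here is a minimal pure-Python sketch on a chain of scalar layers. It does not use PyTorch or nntrainer; `grad_chain` and its `segment` parameter are hypothetical names chosen for illustration. Only the input of each segment is kept during the forward pass, and the rest of the segment is recomputed during the backward pass.

```python
import math


def grad_chain(layers, x, segment=1):
    """Return (gradient, peak_stored, forward_calls) for a chain of scalar layers.

    `layers` is a list of (forward, derivative) pairs. `segment` is the number
    of layers per recomputed region; segment=1 keeps every activation.
    """
    n = len(layers)
    forward_calls = 0

    # Forward pass: keep only the input of the first layer in each segment.
    saved = {}
    a = x
    for i, (f, _) in enumerate(layers):
        if i % segment == 0:
            saved[i] = a
        a = f(a)
        forward_calls += 1

    # Backward pass: rebuild each segment's inputs from its saved input,
    # then apply the chain rule through that segment in reverse order.
    grad = 1.0
    peak_stored = 0
    for start in reversed(range(0, n, segment)):
        end = min(start + segment, n)
        inputs = [saved[start]]
        for i in range(start, end - 1):
            inputs.append(layers[i][0](inputs[-1]))
            forward_calls += 1
        # Saved inputs plus the activations rebuilt for this segment.
        peak_stored = max(peak_stored, len(saved) + len(inputs) - 1)
        for i in reversed(range(start, end)):
            grad *= layers[i][1](inputs[i - start])
    return grad, peak_stored, forward_calls


layers = [(math.sin, math.cos)] * 16
print(grad_chain(layers, 0.3, segment=1))
print(grad_chain(layers, 0.3, segment=4))
```

In this toy model with 16 layers, `segment=4` holds 7 activations at peak instead of 16, at the cost of 28 forward calls instead of 16, and the gradient is unchanged.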
To implement the method in nntrainer, several considerations are necessary.
The biggest issue is that nntrainer currently supports only pre-calculated memory planning. The managed memory area must be planned before the training phase. Temporary ("instant") memory can be allocated inside layers or calculation methods, but the `checkpoint` method needs to manage this temporary memory more carefully: it has to decide when each buffer is allocated and when it is freed.
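One possible direction, sketched below under assumptions, is to give recomputed activations explicit lifetimes in the execution order so that they can be planned ahead of training like any other tensor. This is not nntrainer's planner API; `plan_lifetimes`, `peak_bytes`, and the tensor names are hypothetical, and transient activations of the forward pass are left out for brevity.

```python
def plan_lifetimes(sizes, segment):
    """Return (lifetimes, total_steps) for a chain of layers.

    sizes[i] is the size of the input activation of layer i. Each lifetime is
    (name, alloc_step, free_step, size), with both steps inclusive.
    """
    n = len(sizes)
    step = 0
    born = {}
    lifetimes = []

    # Forward pass: one step per layer; only segment inputs stay alive.
    for i in range(n):
        if i % segment == 0:
            born[i] = step
        step += 1

    # Backward pass: recompute steps for a segment, then its backward steps.
    for start in reversed(range(0, n, segment)):
        end = min(start + segment, n)
        recomputed = {}
        for i in range(start, end - 1):
            # Re-running layer i produces the input of layer i + 1.
            recomputed[i + 1] = step
            step += 1
        for i in reversed(range(start, end)):
            # The backward step of layer i is the last use of its input.
            if i == start:
                lifetimes.append((f"saved{i}", born[i], step, sizes[i]))
            else:
                lifetimes.append((f"recomp{i}", recomputed[i], step, sizes[i]))
            step += 1
    return lifetimes, step


def peak_bytes(lifetimes, total_steps):
    """Largest sum of sizes that are alive at the same step."""
    return max(
        sum(size for _, alloc, free, size in lifetimes if alloc <= s <= free)
        for s in range(total_steps)
    )


lifetimes, steps = plan_lifetimes([10, 10, 10, 10], segment=2)
for entry in lifetimes:
    print(entry)
print(peak_bytes(lifetimes, steps))
```

Because every allocation and free point is known before training starts, a static planner could in principle reuse the same region for `recomp` buffers of different segments, since their lifetimes never overlap.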
The other issue is policy: we need to decide how to choose which layers are checkpointed. Checkpointing more layers reduces memory further, but it also increases CPU usage.
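As an illustration of what such a policy could look like, the sketch below picks a uniform segment length for a simple chain of equally sized layers under a recompute budget. The cost model and the names `checkpoint_cost` and `choose_segment` are assumptions made for this example, not an existing nntrainer or PyTorch API.

```python
import math


def checkpoint_cost(n_layers, segment):
    """Return (peak_stored_activations, extra_forward_calls) for uniform segments."""
    n_saved = math.ceil(n_layers / segment)   # one saved input per segment
    longest = min(segment, n_layers)          # longest segment to rebuild
    return n_saved + longest - 1, n_layers - n_saved


def choose_segment(n_layers, max_extra_forward_calls):
    """Pick the segment length with the lowest peak within the recompute budget."""
    best = 1  # segment=1 recomputes nothing, so it always fits the budget
    for segment in range(1, n_layers + 1):
        peak, extra = checkpoint_cost(n_layers, segment)
        if extra > max_extra_forward_calls:
            continue
        if peak < checkpoint_cost(n_layers, best)[0]:
            best = segment
    return best


for budget in (0, 8, 16):
    segment = choose_segment(16, budget)
    print(budget, segment, checkpoint_cost(16, segment))
```

For a uniform chain of `n` layers the lowest peak in this model sits near a segment length of `sqrt(n)`. A real policy would also need per-layer activation sizes and per-layer compute cost, which this sketch ignores.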