[VM][DMLC] Lower memory usage when loading and dumping weights #13877
Conversation
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.
Generated by tvm-bot
The approach of having an overload file support util is fine; one thing is that it would need to be part of the runtime folder, as it is simple enough. Given most of the cases are on GPU, having the ability to load one array into CPU, copy it into GPU, then immediately free that CPU array can also be effective.
@tqchen thanks for the comments. PTAL, ready for review.
Thanks @AndrewZhaoLuo, one minor comment.
* initial commit * update additional use cases * typo * asf header, summary * clean up * lint * move code to src/runtime/file_utils.h * file utils is cool
Right now there is a bad pattern in the VM executable: when loading weights, we load the entire serialized representation into memory and then deserialize from that in-memory store without progressively freeing memory.
This is bad because if our weights take up ~5 GB, the serialized representation in memory takes up 5 GB and the deserialized representation takes ~5 GB too. This means peak memory use when running the VM is roughly 2x the size of the model's weights.
This is especially painful with some of the larger models out there today.
This PR fixes that by streaming from disk, relying on the standard C file interface to buffer reads for good performance.
Some before and after graphs from loading and benchmarking a model with ~5 GB of weights:
Before:
After:
This is a draft since: