-
Notifications
You must be signed in to change notification settings - Fork 862
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix memory leak in xml.cc. #322
Conversation
This patch fixes the memory leak documented in NVIDIA#321, where one of the buffers allocated by realpath(), inside getPciPath() is not freed. The memory management aspect of this function also seemed odd and unecessary, as the realpath() function is documented to only write up to PATH_MAX bytes to the buffer passed to it, meaning we don't need dynamic memory allocation at all. I also changed the function signature of getPciPath to enforce the use of a fixed-size buffer.
Thanks. Good point on buffer sizes, there are other places where we use realpath that way and which we might want to change as well (although they don't cause mem leaks). One issue with the proposed change is that it would call getPciPath for each key=value we read. The |
Reduce the number of getPciPath calls to a single one per invocation and split the function in two so that the large `path` buffer does not linger the in the stack during recursive calls.
Done. While doing that, I also looked more closely and realized that ncclTopoGetXmlFromSys is a recursive function. Since MAX_PATH is 4096 in most nix* systems, that's a bit too heavy if the function can recurse any meaningful number of times, so I split the non-recursive part into a separate function so that the stack resources could be reclaimed. There were two other alternatives there: Wrapping the path buffer in a scope and making the function tail-recursive. The first solution ended up looking really weird and counter-intuitive, the second is too brittle and I'm worried it might get broken in the future silently. |
Well, the recursion just goes up the path. Each step corresponding to a PCI switch (a real piece of hardware) I doubt it will go to extreme sizes. That said, after a second thought, couldn't we have solved that whole issue adding a single |
We could, but I also decided to take a shot at making the memory management simpler to avoid this happening again in the future. That sort of backfired when I noticed the function was recursive, so I'd be fine with just reverting that and free-ing the leaked memory, up to you. Regarding the stack size, some environments run with pretty small stack sizes, like 16, or 64k. In those cases, keeping multiple 4k stack allocations would already cause problems. |
What is the current status on the leak, has this been fixed in v2.7.6? Thanks! |
Looks like it wasn't. As most of the changes were in So I believe all we need now is :
@capcah is that correct ? |
Correct |
This has been fixed for some time now. Closing. |
This patch fixes the memory leak documented in
#321, where one of the buffers
allocated by realpath(), inside getPciPath() is not freed.
The memory management aspect of this function also seemed odd and
unecessary, as the realpath() function is documented to only write up to
PATH_MAX bytes to the buffer passed to it, meaning we don't need dynamic
memory allocation at all. I also changed the function signature of
getPciPath to enforce the use of a fixed-size buffer.