Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix memory leak in xml.cc. #322

Closed
wants to merge 2 commits into from
Closed

Fix memory leak in xml.cc. #322

wants to merge 2 commits into from

Conversation

capcah
Copy link

@capcah capcah commented Apr 14, 2020

This patch fixes the memory leak documented in
#321, where one of the buffers
allocated by realpath(), inside getPciPath() is not freed.

The memory management aspect of this function also seemed odd and
unecessary, as the realpath() function is documented to only write up to
PATH_MAX bytes to the buffer passed to it, meaning we don't need dynamic
memory allocation at all. I also changed the function signature of
getPciPath to enforce the use of a fixed-size buffer.

This patch fixes the memory leak documented in
NVIDIA#321, where one of the buffers
allocated by realpath(), inside getPciPath() is not freed.

The memory management aspect of this function also seemed odd and
unecessary, as the realpath() function is documented to only write up to
PATH_MAX bytes to the buffer passed to it, meaning we don't need dynamic
memory allocation at all. I also changed the function signature of
getPciPath to enforce the use of a fixed-size buffer.
@sjeaugey
Copy link
Member

Thanks. Good point on buffer sizes, there are other places where we use realpath that way and which we might want to change as well (although they don't cause mem leaks).

One issue with the proposed change is that it would call getPciPath for each key=value we read. The if (path == NULL) was there to only call getPciPath once. So we might want to call it once at the beginning unconditionally (provided it will always work) instead. That might be sub-optimal if we end up not needing it, but it might be better than calling it 4 times.

Reduce the number of getPciPath calls to a single one per invocation
and split the function in two so that the large `path` buffer does
not linger the in the stack during recursive calls.
@capcah
Copy link
Author

capcah commented Apr 14, 2020

Done. While doing that, I also looked more closely and realized that ncclTopoGetXmlFromSys is a recursive function. Since MAX_PATH is 4096 in most nix* systems, that's a bit too heavy if the function can recurse any meaningful number of times, so I split the non-recursive part into a separate function so that the stack resources could be reclaimed.

There were two other alternatives there: Wrapping the path buffer in a scope and making the function tail-recursive. The first solution ended up looking really weird and counter-intuitive, the second is too brittle and I'm worried it might get broken in the future silently.

@sjeaugey
Copy link
Member

Well, the recursion just goes up the path. Each step corresponding to a PCI switch (a real piece of hardware) I doubt it will go to extreme sizes.

That said, after a second thought, couldn't we have solved that whole issue adding a single free(pciPath); line ?

@capcah
Copy link
Author

capcah commented Apr 14, 2020

We could, but I also decided to take a shot at making the memory management simpler to avoid this happening again in the future. That sort of backfired when I noticed the function was recursive, so I'd be fine with just reverting that and free-ing the leaked memory, up to you.

Regarding the stack size, some environments run with pretty small stack sizes, like 16, or 64k. In those cases, keeping multiple 4k stack allocations would already cause problems.

@chsigg
Copy link
Contributor

chsigg commented Jun 30, 2020

What is the current status on the leak, has this been fixed in v2.7.6? Thanks!

@sjeaugey
Copy link
Member

Looks like it wasn't. As most of the changes were in ncclTopoGetXmlFromSys, I got confused and not sure what the fix was since there was a free(path) at the end of the function. But in fact all that was to fix the other call to getPciPath later in the NVLink class detection.

So I believe all we need now is :

diff --git a/src/graph/xml.cc b/src/graph/xml.cc
index cc91b925..cb0125c1 100644
--- a/src/graph/xml.cc
+++ b/src/graph/xml.cc
@@ -647,6 +647,7 @@ ncclResult_t ncclTopoGetXmlFromGpu(struct ncclXmlNode* pciNode, nvmlDevice_t nvm
         char* path;
         NCCLCHECK(getPciPath(busId, &path));
         NCCLCHECK(ncclTopoSetAttrFromSys(sub, path, "class", "tclass"));
+        free(path);
       }
     }
   }

@capcah is that correct ?

@capcah
Copy link
Author

capcah commented Jul 1, 2020

Correct

@sjeaugey
Copy link
Member

This has been fixed for some time now. Closing.

@sjeaugey sjeaugey closed this Jan 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants