[bindings] Update memory leak test #22059

rpoyner-tri · 2024-10-17T22:09:10Z

Convert the test to unittest format, rewrite the instrumentation, and track the current numbers of leaks, with the goal of reaching 0 once fixes are landed.

As leaks are fixed, the test will complain about allowing more leaks than actually occurred, prompting update or removal of the leaks allowed.

This change is

rpoyner-tri

+@jwnimmer-tri for feature review.

The results show that some tests miss the mark, and (perhaps) others may be missing. I don't know that we need to address that in this PR. YMMV.

Reviewable status: LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers

jwnimmer-tri

Reviewed 1 of 2 files at r1, all commit messages.
Reviewable status: 2 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers

bindings/pydrake/test/memory_leak_test.py line 117 at r1 (raw file):

                if key != 'self'
                and not any(isinstance(value, typ) for typ in [list, str])}

Everything from class Sentinel up above through here is quite opaque to me, and indeed has few to no comments. We will need a lot more explanatory text in the first 100 lines to hand-hold drake developers through what all is happening.

bindings/pydrake/test/memory_leak_test.py line 299 at r1 (raw file):

                f" leaks_allowed count is stale, please update the test.")

nit Lint is here (blank line).

rpoyner-tri

Reviewable status: 2 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers

bindings/pydrake/test/memory_leak_test.py line 2 at r1 (raw file):

"""Eventually this program might grow up to be an actual regression test for
memory leaks, but for now it merely serves to demonstrate such leaks.

stale

bindings/pydrake/test/memory_leak_test.py line 117 at r1 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Everything from class Sentinel up above through here is quite opaque to me, and indeed has few to no comments. We will need a lot more explanatory text in the first 100 lines to hand-hold drake developers through what all is happening.

Done.

rpoyner-tri

Reviewable status: 3 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers

bindings/pydrake/test/memory_leak_test.py line 99 at r2 (raw file):

def _counts_for_cycle_parts(o, name):
    """Print relevant reference counts for the objects expected if `o`
    participates in a cycle created by drake::pydrake::internal::ref_cycle.

anachronism

jwnimmer-tri

Reviewed all commit messages.
Reviewable status: 3 unresolved discussions, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers

bindings/pydrake/test/memory_leak_test.py line 62 at r3 (raw file):

def _object_generation(o) -> int | None:
    """Return the garbage collection generation of the passed object, or None

nit GSG mood.

Ditto throughout this file.

Suggestion:

Returns

bindings/pydrake/test/memory_leak_test.py line 71 at r3 (raw file):

        gen_id_list = [id(x) for x in gen_list]
        if id(o) in gen_id_list:
            return gen

BTW The way this is spelled seems odd to me. It is weird for performance reasons? Seems like this is simpler and more clear:

if any([x is o for x in gen_list]):
    return gen

Code quote:

        gen_id_list = [id(x) for x in gen_list]
        if id(o) in gen_id_list:
            return gen

bindings/pydrake/test/memory_leak_test.py line 128 at r3 (raw file):

            verbose: if True, print additional debug information.
        """
        self._verbose = verbose

I can't understand why this is a member field. It seems like it's only ever changed by hand, by editing the code, and never just for one dut, rather for everything at once?

Under that premise, a test-wide global VERBOSE = False would be much easier to read and understand even just for the if VERBOSE logic, but then would also allow us to split DutRepeater into pieces. Right now it's doing too many jobs at once, and is very difficult to follow and reason about.

Once "verbose" is make into a global, then make_sentinel no longer needs a self, and so it can be hoisted up right next to class Sentinel for cohesiveness. Its helper make_sentinels_from_locals would also move up there for the same reasons (dropping self).

After that, the dut functions don't need to be member functions, so would go back to also being little stateless free functions. Having them be member functions with self._leaks in their scope is very confusing.

After all of that, possibly it makes sense for DutRepeater to just merge into the class TestMemoryLeaks? Hard to say w/o seeing the next draft, but I'm struggling to understand what it's supposed to be encapsulating as a class. Possibly it still makes sense as a standalone thing. LMK.

rpoyner-tri

Reviewable status: 1 unresolved discussion, LGTM missing from assignee jwnimmer-tri(platform), needs at least two assigned reviewers

bindings/pydrake/test/memory_leak_test.py line 71 at r3 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

BTW The way this is spelled seems odd to me. It is weird for performance reasons? Seems like this is simpler and more clear:
if any([x is o for x in gen_list]):
    return gen

I was trying not to bump ref counts of stuff on the verge of GC, but maybe it doesn't matter. We'll try it your way.

bindings/pydrake/test/memory_leak_test.py line 128 at r3 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

I can't understand why this is a member field. It seems like it's only ever changed by hand, by editing the code, and never just for one dut, rather for everything at once?

Under that premise, a test-wide global VERBOSE = False would be much easier to read and understand even just for the if VERBOSE logic, but then would also allow us to split DutRepeater into pieces. Right now it's doing too many jobs at once, and is very difficult to follow and reason about.

Once "verbose" is make into a global, then make_sentinel no longer needs a self, and so it can be hoisted up right next to class Sentinel for cohesiveness. Its helper make_sentinels_from_locals would also move up there for the same reasons (dropping self).

After that, the dut functions don't need to be member functions, so would go back to also being little stateless free functions. Having them be member functions with self._leaks in their scope is very confusing.

After all of that, possibly it makes sense for DutRepeater to just merge into the class TestMemoryLeaks? Hard to say w/o seeing the next draft, but I'm struggling to understand what it's supposed to be encapsulating as a class. Possibly it still makes sense as a standalone thing. LMK.

Done.

jwnimmer-tri

Reviewed 1 of 1 files at r5, all commit messages.
Reviewable status: 7 unresolved discussions, needs at least two assigned reviewers

bindings/pydrake/test/memory_leak_test.py line 163 at r4 (raw file):

    if VERBOSE:
        for key, value in locals().copy().items():
            _counts_for_cycle_parts(value, key)

BTW Using value, key here comes across as a bit unusual, things are typically "key, value" the other way around. Probably the "counts for ..." should require calling by kwargs, and so have name=key, o=value or key=key, value=value here at the caller.

bindings/pydrake/test/memory_leak_test.py line 294 at r4 (raw file):

    diagram.SetRandomContext(simulator.get_mutable_context(), random)
    simulator.AdvanceTo(0.5)
    return _make_sentinels_from_locals("full_simulator", locals())

nit This string doesn't match the function name, which is "full_example". I imagine we want them to be 100% identical?

bindings/pydrake/test/memory_leak_test.py line 300 at r4 (raw file):

    """Calls dut() for count times in a row, performing a full garbage
    collection before and after each call. Tracks memory leaks of interest; the
    count of leaks is returned.  if `VERBOSE=True`, additional debug

typo s/if/If/

bindings/pydrake/test/memory_leak_test.py line 310 at r4 (raw file):

    Returns:
        int: the total number of leaks detected by examining returned

BTW When I read this docstring, I took it to be the number of leaked objects, not the number of dut() calls which leaked anything.

bindings/pydrake/test/memory_leak_test.py line 330 at r4 (raw file):

class TestMemoryLeaks(unittest.TestCase):
    def do_test(self, *, dut, count, leaks_allowed=0):
        """Run the requested `dut` (see _repeat() above) for `count`

nit GSG "Runs"

bindings/pydrake/test/memory_leak_test.py line 123 at r5 (raw file):

# TODO(rpoyner-tri): update or delete this when ref_cycle is adopted, changed,
# or abandoned.
def _counts_for_cycle_parts(o, name):

nit This needs a verb. Maybe "_print_counts...etc"? Currently "counts" seems like it's supposed to be the verb, which is somewhat confusing.

Actually, without any of the 22022 infra landed, this function seems somewhat premature to land now?

bindings/pydrake/test/memory_leak_test.py line 360 at r5 (raw file):

        # TODO(rpoyner-tri): investigate platform differences, if leaks persist
        # after fixing.
        leaks_by_platform = {

BTW This kind of thing is why I'm a little bit hesitant to have the "fail if the allowed number is too high". For my taste, grooming that number should be mostly manual, without CI failures nagging us about it. Since the goal number our work on this ticket is necessarily zero, I don't know that we need a lot of hand-holding to remind us to keep lowering it.

Our C++ limit_malloc has min_num_allocations but it is rarely used. If you do want to keep the lower bound as part of this test, maybe "leaks_required=1" is a better phrasing that equating leaks_allowed to leaks_required implicitly?

rpoyner-tri

+@sherm1 for next not-me platform review.

Reviewable status: LGTM missing from assignee sherm1(platform)

jwnimmer-tri

Reviewed 1 of 1 files at r6, all commit messages.
Reviewable status: LGTM missing from assignee sherm1(platform)

sherm1

Platform

Reviewed 1 of 2 files at r1, 1 of 1 files at r6, all commit messages.
Reviewable status: 3 unresolved discussions

bindings/pydrake/test/memory_leak_test.py line 99 at r6 (raw file):

        print(f"sentinel alive? {finalizer.alive}")
        if finalizer.alive:
            o = finalizer.peek()[0]

BTW a comment on this line would be helpful for me. Would a more Python-savvy person automatically understand this? Here's my (likely wrong) guess after a lot of staring at the code: peek() applied to a function returns a list of its arguments and then [0] grabs the first one, which in this case is known to be the object of interest?

bindings/pydrake/test/memory_leak_test.py line 283 at r6 (raw file):

    Returns:
        int: the total number of leaked objectss detected by examining returned

typo: objectss

bindings/pydrake/test/memory_leak_test.py line 297 at r6 (raw file):

            _report_sentinels(sentinels, "after collect")
        leaks += any(
            [sentinel.finalizer.alive for sentinel in sentinels])

BTW IIUC this adds 1 to leaks if any of this batch of sentinels is still active; it does not add 1 for each active sentinel. Is that the intention?

jwnimmer-tri

Reviewable status: 3 unresolved discussions

bindings/pydrake/test/memory_leak_test.py line 99 at r6 (raw file):

Previously, sherm1 (Michael Sherman) wrote…

BTW a comment on this line would be helpful for me. Would a more Python-savvy person automatically understand this? Here's my (likely wrong) guess after a lot of staring at the code: peek() applied to a function returns a list of its arguments and then [0] grabs the first one, which in this case is known to be the object of interest?

Does this API reference help?

https://docs.python.org/3/library/weakref.html#weakref.finalize.peek

The finalizer: weakref.finalize field is annotated as being of type weakref.finalize, and that's the API docs for its peek() method.

sherm1

Reviewable status: 2 unresolved discussions

bindings/pydrake/test/memory_leak_test.py line 99 at r6 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

Does this API reference help?

https://docs.python.org/3/library/weakref.html#weakref.finalize.peek

The finalizer: weakref.finalize field is annotated as being of type weakref.finalize, and that's the API docs for its peek() method.

Yep, thanks. I tried googling "python what does peek do" but that returned info about something else with the same name.

rpoyner-tri

Reviewable status: 2 unresolved discussions

bindings/pydrake/test/memory_leak_test.py line 297 at r6 (raw file):

Previously, sherm1 (Michael Sherman) wrote…

BTW IIUC this adds 1 to leaks if any of this batch of sentinels is still active; it does not add 1 for each active sentinel. Is that the intention?

Ah, yes, the code and documentation are at odds. Will fix.

Convert the test to unittest format, rewrite the instrumentation, and track the current numbers of leaks, with the goal of reaching 0 once fixes are landed. As leaks are fixed, the test will complain about allowing more leaks than actually occurred, prompting update or removal of the leaks allowed.

jwnimmer-tri

Reviewed 1 of 1 files at r7, all commit messages.
Reviewable status: complete! all discussions resolved, LGTM from assignees jwnimmer-tri(platform),sherm1(platform)

Convert the test to unittest format, rewrite the instrumentation, and track the current numbers of leaks, with the goal of reaching 0 once fixes are landed. As leaks are fixed, the test will complain about allowing more leaks than actually occurred, prompting update or removal of the leaks allowed.

rpoyner-tri added the release notes: none This pull request should not be mentioned in the release notes label Oct 17, 2024

rpoyner-tri assigned jwnimmer-tri Oct 17, 2024

rpoyner-tri commented Oct 17, 2024

View reviewed changes

jwnimmer-tri reviewed Oct 17, 2024

View reviewed changes

rpoyner-tri force-pushed the update-mem-leak-test branch from c93622e to 8535a92 Compare October 18, 2024 17:03

rpoyner-tri commented Oct 18, 2024

View reviewed changes

rpoyner-tri force-pushed the update-mem-leak-test branch 2 times, most recently from 6ba1b9b to 3ad502b Compare October 18, 2024 17:40

jwnimmer-tri reviewed Oct 19, 2024

View reviewed changes

rpoyner-tri force-pushed the update-mem-leak-test branch from 3ad502b to b0ce33a Compare October 21, 2024 17:30

rpoyner-tri commented Oct 21, 2024

View reviewed changes

rpoyner-tri force-pushed the update-mem-leak-test branch from b0ce33a to 6381009 Compare October 21, 2024 20:19

jwnimmer-tri reviewed Oct 21, 2024

View reviewed changes

rpoyner-tri force-pushed the update-mem-leak-test branch from 6381009 to 090ef37 Compare October 21, 2024 22:21

rpoyner-tri assigned sherm1 Oct 21, 2024

rpoyner-tri commented Oct 21, 2024

View reviewed changes

jwnimmer-tri reviewed Oct 21, 2024

View reviewed changes

jwnimmer-tri added the priority: high label Oct 21, 2024

sherm1 reviewed Oct 22, 2024

View reviewed changes

jwnimmer-tri reviewed Oct 22, 2024

View reviewed changes

sherm1 reviewed Oct 22, 2024

View reviewed changes

rpoyner-tri commented Oct 23, 2024

View reviewed changes

rpoyner-tri force-pushed the update-mem-leak-test branch from 090ef37 to f87c852 Compare October 23, 2024 00:53

jwnimmer-tri reviewed Oct 23, 2024

View reviewed changes

rpoyner-tri merged commit f4e0486 into RobotLocomotion:master Oct 23, 2024
9 checks passed

rpoyner-tri mentioned this pull request Oct 24, 2024

pydrake: Memory Leak While Generating Training Images (leaking Diagrams w/ MultibodyPlant, SceneGraph, etc) #14387

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bindings] Update memory leak test #22059

[bindings] Update memory leak test #22059

rpoyner-tri commented Oct 17, 2024 •

edited by sherm1

Loading

rpoyner-tri left a comment

jwnimmer-tri left a comment

rpoyner-tri left a comment

rpoyner-tri left a comment

jwnimmer-tri left a comment

rpoyner-tri left a comment

jwnimmer-tri left a comment

rpoyner-tri left a comment

jwnimmer-tri left a comment

sherm1 left a comment

jwnimmer-tri left a comment

sherm1 left a comment

rpoyner-tri left a comment

jwnimmer-tri left a comment

[bindings] Update memory leak test #22059

[bindings] Update memory leak test #22059

Conversation

rpoyner-tri commented Oct 17, 2024 • edited by sherm1 Loading

rpoyner-tri left a comment

Choose a reason for hiding this comment

jwnimmer-tri left a comment

Choose a reason for hiding this comment

rpoyner-tri left a comment

Choose a reason for hiding this comment

rpoyner-tri left a comment

Choose a reason for hiding this comment

jwnimmer-tri left a comment

Choose a reason for hiding this comment

rpoyner-tri left a comment

Choose a reason for hiding this comment

jwnimmer-tri left a comment

Choose a reason for hiding this comment

rpoyner-tri left a comment

Choose a reason for hiding this comment

jwnimmer-tri left a comment

Choose a reason for hiding this comment

sherm1 left a comment

Choose a reason for hiding this comment

jwnimmer-tri left a comment

Choose a reason for hiding this comment

sherm1 left a comment

Choose a reason for hiding this comment

rpoyner-tri left a comment

Choose a reason for hiding this comment

jwnimmer-tri left a comment

Choose a reason for hiding this comment

rpoyner-tri commented Oct 17, 2024 •

edited by sherm1

Loading