Watch instances not getting garbage collected #985
Comments
Thanks for the very detailed reproduction; I can confirm I see the same issue. It looks like the leak is happening in the shared api_core code. I opened an issue and a potential fix upstream. After those are through, we may want to do some cleanup on the Firestore side as well.
Great to hear! Thank you for looking into that. If there are no downsides to it, I'd throw my vote in for clearing doc_watch._snapshot_callback, even though that's not the direct cause of this particular issue. In my code I'm pointing the Watch callback at a bound method of an object which itself holds a ref to the Watch, so I get a reference cycle. I'm breaking this cycle by clearing my Watch ref after unsubscribing, but having the Watch clear its own callback ref might help keep cleanup deterministic by default in more cases, instead of having to wait for the cyclic garbage collector.
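For illustration, roughly what my setup looks like (simplified; the Listener class here is just a stand-in, not my actual code — only on_snapshot()/unsubscribe() are the library calls):

```python
# Illustrative sketch of the reference cycle described above; the Listener
# class is hypothetical.
from google.cloud import firestore


class Listener:
    def __init__(self, doc_ref: firestore.DocumentReference):
        # The Watch holds a reference to the bound method self._on_snapshot,
        # and self holds a reference back to the Watch: a reference cycle.
        self._watch = doc_ref.on_snapshot(self._on_snapshot)

    def _on_snapshot(self, docs, changes, read_time):
        pass  # handle the snapshot

    def stop(self):
        self._watch.unsubscribe()
        # Break the cycle by hand so cleanup doesn't have to wait for the
        # cyclic garbage collector.
        self._watch = None
```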
I opened a PR to address this while we wait on the upstream fix. Thanks for your patience!
Hi; just a quick follow-up on this: I'm now on google-cloud-firestore 2.20.0 with the workaround (though it seems the underlying bidi fix in google-api-core has not come through yet). I'm running tons of listeners and everything seems to be behaving well. However, I have seen a small number of these errors in my logs:
It looks like there's some rare condition where data is coming through after the close has cleared the _on_response callback. Should this be another minor fix on the bidi end to watch out for the None case or whatnot?
Hmm, ok, thanks for letting me know. It seems like maybe we should add a null check before calling the callback, but I'd like to find the root cause if we can. I wonder why the thread would not be exiting... (maybe it's just slow to exit?) We did discuss the possibility of streams being re-opened, but I don't think that should be relevant here.
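Something roughly like this (paraphrasing the consumer loop, not the actual google.api_core.bidi source):

```python
# Paraphrased consumer loop; not the actual google.api_core.bidi code.
def _thread_main(self):
    while self._bidi_rpc.is_active:
        response = self._bidi_rpc.recv()
        on_response = self._on_response
        if on_response is None:
            # close() has already cleared the callback; stop instead of
            # attempting to call None.
            break
        on_response(response)
```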
In my case I actually pretty much always get the 'WARNING:google.api_core.bidi:Background thread did not exit.' log. You can see in my original post above that it appears in my repro case output. So I guess that would explain how that error case is possible. Should I poke around and try to figure out what is causing that thread to fail to exit on my end?
Environment details
google-cloud-firestore version: 2.19.0

I'm working on a Python server app which creates/destroys a decent number of document listeners as part of its operation, and I have noticed some of my objects being kept alive longer than I'd expect. I have traced this down to Firestore Watch objects hanging around holding onto callbacks for a long time after I've unsubscribed and released all references to them.
I've put together a minimal repro case which demonstrates a reluctant-to-die Watch object, and also a workaround example showing how clearing a few internal fields after an unsubscribe call seems to break cycles or whatnot and allow it to go down immediately.
Steps to reproduce
The test_watch_cleanup() call in the code below takes a Client instance, creates a Watch, unsubscribes the Watch a few seconds later, and finally spins off a thread holding only a weak ref and waits for the Watch to be garbage collected. In my case this results in the Watch living on indefinitely, or at least for a long while.

Code example
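Roughly along these lines (a simplified sketch; the document path, timings, and the fields cleared under WORKAROUND are placeholders, not the exact original code):

```python
# Minimal repro sketch; paths, timings, and the WORKAROUND field names are
# illustrative assumptions, not the exact code from the original report.
import gc
import threading
import time
import weakref

from google.cloud import firestore

WORKAROUND = False  # set True to clear internal Watch fields after unsubscribing


def test_watch_cleanup(db: firestore.Client) -> None:
    doc_ref = db.collection("test-collection").document("test-doc")

    def on_snapshot(docs, changes, read_time):
        print("got snapshot")

    watch = doc_ref.on_snapshot(on_snapshot)
    time.sleep(3.0)

    watch.unsubscribe()

    if WORKAROUND:
        # Clear internal references so the Watch isn't kept alive by cycles.
        # (These field names are assumptions based on the discussion above.)
        watch._snapshot_callback = None
        watch._rpc = None
        watch._consumer = None

    watch_ref = weakref.ref(watch)
    del watch, doc_ref, on_snapshot

    def wait_for_gc():
        start = time.monotonic()
        while watch_ref() is not None:
            gc.collect()
            print(f"Watch still alive after {time.monotonic() - start:.0f}s")
            time.sleep(1.0)
        print("Watch was garbage collected")

    threading.Thread(target=wait_for_gc, daemon=True).start()
```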
Results
With WORKAROUND=False I get:
With WORKAROUND=True I get:
Curious if others get the same results.
If so, would it make sense to add something similar to my workaround code as part of unsubscribe() or whatnot to allow these things to go down more smoothly?
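For instance, something roughly like this (a sketch only, assuming unsubscribe() delegates to close(); not the current library code):

```python
# Hypothetical tweak to Watch.unsubscribe(); not the current library code.
def unsubscribe(self):
    self.close()
    # Drop the user-supplied callback so the Watch no longer pins user
    # objects (and any cycles through them) once the listener is stopped.
    self._snapshot_callback = None
```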