Debugging memory issues #10
With debug logs:
So AFAICT, the lock
I'd love to see what's going on on that worker... I think Kubernetes lets you ssh into pods.
@TomAugspurger As a note, all of the tests in pangeo-forge/pangeo-forge-recipes#151 were run using
We did not experience any hanging worker issues beyond those outlined in pangeo-forge/pangeo-forge-recipes#144, so I'm unsure if this is Azure Blob Storage related or perhaps due to a change in the

@CiaranEvans Can we investigate how to expose the external network interface for the worker pods and the key we'll need to set so Tom can ssh into a hung worker pod?
I maybe got it with

I did another run and this time the
@TomAugspurger Had that pod died by the time you tried to connect? It's not the case of Prefect/K8s killing it as soon as it errors?
Nope, it was still alive. The only pods I've seen killed are (I think) due to #11.
Hmm okay, I'm unsure how kubectl and getting onto a running container works, tbh. If it can't get a connection, does it hang or will it error?
I believe I was mistaken when I initially said that the lock was actually able to be acquired in #10 (comment). Pangeo-forge prepends the key

My guess right now is that some other worker failed to release that lock. I'll try to confirm that.
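For context, here is a minimal sketch of the locking pattern with dask.distributed, not the actual pangeo-forge code; the lock name and scheduler address are illustrative assumptions. It shows why a worker that dies between acquire and release leaves every later acquire hanging:

```python
# Sketch of the locking pattern, not the actual pangeo-forge code.
# The lock name and scheduler address below are illustrative assumptions.
from distributed import Client, Lock

client = Client("tcp://dask-scheduler:8786")  # placeholder address

lock = Lock("pangeo-forge-time-0")  # one lock per conflicting chunk key

lock.acquire()      # blocks until the lock is free
try:
    ...             # write the conflicting chunk region here
finally:
    lock.release()  # if the worker dies before this runs, the lock stays
                    # held and every later acquire() hangs
```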
I just did a run with no locks (by commenting out the locking in TomAugspurger/pangeo-forge@f22e73c) and it completed successfully. I think this recipe doesn't have any conflicting locks, but I need to confirm that. So a few things:
Locking should only be happening if it is actually needed for concurrent writes to the same chunk. AFAIK, none of our usual test cases need any locking.
@TomAugspurger If you would like a simpler example to debug worker memory leaks, you can use https://github.com/pangeo-forge/pangeo-forge-azure-bakery/blob/dask_memory_flow/flow_test/dask_memory_flow.py so that we can isolate this issue from any interactions with other dependencies.
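For anyone without the bakery handy, a sketch along the same lines (not the linked file; Prefect 0.14-style API, and the task count and scheduler address are assumptions):

```python
# Sketch in the spirit of dask_memory_flow.py: many trivial mapped tasks and
# no pangeo-forge, so memory growth can be attributed to Prefect + Dask alone.
# Prefect 0.14-style API; task count and scheduler address are assumptions.
import time

from prefect import Flow, task
from prefect.executors import DaskExecutor


@task
def sleep_a_bit(i: int) -> int:
    time.sleep(0.1)  # stand-in for caching or writing a chunk
    return i


with Flow("dask-memory-test") as flow:
    sleep_a_bit.map(list(range(10_000)))

if __name__ == "__main__":
    flow.run(executor=DaskExecutor(address="tcp://dask-scheduler:8786"))
```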
FYI @sharkinsspatial I tried that dask_memory_flow.py flow and did not see any issues. Notes below:

- The scheduler pod is currently being killed for OOM
- The memory use climbed to ~380-385 MiB by the end.

So the summary is that Prefect seems to be doing OK with that number of tasks. I followed that up with a test that subclassed
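For reference, here's one way to sample scheduler and worker memory during a run (an assumption about method, not necessarily how the numbers above were collected; the scheduler address is a placeholder):

```python
# Sample RSS of the scheduler and of every worker while a flow is running.
# Method and address are assumptions, not how the figures above were measured.
import psutil
from distributed import Client

client = Client("tcp://dask-scheduler:8786")


def rss_mib() -> float:
    """Resident set size of the current process, in MiB."""
    return psutil.Process().memory_info().rss / 2**20


print("scheduler MiB:", client.run_on_scheduler(rss_mib))  # scheduler process
print("workers MiB:", client.run(rss_mib))                 # {worker address: MiB}
```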
@TomAugspurger 👍 Thanks for the investigation. Can you report the

Based on your
I'll verify, but this was using commit b20c361, so the dask versions are those pinned at pangeo-forge-azure-bakery/images/requirements.txt, lines 46 to 48 (at b20c361).
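To confirm what actually ran on the cluster (rather than just the pins), something like this works; the scheduler address is a placeholder:

```python
# Confirm the dask/distributed versions actually running on the cluster.
from distributed import Client

client = Client("tcp://dask-scheduler:8786")
versions = client.get_versions(check=True)  # check=True flags client/scheduler/worker mismatches
print(versions["scheduler"])  # includes the dask and distributed package versions
```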
I think the next two options are one or both of:
I started work on the two next steps mentioned in #10 (comment) at this branch: https://github.com/TomAugspurger/pangeo-forge/tree/refactor. That builds on pangeo-forge/pangeo-forge-recipes#153, adding a

I'm doing a run right now (using the XarrayToZarr that just sleeps instead of writing). It's halfway through the
@sharkinsspatial I'm picking up this debugging a bit, to verify what fixes the memory issue. My plan is to run a recipe that has the fixed

```python
import time

from pangeo_forge_recipes.storage import CacheFSSpecTarget  # import path assumed


class SleepingInputCache(CacheFSSpecTarget):
    def cache_file(self, fname, **fsspec_open_kwargs):
        time.sleep(1)  # pretend to cache the input without touching storage
        return
```

Then I'll run that with three versions of pangeo-forge-recipes:
I'll post updates here.
OK, here are the results. For each of these I built a docker image and submitted & ran the flow.
So tl/dr, pangeo-forge/pangeo-forge-recipes#160 fixes the memory issues on the scheduler, and seems to be necessary. A small note: workers are building up a bunch of unmanaged memory. That surprises me, since we're just sleeping. This might need more investigation down the road.
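As a possible follow-up on the unmanaged worker memory, a common probe is to force a garbage collection plus a glibc malloc_trim on every worker and see how much RSS is returned to the OS. A sketch only (assumes Linux/glibc workers and a placeholder scheduler address), not something that was run here:

```python
# Force a GC plus a glibc malloc_trim on every worker and see how much
# unmanaged memory comes back. Assumes Linux/glibc workers; sketch only.
import ctypes
import gc

from distributed import Client

client = Client("tcp://dask-scheduler:8786")  # placeholder address


def trim_memory() -> int:
    gc.collect()
    libc = ctypes.CDLL("libc.so.6")
    return libc.malloc_trim(0)  # 1 if memory was released back to the OS


print(client.run(trim_memory))  # {worker address: 0 or 1}
```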
Thanks so much for doing this forensic work, Tom! We will go with pangeo-forge/pangeo-forge-recipes#160.
Thanks @TomAugspurger. It appears that your PR will solve the scheduler memory growth issues associated with serialization 🎊. As you noted above, we are still seeing incremental memory growth on workers (even without actual activity), as originally noted here. This is problematic with several of our recipes, as the worker memory growth over a large number of task executions will result in eventual worker OOM failures (which we were seeing in our initial OISST testing). I'll continue tracking this here and touch base with the Prefect team again to see if they have made any progress on their investigations.
Hi all, hopefully the above will be addressed in #21 once it is reviewed and merged.
@sharkinsspatial dumping some notes below. Let me know if you want to jump on a call to discuss. I'm currently seeing workers hanging as they try to acquire a lock in pangeo-forge. I'll see what I can do to debug.
Summarizing the issues we're seeing with pangeo-forge.
** Changes to environment
Updated to latest released Dask, distributed, fsspec, adlfs. Installed pangeo-forge-recipes from GitHub.
** Cloud build
Building the images on Azure avoids the upload, but it's not a huge benefit if we have to download them to submit anyway.
** Access Dask Dashboard
One of the `dask-root` pods prefect starts is the scheduler pod.

** Add logs to worker handler
This makes the logs accessible from the Dask UI. I'm sure there's a better way to do this.
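For the record, one way to do this is a worker plugin that copies distributed's existing log handlers onto the library's logger. This is a sketch of the idea, not necessarily what was done here; it assumes the in-memory handler behind the dashboard's worker-logs page hangs off the "distributed" logger and that the library logger is named "pangeo_forge_recipes":

```python
# Sketch: forward a library logger into the handlers distributed already has,
# so its messages show up in the worker logs the Dask UI exposes. Logger names
# and the handler location are assumptions.
import logging

from distributed.diagnostics.plugin import WorkerPlugin


class ForwardLogsPlugin(WorkerPlugin):
    def setup(self, worker):
        app_logger = logging.getLogger("pangeo_forge_recipes")
        app_logger.setLevel(logging.DEBUG)
        for handler in logging.getLogger("distributed").handlers:
            app_logger.addHandler(handler)


# client.register_worker_plugin(ForwardLogsPlugin())
```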
** Hanging Workers
I'm seeing some hanging, but perhaps different from what others saw. The worker logs say
Looking at the call stack
So it seems like the issue is in `lock_for_conflicts`. Either a real deadlock, or an event loop issue.
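For reference, the call stacks of stuck tasks can be pulled from the client side; a sketch, with the scheduler address as a placeholder:

```python
# Pull the Python frames of currently executing tasks from each worker.
from distributed import Client

client = Client("tcp://dask-scheduler:8786")

for worker, tasks in client.call_stack().items():  # {worker: {task key: frames}}
    for key, frames in tasks.items():
        print(worker, key)
        for frame in frames:
            print(frame)
```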