This repository has been archived by the owner on Nov 21, 2023. It is now read-only.

Debugging memory issues #10

Open
TomAugspurger opened this issue Jun 9, 2021 · 20 comments

@TomAugspurger commented Jun 9, 2021

@sharkinsspatial dumping some notes below. Let me know if you want to jump on a call to discuss. I'm currently seeing workers hanging as they try to acquire a lock in pangeo-forge. I'll see what I can do to debug.

Summarizing the issues we're seeing with pangeo-forge.

** Changes to environment

Updated to latest released Dask, distributed, fsspec, adlfs. Installed pangeo-forge-recipes from GitHub.

** Cloud build

If you want, you can build the images on Azure, which avoids the upload. Not a huge benefit, though, if we have to download them to submit.

diff --git a/scripts/build_and_push_image.sh b/scripts/build_and_push_image.sh
index 36a8226..d6b7809 100755
--- a/scripts/build_and_push_image.sh
+++ b/scripts/build_and_push_image.sh
@@ -1,15 +1,15 @@
 #!/bin/bash
+set -eu
 
 REGISTRY_NAME=$(terraform -chdir="terraform" output -raw bakery_image_registry_name)
 LOCAL_TAG=pangeo-forge-azure-bakery-image
 REMOTE_TAG=$REGISTRY_NAME.azurecr.io/$LOCAL_TAG:latest
 
 az acr login --name $REGISTRY_NAME
+az acr build --registry $REGISTRY_NAME --image $LOCAL_TAG:latest ./images/
 
-docker build -t $LOCAL_TAG -f ./images/Dockerfile ./images/
-
-docker tag $LOCAL_TAG $REMOTE_TAG
-
-docker push $REMOTE_TAG
+# docker build -t $LOCAL_TAG -f ./images/Dockerfile ./images/
+# docker tag $LOCAL_TAG $REMOTE_TAG
+# docker push $REMOTE_TAG
 
 python3 ./scripts/replace_or_insert_value.py $REMOTE_TAG AZURE_BAKERY_IMAGE

** Access Dask Dashboard

One of the dask-root pods Prefect starts is the scheduler pod.

kubectl -n pangeo-forge port-forward `kubectl -n pangeo-forge get pod -l dask.org/component=scheduler -o name` 8786 8787

** Add logs to worker handler

This makes the logs accessible from the Dask UI. I'm sure there's a better way to do this.

import logging
from functools import wraps

from distributed import get_worker


def set_log_level(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        level = logging.INFO

        logger = logging.getLogger("pangeo_forge_recipes")
        logger.setLevel(level)
        handler = logging.StreamHandler()
        handler.setLevel(level)
        formatter = logging.Formatter("[%(asctime)s - %(levelname)s - %(filename)s:%(lineno)s - %(funcName)10s() ] %(message)s")
        handler.setFormatter(formatter)
        logger.addHandler(handler)

        try:
            worker = get_worker()
            worker._setup_logging(logger)
        except ValueError:
            pass

        result = func(*args, **kwargs)
        return result

    return wrapper
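As a rough illustration of how the decorator gets applied (the names below are assumptions, not the exact recipe code): wrap the stage function before it is mapped into the Prefect flow, so each task configures worker-side logging when it first runs.

@set_log_level
def store_chunk_task(chunk_key):
    # `recipe` is the XarrayZarrRecipe instance built elsewhere in the flow
    # module (hypothetical wiring); the wrapper only adds the logging setup.
    recipe.store_chunk(chunk_key)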

** Hanging Workers

I'm seeing some hanging, but perhaps different from what others saw. The worker logs say

pangeo_forge_recipes.storage - INFO - Opening 'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/201909/oisst-avhrr-v02r01.20190920.nc' from cache
pangeo_forge_recipes.recipes.xarray_zarr - INFO - Combining inputs for chunk '(0,)'
pangeo_forge_recipes.recipes.xarray_zarr - INFO - Storing variable time chunk (0,) to Zarr region (slice(0, 20, None),)
pangeo_forge_recipes.recipes.xarray_zarr - INFO - Storing variable anom chunk (0,) to Zarr region (slice(0, 20, None), slice(None, None, None), slice(None, None, None), slice(None, None, None))
pangeo_forge_recipes.recipes.xarray_zarr - INFO - Storing variable err chunk (0,) to Zarr region (slice(0, 20, None), slice(None, None, None), slice(None, None, None), slice(None, None, None))
pangeo_forge_recipes.recipes.xarray_zarr - INFO - Storing variable ice chunk (0,) to Zarr region (slice(0, 20, None), slice(None, None, None), slice(None, None, None), slice(None, None, None))

Looking at the call stack

Call Stacks: tcp://10.244.0.164:44331
Worker: tcp://10.244.0.164:44331
Key: MappedTaskWrapper-0-0e2e432e5be94e1797ddf1cdda29b9fd

File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
  self._bootstrap_inner()
File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
  self.run()
File "/usr/local/lib/python3.8/threading.py", line 870, in run
  self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.8/site-packages/distributed/threadpoolexecutor.py", line 55, in _worker
  task.run()
File "/usr/local/lib/python3.8/site-packages/distributed/_concurrent_futures_thread.py", line 66, in run
  result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.8/site-packages/distributed/worker.py", line 3722, in apply_function
  result = function(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/prefect/engine/flow_runner.py", line 775, in run_task
  return task_runner.run(
File "/usr/local/lib/python3.8/site-packages/prefect/utilities/executors.py", line 425, in wrapper
  return func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 354, in run
  end_state = super().run(
File "/usr/local/lib/python3.8/site-packages/prefect/utilities/executors.py", line 425, in wrapper
  return func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 298, in run
  state = self.get_task_run_state(state, inputs=task_inputs)
File "/usr/local/lib/python3.8/site-packages/prefect/utilities/executors.py", line 71, in inner
  return runner_method(self, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/prefect/engine/runner.py", line 48, in inner
  new_state = method(self, state, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 865, in get_task_run_state
  value = prefect.utilities.executors.run_task_with_timeout(
File "/usr/local/lib/python3.8/site-packages/prefect/utilities/executors.py", line 323, in run_task_with_timeout
  return task.run(*args, **kwargs)  # type: ignore
File "/oisst_recipe.py", line 39, in wrapper
File "/usr/local/lib/python3.8/site-packages/rechunker/executors/prefect.py", line 30, in run
  return self.stage.func(key)
File "/usr/local/lib/python3.8/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py", line 325, in store_chunk
  with lock_for_conflicts(lock_keys):
File "/usr/local/lib/python3.8/contextlib.py", line 113, in __enter__
  return next(self.gen)
File "/usr/local/lib/python3.8/site-packages/pangeo_forge_recipes/utils.py", line 101, in lock_for_conflicts
  lock.acquire()
File "/usr/local/lib/python3.8/site-packages/distributed/lock.py", line 135, in acquire
  result = self.client.sync(
File "/usr/local/lib/python3.8/site-packages/distributed/client.py", line 853, in sync
  return sync(
File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 351, in sync
  e.wait(10)
File "/usr/local/lib/python3.8/threading.py", line 558, in wait
  signaled = self._cond.wait(timeout)
File "/usr/local/lib/python3.8/threading.py", line 306, in wait
  gotit = waiter.acquire(True, timeout)

So it seems the issue is in lock_for_conflicts: either a real deadlock, or an event loop issue.
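For context, the locking pattern involved here, reconstructed from the traceback above and the debug logs in the next comment rather than from the exact pangeo-forge-recipes source, is roughly a context manager that takes one distributed.Lock per conflicting key, with a pangeo-forge prefix:

from contextlib import contextmanager

from distributed import Lock


@contextmanager
def lock_for_conflicts_sketch(conflicts, prefix="pangeo-forge"):
    # One distributed.Lock per conflicting key. acquire() blocks until the
    # scheduler grants the lock, which is where the worker call stack hangs.
    locks = [Lock(f"{prefix}-{c}") for c in conflicts]
    for lock in locks:
        lock.acquire()
    try:
        yield
    finally:
        for lock in locks:
            lock.release()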

@TomAugspurger

With debug logs:

pangeo_forge_recipes.recipes.xarray_zarr - DEBUG - Acquiring locks ['time-23']
pangeo_forge_recipes.utils - DEBUG - Acquiring lock pangeo-forge-time-23...
pangeo_forge_recipes.utils - DEBUG - Acquired lock pangeo-forge-time-23
pangeo_forge_recipes.recipes.xarray_zarr - INFO - Storing variable time chunk (23,) to Zarr region (slice(460, 480, None),)
pangeo_forge_recipes.utils - DEBUG - Released lock pangeo-forge-time-23
pangeo_forge_recipes.recipes.xarray_zarr - DEBUG - Acquiring locks ['sst-23']
pangeo_forge_recipes.utils - DEBUG - Acquiring lock pangeo-forge-sst-23...

So AFAICT, the lock sst-23 was released prior to that worker trying to acquire it.

@TomAugspurger commented Jun 9, 2021

Debugging further, the lock is in fact available (this is wrong: see #10 (comment)):

# shell 1
$ kubectl port-forward -n pangeo-forge dask-root-b4bde08a-3rdfkz 8786

# shell 2
$ python
Python 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import distributed
>>> client = distributed.Client("tcp://127.0.0.1:8786")
>>> lock = distributed.Lock("sst-23")
>>> lock.acquire(timeout=5)
True
>>> lock.release()
>>>

I'd love to see what's going on on that worker... I think kubernetes lets you ssh into pods.

@sharkinsspatial

@TomAugspurger As a note, all of the tests in pangeo-forge/pangeo-forge-recipes#151 were run using pangeo/pangeo-notebook:2021.05.04 as a base image with the following modifications.

xarray==0.18.0
pangeo-forge-recipes==0.3.4
prefect[aws, azure, google, kubernetes]==0.14.7

We did not experience any hanging-worker issues beyond those outlined in pangeo-forge/pangeo-forge-recipes#144, so I'm unsure whether this is Azure Blob Storage-related or perhaps due to a change on the pangeo-forge-recipes main branch. Do you have a branch which shows the mapper implementation you were using?

@CiaranEvans Can we investigate how to expose the external network interface for the worker pods and the key we'll need to set so Tom can ssh into a hung worker pod?

@TomAugspurger

I maybe got in with kubectl exec -it -n pangeo-forge dask-root-b4bde08a-32rpqk -- /bin/bash. The terminal was hanging, though.

I did another run and this time the lock actually seemed to be claimed. I'm seeing if we can get a bit better information into distributed's locks, to figure out who has what.

@ciaransweet

@TomAugspurger Had that pod died by the time you tried to connect?

It's not a case of Prefect/K8s killing it as soon as it errors, is it?

@TomAugspurger

Nope, it was still alive. The only pods I've seen killed are (I think) due to #11.

@ciaransweet

Hmm okay, I'm unsure how kubectl and getting onto a running container works, tbh. If it can't get a connection, does it hang or will it error?

@TomAugspurger

> I did another run and this time the lock actually seemed to be claimed. I'm seeing if we can get a bit better information into distributed's locks, to figure out who has what.

I believe I was mistaken when I initially said in #10 (comment) that the lock was actually able to be acquired. Pangeo-forge prepends pangeo-forge- to the names of the locks it acquires, so I should have acquired pangeo-forge-sst-23 rather than just sst-23.

My guess right now is that some other worker failed to release that lock. I'll try to confirm that.

@TomAugspurger commented Jun 10, 2021

I just did a run with no locks (by commenting out the locking in TomAugspurger/pangeo-forge@f22e73c) and it completed successfully. I think this recipe doesn't have any conflicting locks, but I need to confirm that. So a few things:

  1. Locks not being released might be a real problem. We should work with distributed to provide better diagnostics. In particular, I wonder if the LockExtension object could be updated to use the client.id of the client that acquires the lock, and lock acquisitions / releases should be logged to the /events dashboard. (A sketch of one way to peek at the scheduler's lock state is below.)
  2. I didn't observe any memory issues. This might be because we're using the latest versions of adlfs, dask, distributed, fsspec, and pangeo-forge-recipes. Or it might be because we're using adlfs rather than s3fs. @sharkinsspatial can you confirm: did you ever see memory issues with the Azure bakery (I think not, since you were hitting the adlfs / fsspec caching issue)? Sorry, I forgot I made the date_range smaller for debugging. I'll try with the full range now.
  3. pangeo-forge-recipes might be able to disable locking, if it knows that there aren't any conflicts across the entire recipe. I'll see if this is doable.
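For what it's worth, the scheduler's lock state can be inspected from a client. The snippet below is only a sketch: it assumes the LockExtension registers itself under the "locks" key of scheduler.extensions and keeps ids / events mappings, which matches my reading of the distributed source for these versions.

import distributed

client = distributed.Client("tcp://127.0.0.1:8786")


def dump_lock_state(dask_scheduler):
    # LockExtension keeps `ids` (lock name -> id of the current holder) and
    # `events` (lock name -> queue of waiters). The "locks" key is an assumption.
    ext = dask_scheduler.extensions["locks"]
    return {
        "held": dict(ext.ids),
        "waiting": {name: len(waiters) for name, waiters in ext.events.items()},
    }


print(client.run_on_scheduler(dump_lock_state))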

@sharkinsspatial

@TomAugspurger If you would like a simpler example to debug worker memory leaks you can use https://github.com/pangeo-forge/pangeo-forge-azure-bakery/blob/dask_memory_flow/flow_test/dask_memory_flow.py so that we can isolate this issue from any interactions with other dependencies.

@TomAugspurger reopened this Jun 10, 2021
@TomAugspurger

FYI @sharkinsspatial I tried that dask_memory_flow.py flow and did not see any issues. Notes below:

  - The scheduler pod is currently being killed for OOM (0d751e5, https://cloud.prefect.io/tom-w-augspurger-gmail-com-s-account/flow-run/50d52570-449c-4083-9fd3-cbc7efa0ea91).
  - The dask_memory_flow, which just maps a bunch of sleeps, looks pretty fine (https://cloud.prefect.io/flow-run/aab25c7f-093d-498a-9266-1d5f5560a0a4 / b20c361). Memory use on the scheduler is essentially flat. Small jumps (2-3 MB) when a group of workers is added. Small increase per "tick" of 0.08 MB. I wonder if that stabilizes.
  - Memory use climbed to ~380-385 MiB by the end.

So the summary is that Prefect seems to be doing OK with that number of tasks. I followed that up with a test that subclassed XarrayToZarr and overrode prepare_target and store_chunk to just time.sleep. The goal was to verify that it's the pangeo-forge recipe object causing issues with scheduler memory, rather than the actual execution itself. With that setup, the scheduler memory jumped to 4.5 GiB within 25s before being OOM-killed by Kubernetes.
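For reference, the sleeping subclass looks roughly like this (a sketch assuming the 0.x API, where prepare_target and store_chunk are properties returning the stage callables; not the exact code I ran):

import time

from pangeo_forge_recipes.recipes import XarrayZarrRecipe


class SleepingXarrayZarrRecipe(XarrayZarrRecipe):
    # Replace the real stages with sleeps so only graph construction and
    # serialization are exercised, with no actual I/O.
    @property
    def prepare_target(self):
        return lambda: time.sleep(1)

    @property
    def store_chunk(self):
        return lambda chunk_key: time.sleep(1)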

@sharkinsspatial

@TomAugspurger 👍 Thanks for the investigation. Can you report the distributed, dask and prefect versions on the worker image used for your dask_memory_flow test? I'll run another AWS test using this version combination so that we can verify if this worker memory issue is specific to the dask-cloudprovider ECSCluster.

Based on your subclass experiment it appears that recipe serialization may be the culprit in our scheduler memory woes. What are your thoughts on next steps? If you want to continue to coordinate with @rabernat on testing and potential solutions for this, I'll work with the Prefect team to try and diagnose our AWS worker memory issues this week.

@TomAugspurger commented Jun 15, 2021

I'll verify, but this was using commit b20c361, so the dask versions are:

dask==2021.6.0
dask-kubernetes==2021.3.1
distributed==2021.6.0

and prefect is at 0.14.19.

> What are your thoughts on next steps?

I think the next two options are one or both of:

  1. Try out New implementation of Dask & Prefect executors pangeo-forge-recipes#153 on the sleep flow. Should be pretty easy; I can do that next.
  2. Restructure the Recipe class to be a bit simpler in what it passes around. Basically the tasks would be top-level functions that just take some arguments, rather than passing the Recipe object around. This is more work (rough sketch below).
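To illustrate what I mean by option 2, the shape would be roughly the following (purely a sketch; StoreChunkConfig and its fields are made-up names, not from the actual refactor): stage functions become module-level functions that take small, explicit arguments, so pickling a task doesn't drag the whole Recipe object along.

from dataclasses import dataclass


@dataclass(frozen=True)
class StoreChunkConfig:
    # Only the pieces store_chunk actually needs, instead of the whole recipe.
    target_url: str
    inputs_per_chunk: int


def store_chunk(chunk_key, config: StoreChunkConfig):
    # Open the cached inputs for chunk_key, combine them, and write them to the
    # corresponding Zarr region under config.target_url.
    ...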

@TomAugspurger commented Jun 15, 2021

I started work on the two next steps mentioned in #10 (comment) on this branch: https://github.com/TomAugspurger/pangeo-forge/tree/refactor.

That builds on pangeo-forge/pangeo-forge-recipes#153, adding a to_dask() and a to_prefect(). It simplifies the objects that end up in a Prefect Task / task graph by essentially removing self from all the functions and instead passing the arguments around explicitly.

I'm doing a run right now (using the XarrayToZarr that just sleeps instead of writing). It's halfway through the store_chunk stage, and memory use on the scheduler is steady at 1.9 GB. Since it's running well, I think we're able to confirm that the XarrayToZarr objects were the ones causing the serialization issues.

@TomAugspurger

@sharkinsspatial I'm picking up this debugging a bit, to verify what fixes the memory issue. My plan is to run a recipe that uses a fixed FilePattern (avoiding the filepattern_from_sequence helper), with an input_cache that just sleeps (and doesn't actually cache anything):

import time

from pangeo_forge_recipes.storage import CacheFSSpecTarget


class SleepingInputCache(CacheFSSpecTarget):
    def cache_file(self, fname, **fsspec_open_kwargs):
        time.sleep(1)
        return
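Wiring that in would look roughly like the following (a sketch: the input_cache attribute and the fs / root_path constructor arguments follow my reading of the 0.x storage API, and the local filesystem stands in for blob storage here):

import fsspec

cache = SleepingInputCache(fs=fsspec.filesystem("file"), root_path="/tmp/fake-cache")
recipe.input_cache = cache  # `recipe` is the XarrayZarrRecipe built for the flow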

Then I'll run that with three versions of pangeo-forge-recipes:

  1. pangeo-forge-recipes master
  2. New implementation of Dask & Prefect executors pangeo-forge-recipes#153
  3. Refactor XarrayZarrRecipe to be serialization-friendly pangeo-forge-recipes#160

I'll post updates here.

@TomAugspurger

OK, here are the results. For each of these I built a docker image and submitted & ran the flow.

Test | Outcome | Prefect Link (probably not public) | Git Commit
--- | --- | --- | ---
pangeo-forge-recipes master | OOM-Killed scheduler at ~3GB after 60s | https://cloud.prefect.io/tom-w-augspurger-gmail-com-s-account/flow-run/36ab5e8a-7605-49a3-bf85-45c58b7e7374 | 11d605e
Ryan's PR | OOM-Killed scheduler at ~3GB after 60s | https://cloud.prefect.io/tom-w-augspurger-gmail-com-s-account/flow-run/242b7ad7-8932-4172-ac53-a567b605935d | 11d605e
Tom's PR | Stable after 15 minutes / ~20k tasks | https://cloud.prefect.io/tom-w-augspurger-gmail-com-s-account/flow-run/cff125b1-2318-4fc6-b797-81129b0eb441 | a0576f4

So, tl;dr: pangeo-forge/pangeo-forge-recipes#160 fixes the memory issues on the scheduler, and seems to be necessary.

A small note: workers are building up a bunch of unmanaged memory. That surprises me, since we're just sleeping. This might need more investigation down the road.
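One cheap way to keep an eye on that (a sketch; client is a distributed.Client connected to the scheduler, and psutil is assumed to be available on the workers, which it is in the standard Dask images):

import psutil

# RSS in MiB for every worker process, keyed by worker address.
rss_mib = client.run(lambda: psutil.Process().memory_info().rss / 2**20)
print(rss_mib)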

@rabernat

Thanks so much for doing this forensic work Tom! We will go with pangeo-forge/pangeo-forge-recipes#160.

@sharkinsspatial

Thanks @TomAugspurger. It appears that your PR will solve the scheduler memory growth issues associated with serialization 🎊. As you noted above, we are still seeing incremental memory growth on workers (even without actual activity), as originally noted here. This is problematic for several of our recipes, as the worker memory growth over a large number of task executions will result in eventual worker OOM failures (which we were seeing in our initial OISST testing). I'll continue tracking this here and touch base with the Prefect team again to see if they have made any progress on their investigations.

@tracetechnical

Hi All,

Hopefully the above will be addressed in #21 once it is reviewed and merged.
