This repository has been archived by the owner on Nov 21, 2023. It is now read-only.

Debugging memory issues #10

Open
TomAugspurger opened this issue Jun 9, 2021 · 20 comments

@TomAugspurger commented Jun 9, 2021

@sharkinsspatial dumping some notes below. Let me know if you want to jump on a call to discuss. I'm currently seeing workers hanging as they try to acquire a lock in pangeo-forge. I'll see what I can do to debug.

Summarizing the issues we're seeing with pangeo-forge.

** Changes to environment

Updated to latest released Dask, distributed, fsspec, adlfs. Installed pangeo-forge-recipes from GitHub.

** Cloud build

If you want, you can build the images on Azure, which avoids the upload. Not a huge benefit, though, if we have to download them to submit.

diff --git a/scripts/build_and_push_image.sh b/scripts/build_and_push_image.sh
index 36a8226..d6b7809 100755
--- a/scripts/build_and_push_image.sh
+++ b/scripts/build_and_push_image.sh
@@ -1,15 +1,15 @@
 #!/bin/bash
+set -eu
 
 REGISTRY_NAME=$(terraform -chdir="terraform" output -raw bakery_image_registry_name)
 LOCAL_TAG=pangeo-forge-azure-bakery-image
 REMOTE_TAG=$REGISTRY_NAME.azurecr.io/$LOCAL_TAG:latest
 
 az acr login --name $REGISTRY_NAME
+az acr build --registry $REGISTRY_NAME --image $LOCAL_TAG:latest ./images/
 
-docker build -t $LOCAL_TAG -f ./images/Dockerfile ./images/
-
-docker tag $LOCAL_TAG $REMOTE_TAG
-
-docker push $REMOTE_TAG
+# docker build -t $LOCAL_TAG -f ./images/Dockerfile ./images/
+# docker tag $LOCAL_TAG $REMOTE_TAG
+# docker push $REMOTE_TAG
 
 python3 ./scripts/replace_or_insert_value.py $REMOTE_TAG AZURE_BAKERY_IMAGE

** Access Dask Dashboard

One of the dask-root pods Prefect starts is the scheduler pod.

kubectl -n pangeo-forge port-forward `kubectl -n pangeo-forge get pod -l dask.org/component=scheduler -o name` 8786 8787

** Add logs to worker handler

This makes the logs accessible from the Dask UI. I'm sure there's a better way to do this.

import logging
from functools import wraps

from distributed import get_worker


def set_log_level(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        level = logging.INFO

        logger = logging.getLogger("pangeo_forge_recipes")
        logger.setLevel(level)
        handler = logging.StreamHandler()
        handler.setLevel(level)
        formatter = logging.Formatter("[%(asctime)s - %(levelname)s - %(filename)s:%(lineno)s - %(funcName)10s() ] %(message)s")
        handler.setFormatter(formatter)
        logger.addHandler(handler)

        try:
            worker = get_worker()
            worker._setup_logging(logger)
        except ValueError:
            pass

        result = func(*args, **kwargs)
        return result

    return wrapper
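As a rough illustration of how the decorator gets applied (the names below are assumptions, not the exact recipe code): wrap the stage function before it is mapped into the Prefect flow, so each task configures worker-side logging when it first runs.

@set_log_level
def store_chunk_task(chunk_key):
    # `recipe` is the XarrayZarrRecipe instance built elsewhere in the flow
    # module (hypothetical wiring); the wrapper only adds the logging setup.
    recipe.store_chunk(chunk_key)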

** Hanging Workers

I'm seeing some hanging, but perhaps different from what others saw. The worker logs say

pangeo_forge_recipes.storage - INFO - Opening 'https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr/201909/oisst-avhrr-v02r01.20190920.nc' from cache
pangeo_forge_recipes.recipes.xarray_zarr - INFO - Combining inputs for chunk '(0,)'
pangeo_forge_recipes.recipes.xarray_zarr - INFO - Storing variable time chunk (0,) to Zarr region (slice(0, 20, None),)
pangeo_forge_recipes.recipes.xarray_zarr - INFO - Storing variable anom chunk (0,) to Zarr region (slice(0, 20, None), slice(None, None, None), slice(None, None, None), slice(None, None, None))
pangeo_forge_recipes.recipes.xarray_zarr - INFO - Storing variable err chunk (0,) to Zarr region (slice(0, 20, None), slice(None, None, None), slice(None, None, None), slice(None, None, None))
pangeo_forge_recipes.recipes.xarray_zarr - INFO - Storing variable ice chunk (0,) to Zarr region (slice(0, 20, None), slice(None, None, None), slice(None, None, None), slice(None, None, None))

Looking at the call stack

Call Stacks: tcp://10.244.0.164:44331
Worker: tcp://10.244.0.164:44331
Key: MappedTaskWrapper-0-0e2e432e5be94e1797ddf1cdda29b9fd

File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
  self._bootstrap_inner()
File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
  self.run()
File "/usr/local/lib/python3.8/threading.py", line 870, in run
  self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.8/site-packages/distributed/threadpoolexecutor.py", line 55, in _worker
  task.run()
File "/usr/local/lib/python3.8/site-packages/distributed/_concurrent_futures_thread.py", line 66, in run
  result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.8/site-packages/distributed/worker.py", line 3722, in apply_function
  result = function(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/prefect/engine/flow_runner.py", line 775, in run_task
  return task_runner.run(
File "/usr/local/lib/python3.8/site-packages/prefect/utilities/executors.py", line 425, in wrapper
  return func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/prefect/engine/cloud/task_runner.py", line 354, in run
  end_state = super().run(
File "/usr/local/lib/python3.8/site-packages/prefect/utilities/executors.py", line 425, in wrapper
  return func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 298, in run
  state = self.get_task_run_state(state, inputs=task_inputs)
File "/usr/local/lib/python3.8/site-packages/prefect/utilities/executors.py", line 71, in inner
  return runner_method(self, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/prefect/engine/runner.py", line 48, in inner
  new_state = method(self, state, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 865, in get_task_run_state
  value = prefect.utilities.executors.run_task_with_timeout(
File "/usr/local/lib/python3.8/site-packages/prefect/utilities/executors.py", line 323, in run_task_with_timeout
  return task.run(*args, **kwargs)  # type: ignore
File "/oisst_recipe.py", line 39, in wrapper
File "/usr/local/lib/python3.8/site-packages/rechunker/executors/prefect.py", line 30, in run
  return self.stage.func(key)
File "/usr/local/lib/python3.8/site-packages/pangeo_forge_recipes/recipes/xarray_zarr.py", line 325, in store_chunk
  with lock_for_conflicts(lock_keys):
File "/usr/local/lib/python3.8/contextlib.py", line 113, in __enter__
  return next(self.gen)
File "/usr/local/lib/python3.8/site-packages/pangeo_forge_recipes/utils.py", line 101, in lock_for_conflicts
  lock.acquire()
File "/usr/local/lib/python3.8/site-packages/distributed/lock.py", line 135, in acquire
  result = self.client.sync(
File "/usr/local/lib/python3.8/site-packages/distributed/client.py", line 853, in sync
  return sync(
File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 351, in sync
  e.wait(10)
File "/usr/local/lib/python3.8/threading.py", line 558, in wait
  signaled = self._cond.wait(timeout)
File "/usr/local/lib/python3.8/threading.py", line 306, in wait
  gotit = waiter.acquire(True, timeout)

So it seems the issue is in lock_for_conflicts: either a real deadlock, or an event loop issue.
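For context, the locking pattern involved here, reconstructed from the traceback above and the debug logs in the next comment rather than from the exact pangeo-forge-recipes source, is roughly a context manager that takes one distributed.Lock per conflicting key, with a pangeo-forge prefix:

from contextlib import contextmanager

from distributed import Lock


@contextmanager
def lock_for_conflicts_sketch(conflicts, prefix="pangeo-forge"):
    # One distributed.Lock per conflicting key. acquire() blocks until the
    # scheduler grants the lock, which is where the worker call stack hangs.
    locks = [Lock(f"{prefix}-{c}") for c in conflicts]
    for lock in locks:
        lock.acquire()
    try:
        yield
    finally:
        for lock in locks:
            lock.release()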

@TomAugspurger

With debug logs:

pangeo_forge_recipes.recipes.xarray_zarr - DEBUG - Acquiring locks ['time-23']
pangeo_forge_recipes.utils - DEBUG - Acquiring lock pangeo-forge-time-23...
pangeo_forge_recipes.utils - DEBUG - Acquired lock pangeo-forge-time-23
pangeo_forge_recipes.recipes.xarray_zarr - INFO - Storing variable time chunk (23,) to Zarr region (slice(460, 480, None),)
pangeo_forge_recipes.utils - DEBUG - Released lock pangeo-forge-time-23
pangeo_forge_recipes.recipes.xarray_zarr - DEBUG - Acquiring locks ['sst-23']
pangeo_forge_recipes.utils - DEBUG - Acquiring lock pangeo-forge-sst-23...

So AFAICT, the lock sst-23 was released prior to that worker trying to acquire it.

@TomAugspurger commented Jun 9, 2021

Debugging further, the lock is in fact available (this is wrong: see #10 (comment)):

# shell 1
$ kubectl port-forward -n pangeo-forge dask-root-b4bde08a-3rdfkz 8786

# shell 2
$ python
Python 3.8.10 | packaged by conda-forge | (default, May 11 2021, 07:01:05)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import distributed
>>> client = distributed.Client("tcp://127.0.0.1:8786")
>>> lock = distributed.Lock("sst-23")
>>> lock.acquire(timeout=5)
True
>>> lock.release()
>>>

I'd love to see what's going on on that worker... I think kubernetes lets you ssh into pods.

@sharkinsspatial

@TomAugspurger As a note, all of the tests in pangeo-forge/pangeo-forge-recipes#151 were run using pangeo/pangeo-notebook:2021.05.04 as a base image with the following modifications.

xarray==0.18.0
pangeo-forge-recipes==0.3.4
prefect[aws, azure, google, kubernetes]==0.14.7

We did not experience any hanging-worker issues beyond those outlined in pangeo-forge/pangeo-forge-recipes#144, so I'm unsure whether this is Azure Blob Storage-related or perhaps due to a change on the pangeo-forge-recipes main branch. Do you have a branch which shows the mapper implementation you were using?

@CiaranEvans Can we investigate how to expose the external network interface for the worker pods and the key we'll need to set so Tom can ssh into a hung worker pod?

@TomAugspurger

I maybe got in with kubectl exec -it -n pangeo-forge dask-root-b4bde08a-32rpqk -- /bin/bash. The terminal was hanging, though.

I did another run and this time the lock actually seemed to be claimed. I'm seeing if we can get a bit better information into distributed's locks, to figure out who has what.

@ciaransweet

@TomAugspurger Had that pod died by the time you tried to connect?

It's not a case of Prefect/K8s killing it as soon as it errors, is it?

@TomAugspurger

Nope, it was still alive. The only pods I've seen killed are (I think) due to #11.

@ciaransweet

Hmm okay, I'm unsure how kubectl and getting onto a running container works, tbh. If it can't get a connection, does it hang or will it error?

@TomAugspurger

> I did another run and this time the lock actually seemed to be claimed. I'm seeing if we can get a bit better information into distributed's locks, to figure out who has what.

I believe I was mistaken when I initially said in #10 (comment) that the lock was actually able to be acquired. Pangeo-forge prepends pangeo-forge- to the names of the locks it acquires, so I should have acquired pangeo-forge-sst-23 rather than just sst-23.

My guess right now is that some other worker failed to release that lock. I'll try to confirm that.

@TomAugspurger commented Jun 10, 2021

I just did a run with no locks (by commenting out the locking in TomAugspurger/pangeo-forge@f22e73c) and it completed successfully. I think this recipe doesn't have any conflicting locks, but I need to confirm that. So a few things:

  1. Locks not being released might be a real problem. We should work with distributed to provide better diagnostics. In particular, I wonder if the LockExtension object could be updated to use the client.id of the client that acquires the lock, and lock acquisitions / releases should be logged to the /events dashboard. (A sketch of one way to peek at the scheduler's lock state is below.)
  2. I didn't observe any memory issues. This might be because we're using the latest versions of adlfs, dask, distributed, fsspec, and pangeo-forge-recipes. Or it might be because we're using adlfs rather than s3fs. @sharkinsspatial can you confirm: did you ever see memory issues with the Azure bakery (I think not, since you were hitting the adlfs / fsspec caching issue)? Sorry, I forgot I made the date_range smaller for debugging. I'll try with the full range now.
  3. pangeo-forge-recipes might be able to disable locking, if it knows that there aren't any conflicts across the entire recipe. I'll see if this is doable.
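For what it's worth, the scheduler's lock state can be inspected from a client. The snippet below is only a sketch: it assumes the LockExtension registers itself under the "locks" key of scheduler.extensions and keeps ids / events mappings, which matches my reading of the distributed source for these versions.

import distributed

client = distributed.Client("tcp://127.0.0.1:8786")


def dump_lock_state(dask_scheduler):
    # LockExtension keeps `ids` (lock name -> id of the current holder) and
    # `events` (lock name -> queue of waiters). The "locks" key is an assumption.
    ext = dask_scheduler.extensions["locks"]
    return {
        "held": dict(ext.ids),
        "waiting": {name: len(waiters) for name, waiters in ext.events.items()},
    }


print(client.run_on_scheduler(dump_lock_state))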

@sharkinsspatial

@TomAugspurger If you would like a simpler example to debug worker memory leaks you can use https://github.com/pangeo-forge/pangeo-forge-azure-bakery/blob/dask_memory_flow/flow_test/dask_memory_flow.py so that we can isolate this issue from any interactions with other dependencies.

@TomAugspurger reopened this Jun 10, 2021
@TomAugspurger

FYI @sharkinsspatial I tried that dask_memory_flow.py flow and did not see any issues. Notes below:

  - The scheduler pod is currently being killed for OOM (0d751e5, https://cloud.prefect.io/tom-w-augspurger-gmail-com-s-account/flow-run/50d52570-449c-4083-9fd3-cbc7efa0ea91).
  - The dask_memory_flow, which just maps a bunch of sleeps, looks pretty fine (https://cloud.prefect.io/flow-run/aab25c7f-093d-498a-9266-1d5f5560a0a4 / b20c361). Memory use on the scheduler is essentially flat. Small jumps (2-3 MB) when a group of workers is added. Small increase per "tick" of 0.08 MB. I wonder if that stabilizes.
  - Memory use climbed to ~380-385 MiB by the end.

So the summary is that Prefect seems to be doing OK with that number of tasks. I followed that up with a test that subclassed XarrayToZarr and overrode prepare_target and store_chunk to just time.sleep. The goal was to verify that it's the pangeo-forge recipe object causing issues with scheduler memory, rather than the actual execution itself. With that setup, the scheduler memory jumped to 4.5 GiB within 25s before being OOM-killed by Kubernetes.
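For reference, the sleeping subclass looks roughly like this (a sketch assuming the 0.x API, where prepare_target and store_chunk are properties returning the stage callables; not the exact code I ran):

import time

from pangeo_forge_recipes.recipes import XarrayZarrRecipe


class SleepingXarrayZarrRecipe(XarrayZarrRecipe):
    # Replace the real stages with sleeps so only graph construction and
    # serialization are exercised, with no actual I/O.
    @property
    def prepare_target(self):
        return lambda: time.sleep(1)

    @property
    def store_chunk(self):
        return lambda chunk_key: time.sleep(1)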

@sharkinsspatial

@TomAugspurger 👍 Thanks for the investigation. Can you report the distributed, dask and prefect versions on the worker image used for your dask_memory_flow test? I'll run another AWS test using this version combination so that we can verify if this worker memory issue is specific to the dask-cloudprovider ECSCluster.

Based on your subclass experiment it appears that recipe serialization may be the culprit in our scheduler memory woes. What are your thoughts on next steps? If you want to continue to coordinate with @rabernat on testing and potential solutions for this, I'll work with the Prefect team to try and diagnose our AWS worker memory issues this week.

@TomAugspurger commented Jun 15, 2021

I'll verify, but this was using commit b20c361, so the dask versions are:

dask==2021.6.0
dask-kubernetes==2021.3.1
distributed==2021.6.0

and prefect is at 0.14.19.

> What are your thoughts on next steps?

I think the next two options are one or both of:

  1. Try out New implementation of Dask & Prefect executors pangeo-forge-recipes#153 on the sleep flow. Should be pretty easy; I can do that next.
  2. Restructure the Recipe class to be a bit simpler in what it passes around. Basically the tasks would be top-level functions that just take some arguments, rather than passing the Recipe object around. This is more work (rough sketch below).
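To illustrate what I mean by option 2, the shape would be roughly the following (purely a sketch; StoreChunkConfig and its fields are made-up names, not from the actual refactor): stage functions become module-level functions that take small, explicit arguments, so pickling a task doesn't drag the whole Recipe object along.

from dataclasses import dataclass


@dataclass(frozen=True)
class StoreChunkConfig:
    # Only the pieces store_chunk actually needs, instead of the whole recipe.
    target_url: str
    inputs_per_chunk: int


def store_chunk(chunk_key, config: StoreChunkConfig):
    # Open the cached inputs for chunk_key, combine them, and write them to the
    # corresponding Zarr region under config.target_url.
    ...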

@TomAugspurger commented Jun 15, 2021

I started work on the two next steps mentioned in #10 (comment) on this branch: https://github.com/TomAugspurger/pangeo-forge/tree/refactor.

That builds on pangeo-forge/pangeo-forge-recipes#153, adding a to_dask() and a to_prefect(). It simplifies the objects that end up in a Prefect Task / task graph by essentially removing self from all the functions and instead passing the arguments around explicitly.

I'm doing a run right now (using the XarrayToZarr that just sleeps instead of writing). It's halfway through the store_chunk stage, and memory use on the scheduler is steady at 1.9 GB. Since it's running well, I think we're able to confirm that the XarrayToZarr objects were the ones causing the serialization issues.

@TomAugspurger

@sharkinsspatial I'm picking up this debugging a bit, to verify what fixes the memory issue. My plan is to run a recipe that uses a fixed FilePattern (avoiding the filepattern_from_sequence helper), with an input_cache that just sleeps (and doesn't actually cache anything):

import time

from pangeo_forge_recipes.storage import CacheFSSpecTarget


class SleepingInputCache(CacheFSSpecTarget):
    def cache_file(self, fname, **fsspec_open_kwargs):
        time.sleep(1)
        return
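Wiring that in would look roughly like the following (a sketch: the input_cache attribute and the fs / root_path constructor arguments follow my reading of the 0.x storage API, and the local filesystem stands in for blob storage here):

import fsspec

cache = SleepingInputCache(fs=fsspec.filesystem("file"), root_path="/tmp/fake-cache")
recipe.input_cache = cache  # `recipe` is the XarrayZarrRecipe built for the flow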

Then I'll run that with three versions of pangeo-forge-recipes:

  1. pangeo-forge-recipes master
  2. New implementation of Dask & Prefect executors pangeo-forge-recipes#153
  3. Refactor XarrayZarrRecipe to be serialization-friendly pangeo-forge-recipes#160

I'll post updates here.

@TomAugspurger

OK, here are the results. For each of these I built a docker image and submitted & ran the flow.

Test | Outcome | Prefect Link (probably not public) | Git Commit
--- | --- | --- | ---
pangeo-forge-recipes master | OOM-Killed scheduler at ~3GB after 60s | https://cloud.prefect.io/tom-w-augspurger-gmail-com-s-account/flow-run/36ab5e8a-7605-49a3-bf85-45c58b7e7374 | 11d605e
Ryan's PR | OOM-Killed scheduler at ~3GB after 60s | https://cloud.prefect.io/tom-w-augspurger-gmail-com-s-account/flow-run/242b7ad7-8932-4172-ac53-a567b605935d | 11d605e
Tom's PR | Stable after 15 minutes / ~20k tasks | https://cloud.prefect.io/tom-w-augspurger-gmail-com-s-account/flow-run/cff125b1-2318-4fc6-b797-81129b0eb441 | a0576f4

So, tl;dr: pangeo-forge/pangeo-forge-recipes#160 fixes the memory issues on the scheduler, and seems to be necessary.

A small note: workers are building up a bunch of unmanaged memory. That surprises me, since we're just sleeping. This might need more investigation down the road.
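One cheap way to keep an eye on that (a sketch; client is a distributed.Client connected to the scheduler, and psutil is assumed to be available on the workers, which it is in the standard Dask images):

import psutil

# RSS in MiB for every worker process, keyed by worker address.
rss_mib = client.run(lambda: psutil.Process().memory_info().rss / 2**20)
print(rss_mib)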

@rabernat

Thanks so much for doing this forensic work Tom! We will go with pangeo-forge/pangeo-forge-recipes#160.

@sharkinsspatial

Thanks @TomAugspurger. It appears that your PR will solve the scheduler memory growth issues associated with serialization 🎊. As you noted above, we are still seeing incremental memory growth on workers (even without actual activity), as originally noted here. This is problematic for several of our recipes, as the worker memory growth over a large number of task executions will result in eventual worker OOM failures (which we were seeing in our initial OISST testing). I'll continue tracking this here and touch base with the Prefect team again to see if they have made any progress on their investigations.

@tracetechnical

Hi All,

Hopefully the above will be addressed in #21 once it is reviewed and merged.
