You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
These ongoing issues have been discussed in several disparate threads pangeo-forge/pangeo-forge-azure-bakery#10 and pangeo-forge/gpm-imerge-hhr-feedstock#1 as well as others and several ongoing threads on the Prefect Slack channels but I am opening this issue to centralize tracking of these problems and improve visibility for the pangeo-forge team. To summarize, there appears to be an underlying issue with the Prefect DaskExecutorhttps://github.com/PrefectHQ/prefect/blob/master/src/prefect/executors/dask.py which causes Prefect tasks to consume a large amount of memory after serialization and submission causing Dask scheduler OOM issues. To provide a basic example
from prefect import Flow, task
from time import sleep
@task
def map_fn(x):
sleep(0.25)
task_range = list(range(1, 400000))
with Flow('map_testing') as flow:
map_fn.map(task_range)
Results in the scheduler pod (which is configured with 10GB of memory) being killed almost instantly with an OOM error. Additionally we see worker unmanaged memory growing rapidly. As shown in the image above, unmanaged memory has inexplicably grown to 11GB after only 311 tasks have been processed.
In an attempt to verify that this is a Prefect specific issue and not an underlying issue with rapid Dask task submission, I ran a small experiment to approximate the task submission behavior used by Prefect's DaskExecutor.
from dask_kubernetes import KubeCluster, make_pod_spec
from distributed import Client
from time import sleep
import weakref
futures = weakref.WeakSet()
pod_spec = make_pod_spec(
image=worker_image,
memory_limit='4G',
memory_request='4G',
cpu_limit=1,
cpu_request=1,
env={}
)
cluster = KubeCluster(pod_spec)
cluster.adapt(minimum=4, maximum=10)
client = Client(cluster)
def map_fn(x):
sleep(0.25)
value_range = list(range(1, 400000))
for value in value_range:
future = client.submit(map_fn, value)
futures.add(future)
client.gather(futures)
I did not encounter the memory spikes demonstrated in the equivalent Prefect operation using the DaskExecutor. This operation completed in approx 60s on 4 workers with no memory pressure.
Currently, @kvnkho from the Prefect team is investigating this and has reproduced these issues using their own cluster infrastructure. I will try to post updates here as we get more information around this.
The text was updated successfully, but these errors were encountered:
These ongoing issues have been discussed in several disparate threads pangeo-forge/pangeo-forge-azure-bakery#10 and pangeo-forge/gpm-imerge-hhr-feedstock#1 as well as others and several ongoing threads on the Prefect Slack channels but I am opening this issue to centralize tracking of these problems and improve visibility for the
pangeo-forge
team. To summarize, there appears to be an underlying issue with the PrefectDaskExecutor
https://github.com/PrefectHQ/prefect/blob/master/src/prefect/executors/dask.py which causes Prefect tasks to consume a large amount of memory after serialization and submission causing Dask scheduler OOM issues. To provide a basic exampleResults in the scheduler pod (which is configured with 10GB of memory) being killed almost instantly with an OOM error. Additionally we see worker unmanaged memory growing rapidly. As shown in the image above, unmanaged memory has inexplicably grown to 11GB after only 311 tasks have been processed.
In an attempt to verify that this is a Prefect specific issue and not an underlying issue with rapid Dask task submission, I ran a small experiment to approximate the task submission behavior used by Prefect's
DaskExecutor
.I did not encounter the memory spikes demonstrated in the equivalent Prefect operation using the
DaskExecutor
. This operation completed in approx 60s on 4 workers with no memory pressure.Currently, @kvnkho from the Prefect team is investigating this and has reproduced these issues using their own cluster infrastructure. I will try to post updates here as we get more information around this.
The text was updated successfully, but these errors were encountered: