dagster-webserver memory leak #18997

Open
aaaaahaaaaa opened this issue Jan 3, 2024 · 38 comments
Labels
type: bug Something isn't working

Comments

@aaaaahaaaaa

Dagster version

1.5.13

What's the issue?

dagster-webserver 1.5.13 seems to have some kind of memory leak. Since we updated to that version, we can observe a steady increase in memory usage over the last couple of weeks.

  • The increase in memory usage correlates to the change of version, without any other change being introduced.
  • We observe the same behaviour on 2 different GKE clusters.
  • Reverting to 1.5.12 resolves the issue.

image
image

What did you expect to happen?

No response

How to reproduce?

No response

Deployment type

Dagster Helm chart

Deployment details

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

@aaaaahaaaaa aaaaahaaaaa added the type: bug Something isn't working label Jan 3, 2024
@alangenfeld
Member

alangenfeld commented Jan 3, 2024

I don't see any notable commits in 1.5.13 on initial inspection

Reverting to 1.5.12 resolves the issue.

How exactly did you do this? Can you report the Python environments in the two containers (pip list / pip freeze)? I'm trying to discern whether the leak could come from a dependency that also changed between the two container images.

@aaaaahaaaaa
Author

How exactly did you do this?

We changed the helm chart version. We literally just reverted the Renovate bot commit.

1.5.12

pip list

Package                     Version
--------------------------- ------------
alembic                     1.13.0
amqp                        5.2.0
aniso8601                   9.0.1
annotated-types             0.6.0
anyio                       4.1.0
async-timeout               4.0.3
azure-core                  1.29.5
azure-identity              1.15.0
azure-storage-blob          12.19.0
azure-storage-file-datalake 12.14.0
backoff                     2.2.1
billiard                    4.2.0
boto3                       1.33.12
botocore                    1.33.12
cachetools                  5.3.2
celery                      5.3.6
certifi                     2023.11.17
cffi                        1.16.0
charset-normalizer          3.3.2
click                       8.1.7
click-didyoumean            0.3.0
click-plugins               1.1.1
click-repl                  0.3.0
coloredlogs                 14.0
croniter                    2.0.1
cryptography                41.0.7
dagster                     1.5.12
dagster-aws                 0.21.12
dagster-azure               0.21.12
dagster-celery              0.21.12
dagster-celery-k8s          0.21.12
dagster-gcp                 0.21.12
dagster-graphql             1.5.12
dagster-k8s                 0.21.12
dagster-pandas              0.21.12
dagster-pipes               1.5.12
dagster-postgres            0.21.12
dagster-webserver           1.5.12
db-dtypes                   1.1.1
docstring-parser            0.15
exceptiongroup              1.2.0
flower                      2.0.1
fsspec                      2023.12.2
google-api-core             2.15.0
google-api-python-client    2.110.0
google-auth                 2.25.2
google-auth-httplib2        0.1.1
google-cloud-bigquery       3.13.0
google-cloud-core           2.4.1
google-cloud-storage        2.13.0
google-crc32c               1.5.0
google-resumable-media      2.6.0
googleapis-common-protos    1.62.0
gql                         3.4.1
graphene                    3.3
graphql-core                3.2.3
graphql-relay               3.2.0
greenlet                    3.0.2
grpcio                      1.60.0
grpcio-health-checking      1.60.0
grpcio-status               1.60.0
h11                         0.14.0
httplib2                    0.22.0
httptools                   0.6.1
humanfriendly               10.0
humanize                    4.9.0
idna                        3.6
isodate                     0.6.1
Jinja2                      3.1.2
jmespath                    1.0.1
kombu                       5.3.4
kubernetes                  28.1.0
Mako                        1.3.0
MarkupSafe                  2.1.3
msal                        1.26.0
msal-extensions             1.1.0
multidict                   6.0.4
numpy                       1.26.2
oauth2client                4.1.3
oauthlib                    3.2.2
packaging                   23.2
pandas                      2.1.4
pendulum                    2.1.2
pip                         23.0.1
portalocker                 2.8.2
prometheus-client           0.19.0
prompt-toolkit              3.0.41
proto-plus                  1.23.0
protobuf                    4.25.1
psycopg2-binary             2.9.9
pyarrow                     14.0.1
pyasn1                      0.5.1
pyasn1-modules              0.3.0
pycparser                   2.21
pydantic                    2.5.2
pydantic_core               2.14.5
PyJWT                       2.8.0
pyparsing                   3.1.1
python-dateutil             2.8.2
python-dotenv               1.0.0
pytz                        2023.3.post1
pytzdata                    2020.1
PyYAML                      6.0.1
redis                       5.0.1
requests                    2.31.0
requests-oauthlib           1.3.1
requests-toolbelt           0.10.1
rsa                         4.9
s3transfer                  0.8.2
setuptools                  65.5.1
six                         1.16.0
sniffio                     1.3.0
SQLAlchemy                  2.0.23
starlette                   0.33.0
tabulate                    0.9.0
tomli                       2.0.1
toposort                    1.10
tornado                     6.4
tqdm                        4.66.1
typing_extensions           4.9.0
tzdata                      2023.3
universal-pathlib           0.1.4
uritemplate                 4.1.1
urllib3                     1.26.18
uvicorn                     0.24.0.post1
uvloop                      0.19.0
vine                        5.1.0
watchdog                    3.0.0
watchfiles                  0.21.0
wcwidth                     0.2.12
websocket-client            1.7.0
websockets                  12.0
wheel                       0.42.0
yarl                        1.9.4

pip freeze

alembic==1.13.0
amqp==5.2.0
aniso8601==9.0.1
annotated-types==0.6.0
anyio==4.1.0
async-timeout==4.0.3
azure-core==1.29.5
azure-identity==1.15.0
azure-storage-blob==12.19.0
azure-storage-file-datalake==12.14.0
backoff==2.2.1
billiard==4.2.0
boto3==1.33.12
botocore==1.33.12
cachetools==5.3.2
celery==5.3.6
certifi==2023.11.17
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.3.0
coloredlogs==14.0
croniter==2.0.1
cryptography==41.0.7
dagster==1.5.12
dagster-aws==0.21.12
dagster-azure==0.21.12
dagster-celery==0.21.12
dagster-celery-k8s==0.21.12
dagster-gcp==0.21.12
dagster-graphql==1.5.12
dagster-k8s==0.21.12
dagster-pandas==0.21.12
dagster-pipes==1.5.12
dagster-postgres==0.21.12
dagster-webserver==1.5.12
db-dtypes==1.1.1
docstring-parser==0.15
exceptiongroup==1.2.0
flower==2.0.1
fsspec==2023.12.2
google-api-core==2.15.0
google-api-python-client==2.110.0
google-auth==2.25.2
google-auth-httplib2==0.1.1
google-cloud-bigquery==3.13.0
google-cloud-core==2.4.1
google-cloud-storage==2.13.0
google-crc32c==1.5.0
google-resumable-media==2.6.0
googleapis-common-protos==1.62.0
gql==3.4.1
graphene==3.3
graphql-core==3.2.3
graphql-relay==3.2.0
greenlet==3.0.2
grpcio==1.60.0
grpcio-health-checking==1.60.0
grpcio-status==1.60.0
h11==0.14.0
httplib2==0.22.0
httptools==0.6.1
humanfriendly==10.0
humanize==4.9.0
idna==3.6
isodate==0.6.1
Jinja2==3.1.2
jmespath==1.0.1
kombu==5.3.4
kubernetes==28.1.0
Mako==1.3.0
MarkupSafe==2.1.3
msal==1.26.0
msal-extensions==1.1.0
multidict==6.0.4
numpy==1.26.2
oauth2client==4.1.3
oauthlib==3.2.2
packaging==23.2
pandas==2.1.4
pendulum==2.1.2
portalocker==2.8.2
prometheus-client==0.19.0
prompt-toolkit==3.0.41
proto-plus==1.23.0
protobuf==4.25.1
psycopg2-binary==2.9.9
pyarrow==14.0.1
pyasn1==0.5.1
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==2.5.2
pydantic_core==2.14.5
PyJWT==2.8.0
pyparsing==3.1.1
python-dateutil==2.8.2
python-dotenv==1.0.0
pytz==2023.3.post1
pytzdata==2020.1
PyYAML==6.0.1
redis==5.0.1
requests==2.31.0
requests-oauthlib==1.3.1
requests-toolbelt==0.10.1
rsa==4.9
s3transfer==0.8.2
six==1.16.0
sniffio==1.3.0
SQLAlchemy==2.0.23
starlette==0.33.0
tabulate==0.9.0
tomli==2.0.1
toposort==1.10
tornado==6.4
tqdm==4.66.1
typing_extensions==4.9.0
tzdata==2023.3
universal-pathlib==0.1.4
uritemplate==4.1.1
urllib3==1.26.18
uvicorn==0.24.0.post1
uvloop==0.19.0
vine==5.1.0
watchdog==3.0.0
watchfiles==0.21.0
wcwidth==0.2.12
websocket-client==1.7.0
websockets==12.0
yarl==1.9.4

1.5.13

pip list

Package                     Version
--------------------------- ------------
alembic                     1.13.0
amqp                        5.2.0
aniso8601                   9.0.1
annotated-types             0.6.0
anyio                       4.1.0
async-timeout               4.0.3
azure-core                  1.29.5
azure-identity              1.15.0
azure-storage-blob          12.19.0
azure-storage-file-datalake 12.14.0
backoff                     2.2.1
billiard                    4.2.0
boto3                       1.34.0
botocore                    1.34.0
cachetools                  5.3.2
celery                      5.3.6
certifi                     2023.11.17
cffi                        1.16.0
charset-normalizer          3.3.2
click                       8.1.7
click-didyoumean            0.3.0
click-plugins               1.1.1
click-repl                  0.3.0
coloredlogs                 14.0
croniter                    2.0.1
cryptography                41.0.7
dagster                     1.5.13
dagster-aws                 0.21.13
dagster-azure               0.21.13
dagster-celery              0.21.13
dagster-celery-k8s          0.21.13
dagster-gcp                 0.21.13
dagster-graphql             1.5.13
dagster-k8s                 0.21.13
dagster-pandas              0.21.13
dagster-pipes               1.5.13
dagster-postgres            0.21.13
dagster-webserver           1.5.13
db-dtypes                   1.2.0
docstring-parser            0.15
exceptiongroup              1.2.0
flower                      2.0.1
fsspec                      2023.12.2
google-api-core             2.15.0
google-api-python-client    2.111.0
google-auth                 2.25.2
google-auth-httplib2        0.2.0
google-cloud-bigquery       3.14.1
google-cloud-core           2.4.1
google-cloud-storage        2.14.0
google-crc32c               1.5.0
google-resumable-media      2.7.0
googleapis-common-protos    1.62.0
gql                         3.4.1
graphene                    3.3
graphql-core                3.2.3
graphql-relay               3.2.0
greenlet                    3.0.2
grpcio                      1.60.0
grpcio-health-checking      1.60.0
h11                         0.14.0
httplib2                    0.22.0
httptools                   0.6.1
humanfriendly               10.0
humanize                    4.9.0
idna                        3.6
isodate                     0.6.1
Jinja2                      3.1.2
jmespath                    1.0.1
kombu                       5.3.4
kubernetes                  28.1.0
Mako                        1.3.0
MarkupSafe                  2.1.3
msal                        1.26.0
msal-extensions             1.1.0
multidict                   6.0.4
numpy                       1.26.2
oauth2client                4.1.3
oauthlib                    3.2.2
packaging                   23.2
pandas                      2.1.4
pendulum                    2.1.2
pip                         23.0.1
portalocker                 2.8.2
prometheus-client           0.19.0
prompt-toolkit              3.0.43
protobuf                    4.25.1
psycopg2-binary             2.9.9
pyarrow                     14.0.1
pyasn1                      0.5.1
pyasn1-modules              0.3.0
pycparser                   2.21
pydantic                    2.5.2
pydantic_core               2.14.5
PyJWT                       2.8.0
pyparsing                   3.1.1
python-dateutil             2.8.2
python-dotenv               1.0.0
pytz                        2023.3.post1
pytzdata                    2020.1
PyYAML                      6.0.1
redis                       5.0.1
requests                    2.31.0
requests-oauthlib           1.3.1
requests-toolbelt           0.10.1
rsa                         4.9
s3transfer                  0.9.0
setuptools                  65.5.1
six                         1.16.0
sniffio                     1.3.0
SQLAlchemy                  2.0.23
starlette                   0.33.0
tabulate                    0.9.0
tomli                       2.0.1
toposort                    1.10
tornado                     6.4
tqdm                        4.66.1
typing_extensions           4.9.0
tzdata                      2023.3
universal-pathlib           0.1.4
uritemplate                 4.1.1
urllib3                     1.26.18
uvicorn                     0.24.0.post1
uvloop                      0.19.0
vine                        5.1.0
watchdog                    3.0.0
watchfiles                  0.21.0
wcwidth                     0.2.12
websocket-client            1.7.0
websockets                  12.0
wheel                       0.42.0
yarl                        1.9.4

pip freeze

alembic==1.13.0
amqp==5.2.0
aniso8601==9.0.1
annotated-types==0.6.0
anyio==4.1.0
async-timeout==4.0.3
azure-core==1.29.5
azure-identity==1.15.0
azure-storage-blob==12.19.0
azure-storage-file-datalake==12.14.0
backoff==2.2.1
billiard==4.2.0
boto3==1.34.0
botocore==1.34.0
cachetools==5.3.2
celery==5.3.6
certifi==2023.11.17
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
click-didyoumean==0.3.0
click-plugins==1.1.1
click-repl==0.3.0
coloredlogs==14.0
croniter==2.0.1
cryptography==41.0.7
dagster==1.5.13
dagster-aws==0.21.13
dagster-azure==0.21.13
dagster-celery==0.21.13
dagster-celery-k8s==0.21.13
dagster-gcp==0.21.13
dagster-graphql==1.5.13
dagster-k8s==0.21.13
dagster-pandas==0.21.13
dagster-pipes==1.5.13
dagster-postgres==0.21.13
dagster-webserver==1.5.13
db-dtypes==1.2.0
docstring-parser==0.15
exceptiongroup==1.2.0
flower==2.0.1
fsspec==2023.12.2
google-api-core==2.15.0
google-api-python-client==2.111.0
google-auth==2.25.2
google-auth-httplib2==0.2.0
google-cloud-bigquery==3.14.1
google-cloud-core==2.4.1
google-cloud-storage==2.14.0
google-crc32c==1.5.0
google-resumable-media==2.7.0
googleapis-common-protos==1.62.0
gql==3.4.1
graphene==3.3
graphql-core==3.2.3
graphql-relay==3.2.0
greenlet==3.0.2
grpcio==1.60.0
grpcio-health-checking==1.60.0
h11==0.14.0
httplib2==0.22.0
httptools==0.6.1
humanfriendly==10.0
humanize==4.9.0
idna==3.6
isodate==0.6.1
Jinja2==3.1.2
jmespath==1.0.1
kombu==5.3.4
kubernetes==28.1.0
Mako==1.3.0
MarkupSafe==2.1.3
msal==1.26.0
msal-extensions==1.1.0
multidict==6.0.4
numpy==1.26.2
oauth2client==4.1.3
oauthlib==3.2.2
packaging==23.2
pandas==2.1.4
pendulum==2.1.2
portalocker==2.8.2
prometheus-client==0.19.0
prompt-toolkit==3.0.43
protobuf==4.25.1
psycopg2-binary==2.9.9
pyarrow==14.0.1
pyasn1==0.5.1
pyasn1-modules==0.3.0
pycparser==2.21
pydantic==2.5.2
pydantic_core==2.14.5
PyJWT==2.8.0
pyparsing==3.1.1
python-dateutil==2.8.2
python-dotenv==1.0.0
pytz==2023.3.post1
pytzdata==2020.1
PyYAML==6.0.1
redis==5.0.1
requests==2.31.0
requests-oauthlib==1.3.1
requests-toolbelt==0.10.1
rsa==4.9
s3transfer==0.9.0
six==1.16.0
sniffio==1.3.0
SQLAlchemy==2.0.23
starlette==0.33.0
tabulate==0.9.0
tomli==2.0.1
toposort==1.10
tornado==6.4
tqdm==4.66.1
typing_extensions==4.9.0
tzdata==2023.3
universal-pathlib==0.1.4
uritemplate==4.1.1
urllib3==1.26.18
uvicorn==0.24.0.post1
uvloop==0.19.0
vine==5.1.0
watchdog==3.0.0
watchfiles==0.21.0
wcwidth==0.2.12
websocket-client==1.7.0
websockets==12.0
yarl==1.9.4
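
Two `pip freeze` listings like the ones above can be compared mechanically rather than by eye. The following is a minimal sketch, not something posted in the thread: the function names `parse_freeze` and `diff_freeze` are hypothetical, and the `OLD` and `NEW` strings are short excerpts of the 1.5.12 and 1.5.13 listings, not the full outputs.

```python
def parse_freeze(text):
    """Map package name -> version for lines of the form name==version."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that is not a plain pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.lower()] = version
    return pins


def diff_freeze(old_text, new_text):
    """Return (changed, removed, added) between two pip freeze outputs."""
    old, new = parse_freeze(old_text), parse_freeze(new_text)
    changed = {
        name: (old[name], new[name])
        for name in old.keys() & new.keys()
        if old[name] != new[name]
    }
    removed = {name: old[name] for name in old.keys() - new.keys()}
    added = {name: new[name] for name in new.keys() - old.keys()}
    return changed, removed, added


# Short excerpts of the two listings above (assumption: in practice the
# full pip freeze output of each container would be used).
OLD = """\
anyio==4.1.0
boto3==1.33.12
grpcio==1.60.0
grpcio-status==1.60.0
proto-plus==1.23.0
"""
NEW = """\
anyio==4.1.0
boto3==1.34.0
grpcio==1.60.0
"""

if __name__ == "__main__":
    changed, removed, added = diff_freeze(OLD, NEW)
    for name, (before, after) in sorted(changed.items()):
        print(f"changed  {name}: {before} -> {after}")
    for name, version in sorted(removed.items()):
        print(f"removed  {name}=={version}")
    for name, version in sorted(added.items()):
        print(f"added    {name}=={version}")
```

Feeding it the complete output from each container would list every package whose version differs, plus any that appear in only one image.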

@alangenfeld
Member

Thanks for following up, not much interesting in the dependency changes.

I spent some time with memray looking for leaks and have so far not been able to turn anything up.

Do you have anything like automated recurring queries against the webserver?

@aaaaahaaaaa
Author

Do you have anything like automated recurring queries against the webserver?

Well only the readinessProbe from your chart.

Turns out we actually still observe the same behaviour after rolling back to 1.5.12. So it's not related to the new version. I'm puzzled now. I'll try to investigate further and close the issue.

@alangenfeld
Member

alangenfeld commented Jan 4, 2024

I've had luck using this tool to get a memory profile of a running process https://github.com/facebookarchive/memory-analyzer and this https://github.com/kmaork/madbg for interactive poking around at the active process. I believe these both need SYS_PTRACE capabilities given on the k8s pod spec.

Given that it's a webserver, it's also susceptible to the "type 3" leaks described here https://blog.nelhage.com/post/three-kinds-of-leaks/ (Python allocator arena fragmentation), but the very smooth gradient of your graphs makes me skeptical that's the cause without some sort of recurring large query causing the fragmentation.
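
For readers who cannot attach an external profiler, a rough in-process alternative is the standard library's `tracemalloc` module. This is a different technique from the tools linked above, shown only as a minimal sketch: it has to run inside the process being inspected, and it only sees allocations made through Python's own allocators, so a leak inside a native extension would likely not show up. The names `measure_growth`, `leaky_workload`, and `_retained` are hypothetical.

```python
import tracemalloc


def measure_growth(workload, limit=5):
    """Run workload() and return the source lines whose allocations grew most."""
    tracemalloc.start()
    try:
        before = tracemalloc.take_snapshot()
        workload()
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # compare_to() sorts by absolute size difference, largest first.
    return after.compare_to(before, "lineno")[:limit]


# Stands in for a cache or registry that is never cleared (illustrative only).
_retained = []


def leaky_workload():
    """Simulate a leak by keeping references to about 1 MiB of buffers."""
    for _ in range(1000):
        _retained.append(bytearray(1024))


if __name__ == "__main__":
    for stat in measure_growth(leaky_workload):
        print(stat)
```

Each returned entry is a `tracemalloc.StatisticDiff`, whose `size_diff` and `traceback` attributes point at the line responsible for the growth.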

@jvyoralek
Contributor

@aaaaahaaaaa did you find any reason why memory started growing? We have a similar issue and switching between versions didn't help yet - tried from 1.5.14 to 1.5.12.

The memory increase is quite noticeable, showing up even in daily granularity.

This issue seems to be isolated to the webserver component. Both the daemon and code servers are exhibiting stable memory usage. We are operating these as three separate containers within AWS ECS.

We have only one scheduled job active, and no sensors or auto-materialization so far. Assets are loaded from dbt.

SCR-20240119-iqbx

@aaaaahaaaaa
Author

@jvyoralek No, I didn't find the source of the problem, and the issue is still occurring for us as well. Unfortunately, I didn't have time to investigate further. I think there's clearly something up with the workload; we're not doing anything special either, aside from deploying the Helm chart.

@salazarm
Contributor

@alangenfeld found a memory leak that could be the cause of this, I'll let him comment but here is the PR that attempts to fix it #19298

@alangenfeld
Member

#19298 is a fix for a problem that manifests as very rapid unbounded memory growth resulting in process termination. I don't believe it's related to this slower memory growth.

@noam-jacobson

noam-jacobson commented Jan 25, 2024

I appear to have a similar problem after upgrading to 1.6. I run Dagster on AWS ECS using Fargate, so I don't believe my jobs are causing it, since the code runs on a separate task. Memory usage of both the daemon and Dagit/webserver services is slowly creeping up. The drops in the following chart are due to restarts. Before the upgrade to 1.6 on the 11th, this problem didn't exist.
image

@alangenfeld
Member

@noam-jacobson what version were you upgrading from?

@noam-jacobson

@noam-jacobson what version were you upgrading from?

I was on version 1.5.10

@jackwillisupside

@noam-jacobson We're having the same issue on ECS/Fargate on 1.5.7

@will-regal-voice

We are also having the same issue on 1.6.0, also ECS/Fargate

@gasgallo

gasgallo commented Feb 2, 2024

Same here in our k8s deployment cluster. Any clue?

@jackwillisupside

We think we might have solved it on our end. We didn't have a strict retention policy on logs set in our dagster.yml, and once we set it to the values below, our memory stopped growing:

retention:
  schedule:
    purge_after_days: 90 # sets retention policy for schedule ticks of all types
  sensor:
    purge_after_days:
      skipped: 7
      failure: 90
      success: 365

@aaaaahaaaaa aaaaahaaaaa changed the title dagster-webserver 1.5.13 memory leak dagster-webserver memory leak Feb 9, 2024
@gasgallo

gasgallo commented Feb 16, 2024

We think we might have solved it on our end. We didn't have a strict retention policy on logs set in our dagster.yml, and once we set it to the values below, our memory stopped growing:

retention:
  schedule:
    purge_after_days: 90 # sets retention policy for schedule ticks of all types
  sensor:
    purge_after_days:
      skipped: 7
      failure: 90
      success: 365

How did that impact your memory usage? Technically you'll still retain ticks for up to 365 days, thus you should not see a change in behavior in just a few days. Or did I miss something?

I've applied a similar setting on my deployment as well (way stricter than yours, for testing) and my memory is still going up, same as before.

@alexknorr

Same problem here on OpenShift with nearly the same packages (dagster 1.6.5), also with PostgreSQL and slim-buster images for both the daemon and dagster-webserver (separate pods).
Tried Python 3.10 and 3.11, and SQLAlchemy both <2.0 and >2.0; no luck so far, it crashes every 3-4 days.
Currently trying Python 3.12, dagster 1.6.6, and slim-bookworm; will know more in the next few days.

@stasharrofi

stasharrofi commented Feb 28, 2024

EDIT: We found out that the following is actually not working. The initial indication might have just been a fluke.

We were having this issue and I believe that we have found the root cause to be a bug in anyio which leaked processes. The bug was introduced in 4.1.0 and fixed in 4.3.0 (last week): agronholm/anyio#669

Dagster has a dependency on anyio through the following chain: dagit --> dagster-webserver --> starlette --> anyio and I believe that this issue started to appear for people whenever they rebuilt their Dagster image during the time that bug was present because a newer but buggy version of anyio would have been included in their docker image.

So, the solution could be to either explicitly require anyio >= 4.3.0 or to wait until people rebuild their docker images and automatically get the bug-fixed version.

@jvyoralek
Contributor

Has anyone had success with the solution recommended by @stasharrofi ?

We have made changes, but it appears that the memory usage is still increasing.

image

I see anyio 4.3.0 in the build log:

#12 1.757 Collecting dagster==1.6.6
#12 1.810   Downloading dagster-1.6.6-py3-none-any.whl (1.4 MB)
#12 1.852      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 36.1 MB/s eta 0:00:00
#12 2.037 Collecting dagster-aws==0.22.6
#12 2.042   Downloading dagster_aws-0.22.6-py3-none-any.whl (109 kB)
#12 2.048      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 109.8/109.8 kB 32.6 MB/s eta 0:00:00
#12 2.214 Collecting dagster-postgres==0.22.6
#12 2.219   Downloading dagster_postgres-0.22.6-py3-none-any.whl (20 kB)
#12 2.259 Collecting anyio==4.3.0
#12 2.263   Downloading anyio-4.3.0-py3-none-any.whl (85 kB)

@noam-jacobson

@jvyoralek It hasn't worked for me. Deployed the newest Dagster version 1.6.6 with anyio-4.3.0.

@stasharrofi

@jvyoralek : No, we found out that it's not working for us either. The initial indication that it was working was probably just a fluke.

@shivonchain
Contributor

Same issue here with an ECS deployment, packages and versions included below

image

dagster==1.6.10
dagster-graphql==1.6.10
dagster-webserver==1.6.10
dagster-postgres==0.22.10
dagster-docker==0.22.10

@jobicarter
Contributor

My team experienced this issue in an OSS ECS deployment after an upgrade from 1.5.9 -> 1.6.8. It impacted the dagit/webserver and daemon services, but not independent grpc/code location services. It presented as a slow leak that would increase memory utilization over a week or so until hitting critical thresholds / crashing the service, with 1gb memory allocated to services.
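
A slow, roughly linear leak like the one described above can be turned into a rough time-to-limit estimate by fitting a straight line to periodic memory samples. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the sample readings in the usage block are invented for illustration (they are not measurements from this thread), and real memory growth is not guaranteed to be linear.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) samples of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hours_until_limit(hours, rss_mib, limit_mib):
    """Estimate hours left after the last sample before the limit is reached.

    Returns None when the fitted trend is flat or decreasing.
    """
    slope, intercept = fit_line(hours, rss_mib)
    if slope <= 0:
        return None
    crossing = (limit_mib - intercept) / slope
    return max(0.0, crossing - hours[-1])


if __name__ == "__main__":
    # Invented readings: hours since restart and resident memory in MiB.
    sample_hours = [0, 24, 48, 72]
    sample_rss = [300, 400, 500, 600]
    remaining = hours_until_limit(sample_hours, sample_rss, limit_mib=1024)
    print(f"estimated hours until the 1 GiB limit: {remaining:.1f}")
```

Comparing the fitted slope before and after a change, such as a dependency pin, gives a quicker signal than waiting for the service to hit its limit.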

We "resolved" the issue in our environments by downgrading and pinning the grpcio python package to 1.57.0.

In incremental tests we downgraded our docker image base to the image version/sha we used for our 1.5.9 deployment, reverted dagster packages from 1.6.8 back to 1.5.9, and updated python from 3.10 -> 3.11. None of these changes resolved the memory leak.

Sharing this context as it supports root cause being related to an unpinned package dependency, and not necessarily an issue with the core dagster packages. It also ruled out interaction with OS libs/OS version causing the leak.

We selected grpcio 1.57.0 because it was the version of the dep that was solved for at the time when we originally deployed 1.5.9. It's possible a more recent version would work as well.
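
Because the workaround depends on the pin surviving every image rebuild, it can help to verify the installed version at container start. The following is a minimal sketch using the standard library's `importlib.metadata`; the function names and the `EXPECTED_PINS` mapping are hypothetical, and the only pin taken from this thread is grpcio 1.57.0.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def pin_mismatches(expected):
    """Return {package: (expected, installed)} for every pin that is not met."""
    mismatches = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


# The pin reported above; other entries would be assumptions of your own.
EXPECTED_PINS = {"grpcio": "1.57.0"}

if __name__ == "__main__":
    for package, (wanted, found) in pin_mismatches(EXPECTED_PINS).items():
        print(f"{package}: expected {wanted}, found {found}")
```

An empty result means the environment matches the pins; anything else suggests the resolver picked a different version during the build.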

@jvyoralek
Contributor

jvyoralek commented Apr 10, 2024

Thank you, @jobicarter, for the effective workaround. We deployed it yesterday, and although it's only been a short time, we're already seeing promising changes.

Tested with these versions:

dagster==1.7.0
dagster-webserver==1.7.0
dagster-graphql==1.7.0
dagster-aws==0.23.0
dagster-postgres==0.23.0
grpcio==1.57.0
image

@csomh

csomh commented Apr 18, 2024

I can confirm that downgrading grpcio to 1.57.0 stops the leak.

dagster==1.5.14
dagster-aws==0.21.14
dagster-azure==0.21.14
dagster-celery==0.21.14
dagster-celery-k8s==0.21.14
dagster-gcp==0.21.14
dagster-graphql==1.5.14
dagster-k8s==0.21.14
dagster-pandas==0.21.14
dagster-pipes==1.5.14
dagster-postgres==0.21.14
dagster-webserver==1.5.14
grpcio==1.57.0
grpcio-health-checking==1.57.0

We also did try to upgrade it to 1.62.1, but that didn't seem to work.

@G14rb

G14rb commented Apr 19, 2024

Thanks for the solution, I think this could be related to the dagster issue, grpc/grpc#36117

@p-y-t-h-e-c

p-y-t-h-e-c commented May 14, 2024

Hi all, we're having a similar issue with a Dagster Docker deployment on an Oracle VM. Unfortunately, downgrading grpcio to 1.57.0 hasn't resolved it. We're currently using the following setup for the Dagster image.
Screenshot 2024-05-14 134923
The VM now seems to reach an OOM state roughly every 8 hours.

@rensoostenbachBL

rensoostenbachBL commented Aug 2, 2024

We are running into the same issue on our Kubernetes cluster, having installed Dagster via the Helm chart.

Is the solution to downgrade grpcio for the dagster-webserver pod? In that case, we should build a custom Dockerfile that changes the dependencies and point to that Dockerfile in the Helm chart, right?

I don't understand why Dagster hasn't pinned the grpcio version themselves to prevent this issue from happening. It seems a little strange that they are expecting users to either live with the memory leak or manually fix the dependencies themselves.

@JanEgner

JanEgner commented Sep 3, 2024

Just to add my 2 cents: we're running dagster 1.7.16, dbt, and dagster-webserver all in one k8s pod.

image

I admit that it is somewhat inconclusive since some memory increase (but also a kind of garbage collection releasing much of the extra memory at a point) was visible before the last restart while using grpcio 1.57.0. Still, overall it looks way better than with grpcio 1.60.

It seems to be a workaround for now, but with at least two drawbacks (other than using an outdated component at all):

  • grpcio 1.57.0 does not support python 3.12
  • grpcio 1.57.0 has at least one known vulnerability (CVE-2024-7246) that might or might not affect you, depending on your setup.

@bolinzzz

We started noticing memory leaks in certain code locations after upgrading to Dagster 1.8. Could grpcio potentially be contributing to these leaks?

We're still investigating, but I’d like to rule out this possibility.

@babaMar

babaMar commented Nov 7, 2024

We're observing the same behavior deploying Dagster via Helm on K8s cluster. Building a custom image and downgrading grpcio seems like a step back to be honest.

@auguste-elax

Hi there 👋 I'm also experiencing what looks like memory leaks, specifically on the webserver and daemon (running dagster 1.7.7 on k8s with Helm). Has a solution been found apart from downgrading and pinning the grpcio version?

@gibsondan
Member

Hi all, apologies for the delay here. We're adding a grpcio pin to the provided Helm chart images in the first Dagster OSS release in the new year, and we've filed an issue on the grpc repo with a lead on the underlying memory issue that may have been introduced in grpcio 1.57.1.

@theelderbeever

Hi all, apologies for the delay here. We're adding a grpcio pin to the provided Helm chart images in the first Dagster OSS release in the new year, and we've filed an issue on the grpc repo with a lead on the underlying memory issue that may have been introduced in grpcio 1.57.1.

@gibsondan will you be backporting that pin to previous Docker image versions or only applying it to the latest ones? I.e., if I am on 1.7.3, will I have to upgrade to the latest to get the fix?

@gibsondan
Member

The current plan is for it to just be on the new version.

@alangenfeld
Member

Worth noting that you can run the system processes (webserver/daemon) at a higher Dagster version than what is in use in the code server i.e. you could upgrade the helm chart to 1.9.7 and keep the python environment with your definitions at 1.7.3.

@theelderbeever

Worth noting that you can run the system processes (webserver/daemon) at a higher Dagster version than what is in use in the code server i.e. you could upgrade the helm chart to 1.9.7 and keep the python environment with your definitions at 1.7.3.

@alangenfeld Appreciate the tip. I wasn't sure if that was possible, but that may be the solution I go with. I already had grpcio pinned at a non-broken version in my code server, so running the system processes on a newer version would be great.
