
Set limits for XCOM pod #28125

Merged: 2 commits merged into apache:main on Apr 5, 2023

Conversation

@dolfinus (Contributor) commented Dec 5, 2022

My k8s namespace has a ResourceQuota specified, and thus every container must be created with both resource requests and limits set.
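
For context, a quota of roughly this shape is enough to trigger the failure (a hedged sketch using the kubernetes Python client; the hard values are illustrative, only the quota name and namespace come from the error below):

from kubernetes.client import models as k8s

# Illustrative ResourceQuota: once a quota sets hard values for limits.cpu and
# limits.memory, the API server rejects any pod whose containers do not declare
# explicit limits for those resources (and likewise for requests.*).
quota = k8s.V1ResourceQuota(
    metadata=k8s.V1ObjectMeta(name="portalquota", namespace="onetl-demo-user-namespace"),
    spec=k8s.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "2",
            "requests.memory": "2Gi",
            "limits.cpu": "4",
            "limits.memory": "4Gi",
        }
    ),
)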

I've tried to run a simple DAG like this:

import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

from kubernetes.client import models as k8s

with DAG(
    dag_id="example_k8s_operator",
    schedule=None,
    start_date=datetime.datetime(2021, 1, 1),
    catchup=False,
    tags=["example"],
) as dag:

    task = KubernetesPodOperator(
        namespace="onetl-demo-user-namespace",
        name="hello-dry-run",
        image="debian",
        cmds=["bash", "-cx"],
        arguments=["echo", "10"],
        labels={"foo": "bar"},
        task_id="dry_run_demo",
        container_resources=k8s.V1ResourceRequirements(
            requests={"memory": "250M", "cpu": "100m"},
            limits={"memory": "250M", "cpu": "100m"},
        ),
        do_xcom_push=True,
    )

but got an exception:

Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.7/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 135, in run_pod_async
    body=sanitized_pod, namespace=pod.metadata.namespace, **kwargs
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 7356, in create_namespaced_pod
    return self.create_namespaced_pod_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 7469, in create_namespaced_pod_with_http_info
    collection_formats=collection_formats)
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 397, in request
    body=body)
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 281, in POST
    body=body)
  File "/home/airflow/.local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 234, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (403)
Reason: Forbidden

HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods \"hello-dry-run-8863acf43ec04fa0b293f81ca981d701\" is forbidden: failed quota: portalquota: must specify limits.cpu,limits.memory","reason":"Forbidden","details":{"name":"hello-dry-run-8863acf43ec04fa0b293f81ca981d701","kind":"pods"},"code":403}

Full pod spec from logs:

{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {
    "annotations": {},
    "labels": {
      "foo": "bar",
      "dag_id": "example_k8s_operator",
      "task_id": "dry_run_demo",
      "run_id": "manual__2022-12-05T151152.3955870000-b2ac559bc",
      "kubernetes_pod_operator": "True",
      "try_number": "1",
      "airflow_version": "2.4.1",
      "airflow_kpo_in_cluster": "True"
    },
    "name": "hello-dry-run-8863acf43ec04fa0b293f81ca981d701",
    "namespace": "onetl-demo-user-namespace"
  },
  "spec": {
    "affinity": {},
    "containers": [
      {
        "args": [
          "echo",
          "10"
        ],
        "command": [
          "bash",
          "-cx"
        ],
        "env": [],
        "envFrom": [],
        "image": "debian",
        "name": "base",
        "ports": [],
        "resources": {
          "limits": {
            "memory": "250M",
            "cpu": "100m"
          },
          "requests": {
            "memory": "250M",
            "cpu": "100m"
          }
        },
        "volumeMounts": [
          {
            "mountPath": "/airflow/xcom",
            "name": "xcom"
          }
        ]
      },
      {
        "command": [
          "sh",
          "-c",
          "trap \"exit 0\" INT; while true; do sleep 1; done;"
        ],
        "image": "alpine",
        "name": "airflow-xcom-sidecar",
        "resources": {
          "requests": {
            "cpu": "1m"
          }
        },
        "volumeMounts": [
          {
            "mountPath": "/airflow/xcom",
            "name": "xcom"
          }
        ]
      }
    ],
    "hostNetwork": false,
    "imagePullSecrets": [],
    "initContainers": [],
    "nodeSelector": {},
    "restartPolicy": "Never",
    "securityContext": {},
    "tolerations": [],
    "volumes": [
      {
        "emptyDir": {},
        "name": "xcom"
      }
    ]
  }
}

The container with the pod command has the same resources I set via KubernetesPodOperator.container_resources, but the XCom sidecar container is created without limits, causing this exception. As a result, I cannot get the XCom of any task created with this operator.

Here I've set the limits to 1m for CPU and 10Mi for memory. IMHO, the sleep 1 command does not need more resources; even 10Mi may be too much.
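
The idea of the change, roughly (a sketch, not the exact diff; the requests mirror the limits here only for illustration):

from kubernetes.client import models as k8s

# Sketch of fixed defaults for the xcom sidecar container so the pod also
# satisfies quotas that require explicit limits (the values were bumped later,
# see the discussion below).
XCOM_SIDECAR_RESOURCES = k8s.V1ResourceRequirements(
    requests={"cpu": "1m", "memory": "10Mi"},
    limits={"cpu": "1m", "memory": "10Mi"},
)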

I haven't added any options to change these values (like in #26766); it seems there is currently no way to change the command executed by this container.



@boring-cyborg bot added the provider:cncf-kubernetes and area:providers labels on Dec 5, 2022
@dolfinus force-pushed the main branch 2 times, most recently from 041ee6e to d4e3439 on December 5, 2022 20:09
@uranusjr (Member) commented Dec 6, 2022

Tests are failing.

@potiuk (Member) left a comment

How about fixing the tests @dolfinus ?

@dolfinus (Contributor, Author) commented:

Is there any way to get the failing pod's log or something like that? I don't understand why the container starts with one executor but not the others, and these executors change all the time.

@potiuk (Member) commented Dec 12, 2022

Sure, you can download the logs from the job: just click on "Summary" and you will be able to download the logs from the failed jobs. They are attached to the job as an artifact (a complete export of all logs from the failed tests).

Also, you can very easily reproduce what the tests do locally by following our documentation: https://github.com/apache/airflow/blob/main/TESTING.rst#typical-testing-pattern-for-kubernetes-tests is a step-by-step, wizard-like guide, and above it you will find an explanation of what happens in each step. Breeze also has k9s integration included, so you can explore the logs and everything else.

@potiuk (Member) commented Dec 12, 2022

[image attachment]

@dolfinus (Contributor, Author) commented:

The XCom sidecar container was killed because of OOM, even with memory limits of 10Mi and 20Mi. I didn't expect bash + sleep to require more than 1Mi of RAM. Increased the limit to 50Mi; fixed.

@dolfinus dolfinus marked this pull request as ready for review December 15, 2022 13:27
@dolfinus dolfinus requested review from potiuk and removed request for dstandish and jedcunningham December 29, 2022 18:58
@dolfinus (Contributor, Author) commented:

Any updates?

@potiuk (Member) commented Jan 15, 2023

Well, you removed two reviewers who likely would have been much better suited to review it (not sure why), so you limited your chances of getting it reviewed ...

@potiuk (Member) commented Jan 15, 2023

I am not sure we want to set arbitrary limits for the pods. @dstandish @jedcunningham WDYT?

@dolfinus (Contributor, Author) commented:

Well you removed two reviewers

There was a "Requested changes" status, and I clicked on it to indicate that the changes were made. I don't have permission to change the list of reviewers; maybe this is some kind of GitHub bug.

@potiuk (Member) commented Jan 15, 2023

Well you removed two reviewers

There was a "Requested changes" status, and I clicked on it to indicate that the changes were made. I don't have permission to change the list of reviewers; maybe this is some kind of GitHub bug.

Maybe. I've added them back.

@potiuk (Member) commented Feb 19, 2023

Now, there are some conflicts to resolve. It's been out for some time (and there is a backlog of issues/PRs, so this takes longer than usual) - can you please rebase and fix the conflicts?

I have not looked in detail, and we have not heard from @jedcunningham and @dstandish, but I think the numbers here are too magical, and forcing them on everyone is a bit too much just to solve your case @dolfinus. You can add a feature to optionally set resource limits for those pods, with the default being "not set" - that is the best way of handling it. I can hardly imagine we will merge a change that hard-codes the memory used for the XCom pod (and it's backwards incompatible too).
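
The "optional, default not set" approach could look roughly like this (a hedged sketch, not actual provider code; the helper name is made up):

from typing import Optional

from kubernetes.client import models as k8s

# Hypothetical helper: only build resources for the xcom sidecar when the user
# explicitly configured them, so the default behaviour stays exactly as today.
def sidecar_resources(configured: Optional[dict] = None) -> Optional[k8s.V1ResourceRequirements]:
    if configured is None:
        return None  # default "not set": nothing is forced on existing users
    return k8s.V1ResourceRequirements(
        requests=configured.get("requests"),
        limits=configured.get("limits"),
    )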

@dolfinus dolfinus marked this pull request as draft March 23, 2023 15:23
@dolfinus dolfinus closed this Mar 23, 2023
@dolfinus dolfinus reopened this Mar 23, 2023
@dolfinus force-pushed the main branch 2 times, most recently from 5f9a2f2 to 148ca36 on March 24, 2023 08:58
@dolfinus dolfinus marked this pull request as ready for review March 24, 2023 11:17
@dolfinus (Contributor, Author) commented Mar 24, 2023

Ok, got your point.

I've added an xcom_sidecar_container_resources key to the extras field of the k8s connection, just like xcom_sidecar_container_image. Users can set the resources there in JSON syntax, and they will be passed to the sidecar container.
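
For example, the connection extras could then look roughly like this (a sketch based on the description above; the values are illustrative, only the key names follow the PR):

import json

# Illustrative k8s connection extras: xcom_sidecar_container_resources carries
# the resources as a JSON document, next to the existing
# xcom_sidecar_container_image option.
extras = {
    "xcom_sidecar_container_image": "alpine",
    "xcom_sidecar_container_resources": json.dumps(
        {
            "requests": {"cpu": "1m", "memory": "10Mi"},
            "limits": {"cpu": "1m", "memory": "50Mi"},
        }
    ),
}
print(json.dumps(extras, indent=2))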

@potiuk (Member) left a comment

This is much better. Thank you :)

@potiuk potiuk merged commit dc4dd91 into apache:main Apr 5, 2023
dstandish added a commit to astronomer/airflow that referenced this pull request May 12, 2023
We probably should not be configuring xcom settings from the hook in this way, but for better or worse we've done it, first in apache#26766 and again in apache#28125.

The problem is that other operators derived from KPO, e.g. GKEStartPodOperator, may use a different hook that doesn't have these xcom-specific methods. So KPO needs to tolerate that.