Fix async KPO by waiting pod termination in `execute_complete` before cleanup #32467

hussein-awala · 2023-07-09T19:25:35Z

closes: #32458
related: #31348
related: #31335

This PR reverts #31348, which doesn't handle the case where do_xcom_push is True, and moves the waiting strategy to execute_complete so that the operator waits for pod termination before calling the cleanup method.
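
The ordering described above (wait for a terminal pod phase, then clean up) can be sketched roughly as follows. This is a standalone illustration, not the provider's code: `TERMINAL_PHASES`, `wait_for_termination`, `execute_complete_sketch`, and the `read_phase`/`cleanup` callables are all hypothetical names introduced for the example, and the pod is simulated by an iterator of phases.

```python
import time
from typing import Callable

# Hypothetical stand-in for the set of terminal pod phases (assumption, not the provider's class).
TERMINAL_PHASES = {"Succeeded", "Failed"}


def wait_for_termination(
    read_phase: Callable[[], str],
    poll_interval: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll read_phase until it reports a terminal phase, then return that phase."""
    phase = read_phase()
    while phase not in TERMINAL_PHASES:
        sleep(poll_interval)
        phase = read_phase()
    return phase


def execute_complete_sketch(
    read_phase: Callable[[], str],
    cleanup: Callable[[str], None],
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Illustrative ordering only: wait for termination first, call cleanup afterwards."""
    final_phase = wait_for_termination(read_phase, poll_interval=0.0, sleep=sleep)
    cleanup(final_phase)
    return final_phase


# Simulated pod that is still running for two polls before it succeeds.
phases = iter(["Running", "Running", "Succeeded"])
cleanup_calls = []
result = execute_complete_sketch(
    lambda: next(phases), cleanup_calls.append, sleep=lambda _: None
)
```

In this sketch, cleanup is invoked exactly once and only after a terminal phase has been observed, which is the property the PR description is after.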


This reverts commit 8f5de83. Signed-off-by: Hussein Awala <[email protected]>


eladkal

LGTM

apilaskowski · 2023-11-08T15:17:05Z

@hussein-awala Was this introduced back? Or is this termination problem still happening in newest versions?

hussein-awala · 2023-11-08T21:56:00Z

> Was this introduced back? Or is this termination problem still happening in newest versions?

The fix was released in cncf.kubernetes 7.3.0 (changelog). Are you facing the same problem in a newer version?

apilaskowski · 2023-11-09T09:09:44Z

I meant the opposite.
In 7.2.0 KPO works fine, but in 7.3.0 it does not.
We suspect that this is due to the removal of the following code:

if pod_status not in PodPhase.terminal_states:
    self.log.info(
        "Pod %s is still running. Sleeping for %s seconds.",
        self.pod_name,
        self.poll_interval,
    )
    await asyncio.sleep(self.poll_interval)

What do you think, can this be reintroduced? Or was this solved differently in a newer version?
Under a high load of KPO tasks, the tasks started to fail from Airflow's perspective, even though the pods themselves had started properly.
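
For readers unfamiliar with the pattern, the removed branch quoted above slept between status checks while the pod was not yet in a terminal phase. A minimal, self-contained sketch of that kind of polling loop is shown below. It is not the trigger's actual implementation: `TERMINAL_PHASES`, `poll_until_terminal`, and `make_fake_reader` are hypothetical names, and the pod status is simulated rather than read from Kubernetes.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, Tuple

# Hypothetical stand-in for PodPhase.terminal_states (assumption for illustration).
TERMINAL_PHASES = {"Succeeded", "Failed"}


async def poll_until_terminal(
    read_phase: Callable[[], Awaitable[str]],
    poll_interval: float,
) -> Tuple[str, int]:
    """Check the phase repeatedly, sleeping between checks until it is terminal.

    Returns the final phase and the number of sleeps that were needed.
    """
    sleeps = 0
    while True:
        phase = await read_phase()
        if phase in TERMINAL_PHASES:
            return phase, sleeps
        # Yield to the event loop instead of re-checking the status immediately.
        await asyncio.sleep(poll_interval)
        sleeps += 1


def make_fake_reader(phases: Iterable[str]) -> Callable[[], Awaitable[str]]:
    """Build a simulated status reader that replays the given phases in order."""
    iterator = iter(phases)

    async def read_phase() -> str:
        return next(iterator)

    return read_phase


final_phase, sleep_count = asyncio.run(
    poll_until_terminal(make_fake_reader(["Pending", "Running", "Succeeded"]), 0.0)
)
```

With a non-zero `poll_interval`, a loop like this spaces out status checks, which may matter when many pods are being watched at once.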

apilaskowski · 2023-11-09T09:23:15Z

In our case we didn't use do_xcom_push, so it wasn't a problem for us.
We were wondering whether the changes from #31348 could be reimplemented in a way that also handles do_xcom_push correctly.

hussein-awala added 2 commits July 9, 2023 21:21

Revert "Fix KubernetesPodTrigger waiting strategy (apache#31348)"

42b97a2

This reverts commit 8f5de83. Signed-off-by: Hussein Awala <[email protected]>

add wait pod termination before cleanup in execute_complete method

25d3018

Signed-off-by: Hussein Awala <[email protected]>

hussein-awala requested a review from jedcunningham as a code owner July 9, 2023 19:25

boring-cyborg bot added provider:cncf-kubernetes Kubernetes provider related issues area:providers provider:google Google (including GCP) related issues labels Jul 9, 2023

hussein-awala added the type:bug-fix Changelog: Bug Fixes label Jul 9, 2023

potiuk approved these changes Jul 9, 2023


eladkal approved these changes Jul 12, 2023


eladkal merged commit b3ce116 into apache:main Jul 12, 2023

eladkal mentioned this pull request Jul 12, 2023

Status of testing Providers that were prepared on July 12, 2023 #32568

Closed

20 tasks
