Fix async KPO by waiting pod termination in execute_complete before cleanup #32467

Merged
merged 2 commits into from
Jul 12, 2023

hussein-awala
Member

closes: #32458
related: #31348
related: #31335

This PR reverts #31348, which doesn't handle the case where do_xcom_push is True, and moves the waiting strategy to execute_complete so that the operator waits for pod termination before calling the cleanup method.
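
The ordering described above can be sketched as follows. This is a minimal illustration, not the provider's actual code: `wait_then_cleanup`, `get_phase`, `cleanup`, and `max_polls` are hypothetical names, and the set of terminal phases is an assumption based on the standard Kubernetes pod phases.

```python
import time
from typing import Callable

# Assumption for this sketch: a pod is finished once it is Succeeded or Failed.
TERMINAL_PHASES = {"Succeeded", "Failed"}


def wait_then_cleanup(
    get_phase: Callable[[], str],
    cleanup: Callable[[str], None],
    poll_interval: float = 0.0,
    max_polls: int = 100,
) -> str:
    """Poll until the pod reaches a terminal phase, then call cleanup once."""
    for _ in range(max_polls):
        phase = get_phase()
        if phase in TERMINAL_PHASES:
            # Cleanup only runs after the pod has terminated.
            cleanup(phase)
            return phase
        time.sleep(poll_interval)
    raise TimeoutError("pod did not reach a terminal phase")


# Usage with a fake pod that is still running for the first two polls.
phases = iter(["Running", "Running", "Succeeded"])
cleaned = []
final_phase = wait_then_cleanup(lambda: next(phases), cleaned.append)
```

In this sketch cleanup is never reached while the pod is still running, which is the property the description relies on when do_xcom_push is True.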



@boring-cyborg bot added the provider:cncf-kubernetes, area:providers, and provider:google labels on Jul 9, 2023
@hussein-awala added the type:bug-fix label on Jul 9, 2023
Contributor

@eladkal eladkal left a comment

LGTM

@apilaskowski
Contributor

@hussein-awala Was this reintroduced? Or is this termination problem still happening in the newest versions?

@hussein-awala
Member Author

> Was this reintroduced? Or is this termination problem still happening in the newest versions?

The fix was released in cncf.kubernetes 7.3.0 (changelog). Are you facing the same problem in a newer version?

@apilaskowski
Contributor

I meant the opposite.
In 7.2.0 KPO works fine, but in 7.3.0 it does not.
We suspect this is due to the removal of the following code:

if pod_status not in PodPhase.terminal_states:
    self.log.info(
        "Pod %s is still running. Sleeping for %s seconds.",
        self.pod_name,
        self.poll_interval,
    )
    await asyncio.sleep(self.poll_interval)

What do you think, can this be reintroduced? Or was this solved differently in a newer version?
Under a high load of KPO tasks, they started to fail from Airflow's perspective, even though the pods had started properly.
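
For context, the fragment quoted above reads like one step of a polling loop. A self-contained sketch of that pattern is shown below. It is an assumption about the surrounding loop, not the provider's trigger code: `wait_for_terminal_state`, `read_phase`, and `TERMINAL_STATES` are hypothetical stand-ins for the real objects.

```python
import asyncio
from typing import Awaitable, Callable

# Assumed stand-in for PodPhase.terminal_states in the quoted fragment.
TERMINAL_STATES = {"Succeeded", "Failed"}


async def wait_for_terminal_state(
    read_phase: Callable[[], Awaitable[str]],
    poll_interval: float,
) -> tuple[str, int]:
    """Return the terminal phase and the number of times the loop slept."""
    sleeps = 0
    while True:
        pod_status = await read_phase()
        if pod_status in TERMINAL_STATES:
            return pod_status, sleeps
        # Pod is not finished yet: yield to the event loop before polling again.
        await asyncio.sleep(poll_interval)
        sleeps += 1


async def _demo() -> tuple[str, int]:
    # Fake pod that is Pending, then Running, then Failed.
    phases = iter(["Pending", "Running", "Failed"])

    async def fake_read_phase() -> str:
        return next(phases)

    return await wait_for_terminal_state(fake_read_phase, poll_interval=0)


result = asyncio.run(_demo())
```

Because the loop awaits `asyncio.sleep` between polls, other coroutines on the same event loop keep running while a pod is still in progress.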

@apilaskowski
Contributor

In our case we didn't use do_xcom_push, so that wasn't a problem for us.
We were wondering whether it could be reimplemented so that the changes from #31348 are included properly.

Development

Successfully merging this pull request may close these issues.

Deferrable KPO - stuck with do_xcom_push=True