Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: implement function to identify if node is present in aws #5632

Conversation

vadasambar
Copy link
Member

@vadasambar vadasambar commented Mar 29, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

I think description of #5054 (comment) explains it well:

...original intent of determining the deleted nodes was incorrect, which led to the issues reported by other users. The nodes tainted with ToBeDeleted were misidentified as Deleted instead of Ready/Unready, which caused a miscalculation of the node being included as Upcoming. This caused problems described in #3949 and #4456.

#5054 exposed interface function HasInstance for cloud providers to implement. This function is used to determine if a node is deleted by querying the cloud provider. It falls back to the ToBeDeleted taint if the function is not implemented.

Which issue(s) this PR fixes:

Fixes #4456

Special notes for your reviewer:

Nothing in particular.

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Nothing in particular.

cc: @fookenc

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 29, 2023
@k8s-ci-robot k8s-ci-robot requested a review from drmorr0 March 29, 2023 05:55
}

// we don't care about the instance health
if _, ok := aws.awsManager.asgCache.instanceStatus[*awsRef]; !ok {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have instanceStatus field on asgCache to track health status by EC2 instance. I am re-using the instantStatus map to check if the node is present without caring for its health since we just want to check if the instance is present in the cloud provider.

instanceStatus is updated at

ref := m.buildInstanceRefFromAWS(instance)
newInstanceToAsgCache[ref] = asg
newAsgToInstancesCache[asg.AwsRef][i] = ref
newInstanceStatusMap[ref] = instance.HealthStatus

@vadasambar vadasambar marked this pull request as ready for review March 29, 2023 06:05
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 29, 2023
@k8s-ci-robot k8s-ci-robot requested a review from jaypipes March 29, 2023 06:05
}

// we don't care about the instance health
if _, ok := aws.awsManager.asgCache.instanceStatus[*awsRef]; !ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vadasambar please use the asgCache.InstanceStatus function instead of directly accessing (in a non-thread-safe fashion) the underlying map :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for noticing it! I have updated the code.


// we don't care about the status
status, err := aws.awsManager.asgCache.InstanceStatus(*awsRef)
if status != nil {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am checking status for nil because I think this is a more reliable way (as opposed to checking there was no error) to check if the node is present since InstanceStatus wouldn't return nil for a status unless some error occurs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. The fact that we return any error from HasInstance or asgCache.InstanceStatus is kind of annoying anyway. Really, these methods should just return bool IMHO.

return true, nil
}

return false, fmt.Errorf("node is not present in aws: %v", err)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this CA should print the line number of the error but I added an extra wrapper around the error just in case to identify the error came from HasInstance. Happy to change it based on feedback.

@vadasambar vadasambar requested review from jaypipes and removed request for drmorr0 and gjtempleton March 30, 2023 04:47
Copy link
Contributor

@jaypipes jaypipes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 thank you @vadasambar!


// we don't care about the status
status, err := aws.awsManager.asgCache.InstanceStatus(*awsRef)
if status != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. The fact that we return any error from HasInstance or asgCache.InstanceStatus is kind of annoying anyway. Really, these methods should just return bool IMHO.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jaypipes, vadasambar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 30, 2023
- this is a follow-up to kubernetes#5054
- this might fix kubernetes#4456

Signed-off-by: vadasambar <[email protected]>

fix: make `HasInstance` in aws provider thread-safe

Signed-off-by: vadasambar <[email protected]>
@vadasambar vadasambar force-pushed the fix/4456/identify-deleted-nodes-for-aws branch from 16695e9 to 1cb55fe Compare March 31, 2023 04:16
@vadasambar
Copy link
Member Author

+1 thank you @vadasambar!

Thank you for the review. I have squashed the commits into a single commit. I think I can't merge the PR by myself.

This is what I see:
image

@gjtempleton
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 10, 2023
@k8s-ci-robot k8s-ci-robot merged commit dfaadfa into kubernetes:master Apr 10, 2023
@zaafar
Copy link

zaafar commented Nov 7, 2023

Can this be back ported to 1.24 and 1.25?

EDIT: nevermind, looks like it was.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cluster-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CA fails to scale-up or cancel in-progress scale down when there are un-schedulable pods
5 participants