
Allow to control which subnets and IPs get used for the API loadbalancer #10741

Merged
merged 14 commits into from
Feb 14, 2021

Conversation

codablock
Contributor

This PR implements the new field subnets for the API load balancer in the cluster spec:

spec:
  api:
    loadBalancer:
      type: Internal
      subnets:
        - name: subnet-a
          privateIPv4Address: 172.16.1.10

It allows controlling which subnets are used by the API load balancer. This is useful when VPCs and subnets are managed externally and a more granular separation between subnets is desired, e.g. when you have subnets that offer VPN access to on-prem networks, subnets that offer public IPs, and subnets that are completely private.

In such cases, the subnets must be carefully chosen so as not to expose the API in the wrong place.

Review can be done on a per-commit basis as I've tried to separate changes.

Move checks for valid subnet operations into CheckChanges. This also fixes
a bug where changes would cause immutable-field errors even though it is
actually perfectly fine to add new subnets (only detaching is forbidden).

This commit also changes the actualSubnets and expectedSubnets lists to
be maps of *string. This is in preparation for the next commit, which
relies on them being maps.
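The subnet-change check described in the commit message above can be sketched roughly as follows. This is a simplified illustration, not the actual kops CheckChanges implementation; the names checkSubnetChanges, actual, and expected are hypothetical:

```go
package main

import "fmt"

// checkSubnetChanges compares actual vs. expected subnets as maps keyed by
// subnet ID (values are optional private IPv4 addresses). Adding new subnets
// is allowed; detaching an existing subnet is rejected, mirroring the AWS
// NLB limitation described in the PR.
func checkSubnetChanges(actual, expected map[string]*string) error {
	for id := range actual {
		if _, ok := expected[id]; !ok {
			return fmt.Errorf("cannot detach subnet %q: removing subnets from an NLB is not supported", id)
		}
	}
	// Subnets present only in expected are additions, which are fine.
	return nil
}

func main() {
	ip := "172.16.1.10"
	actual := map[string]*string{"subnet-a": nil}
	expected := map[string]*string{"subnet-a": nil, "subnet-b": &ip}

	// Adding subnet-b: no error, prints <nil>.
	fmt.Println(checkSubnetChanges(actual, expected))
	// Removing subnet-b (swapped arguments): prints the detach error.
	fmt.Println(checkSubnetChanges(expected, actual))
}
```

Using maps instead of slices makes the add/remove distinction a simple key-membership test, which is why the commit converts the lists first.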
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 5, 2021
@k8s-ci-robot
Contributor

Hi @codablock. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Feb 5, 2021
@k8s-ci-robot k8s-ci-robot added area/api area/documentation area/provider/aws Issues or PRs related to aws provider size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 5, 2021
@codablock
Contributor Author

FYI, I've tested this on the 1.19 branch and would be really happy if this could be backported/cherry-picked into 1.19 :)

SubnetMappings allow explicitly setting the private IPv4 address that
must be used for the NLB.

SubnetMappings and Subnets in the AWS API are compatible as long as the
address settings are not changed, making this commit backwards compatible.
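The conversion from the cluster-spec subnets to SubnetMappings can be sketched as below. This is a simplified illustration using a local stand-in struct; subnetMapping and buildSubnetMappings are hypothetical names, not the actual kops or AWS SDK code:

```go
package main

import "fmt"

// subnetMapping is a local, simplified stand-in for the AWS SDK's
// elbv2 SubnetMapping type (illustrative only).
type subnetMapping struct {
	SubnetID           string
	PrivateIPv4Address *string // nil: AWS picks the address, as with plain Subnets
}

// buildSubnetMappings turns the spec's subnets (name plus optional
// privateIPv4Address) into the mappings passed when creating the NLB.
// A mapping without an address behaves like the old Subnets-only call,
// which is what keeps the change backwards compatible.
func buildSubnetMappings(subnets map[string]*string) []subnetMapping {
	mappings := make([]subnetMapping, 0, len(subnets))
	for id, addr := range subnets {
		mappings = append(mappings, subnetMapping{SubnetID: id, PrivateIPv4Address: addr})
	}
	return mappings
}

func main() {
	ip := "172.16.1.10"
	for _, m := range buildSubnetMappings(map[string]*string{"subnet-a": &ip, "subnet-b": nil}) {
		if m.PrivateIPv4Address != nil {
			fmt.Println(m.SubnetID, *m.PrivateIPv4Address)
		} else {
			fmt.Println(m.SubnetID, "(auto-assigned)")
		}
	}
}
```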
Comment on lines +124 to +125
It is only allowed to add more subnets and forbidden to remove existing ones. This is due to limitations on AWS
ELBs and NLBs.
Contributor

Is there the possibility to remove existing ones if you accept that the consequence is destroying the load balancer and creating a new one? e.g. --force-recreate-loadbalancer
I'd consider losing access to the load balancer acceptable during a maintenance window.

Contributor Author

I'd assume that this would be possible, but it is not a new problem and thus should be part of another PR. Currently, when you change your set of shared subnets, you'd already run into the same issue, as the automatically chosen subnets would then differ.

Contributor

My question was primarily around the validation side of things. If I wanted to change those values and accepted that I'd have to (manually) remove the loadbalancer, would your changes to kops reject my edits?

Contributor Author

It would accept the edits and only fail when running the update. The failure is a bit unfortunate, as it keeps retrying but of course never succeeds.

Member

A workaround for the above behavior is to manually delete the load balancer. A kops update cluster should then report that it will (re)create the load balancer, which should succeed with --yes. You'll run into certificate issues with gossip clusters though, since the load balancer's DNS name is used in various certs.

Member

Ah, I see you mentioned this a few lines below. I believe you'll also need to run kops rolling-update cluster --instance-group-roles=Master --cloudonly after kops update cluster --yes recreates the NLB, in order to pick up the new certificates.

Contributor Author

@rifelpet I did not observe the need for rolling-update while doing my testing, which I did on a semi-live (not prod, but a lot of stuff deployed) cluster.

Member

It may only be gossip clusters that need the rolling update (cluster names ending in .k8s.local). Was yours gossip or using route53? Now that I think about it more, I think the nodes themselves might need to be rolled too to pick up the new load balancer DNS name, which is pretty invasive. We could say: "It is not recommended to make this change on gossip clusters due to the requirement of performing a --cloudonly rolling update of the entire cluster, which results in temporary downtime of all workloads. It is recommended to create a new cluster instead."

pkg/model/awsmodel/api_loadbalancer.go (3 resolved review threads)
upup/pkg/fi/cloudup/awstasks/network_load_balancer.go (2 resolved review threads)
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 10, 2021
@codablock
Contributor Author

@rifelpet Thanks for your review. I've gone through it and fixed things where necessary. You can review the changes per-commit.

pkg/apis/kops/validation/aws.go (5 resolved review threads)
upup/pkg/fi/cloudup/awstasks/network_load_balancer.go (2 resolved review threads)
@codablock
Contributor Author

@rifelpet I pushed a few more commits to handle your latest comments.

@rifelpet
Member

this looks good to me, thanks for implementing this! I'll have someone else take a pass at it too

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 10, 2021
@codablock
Contributor Author

@rifelpet Can you help me find the logs of the test failures that appeared now?

@rifelpet
Member

Yea, you can click the Details button in the list of checks at the bottom of the PR. Here are the two failures:

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kops/10741/pull-kops-verify-staticcheck/1359527983353696256
looks like a pretty trivial fix in the validation test file.

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kops/10741/pull-kops-verify-cloudformation/1359527985023029248

is a bit strange. I think the line numbers are off, but it's essentially complaining that the allocation ID and private IPv4 address fields in the cloudformation JSON file are null. We should probably not include them at all (omitempty field tag) if that is the case.

@codablock
Contributor Author

@rifelpet Thanks, that helped me :) Should be fixed now.

@rifelpet
Member

/lgtm

I'll give others a chance for input; otherwise we can get this merged soon. I'll bring it up at kOps office hours on Friday too. Excited to get this in :)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2021
@codablock
Contributor Author

@rifelpet What do you think about the chances of this being cherry-picked into 1.19?

@rifelpet
Member

1.19.X releases will contain bug fixes only. Since this is a new feature, it likely won't get backported to 1.19, though we are aiming for an accelerated 1.20.0 release, so the wait shouldn't be too long.

@codablock
Contributor Author

ok, having this in 1.20.0 would already be very cool.

@rifelpet
Member

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: codablock, rifelpet

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2021
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit cd10383 into kubernetes:master Feb 14, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Feb 14, 2021
@codablock
Contributor Author

Just saw that this got merged, thanks :) Are any steps required from my side to get this cherry-picked into 1.20?

@rifelpet
Member

@codablock I took care of it in #10829

k8s-ci-robot added a commit that referenced this pull request Feb 15, 2021
…41-origin-release-1.20

Automated cherry pick of #10741: Refactor and fix NLB subnet change checks