Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

embed: Add insecure-health-endpoint flag to enable Kubernetes HTTP probes #12370

Closed

Conversation

ironcladlou
Copy link
Contributor

Before this patch, Kubernetes users deploying etcd who configure mTLS
authentication for client and metrics endpoints are unable to leverage etcd's
/health endpoint for liveness or readiness probes because Kubernetes HTTP
probes don't support mTLS. Users must work around this limitation with
strategies like weak health checks using custom exec probes. Bypassing the
/health endpoint for this purpose can lead to false positive probe results and
quorum loss if a member is falsely determined to be ready during a rollout.

This patch introduces an --insecure-health-endpoint flag which, if specified,
enables an insecure HTTP /health endpoint which is functionally equivalent to
other /health endpoints. This enables etcd pods to specify reliable HTTP GET
probes for readiness and liveness checking.

This approach stands as an alternative to introducing a parallel TLS
configuration and listener just for the /health endpoint. The typical use case
is probably a non-exposed port used only by a container orchestration component
(e.g. the Kubelet).

Fixes #11993.

@codecov-io
Copy link

codecov-io commented Oct 7, 2020

Codecov Report

Merging #12370 into master will decrease coverage by 0.25%.
The diff coverage is 37.50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #12370      +/-   ##
==========================================
- Coverage   63.28%   63.03%   -0.26%     
==========================================
  Files         448      448              
  Lines       62342    62352      +10     
==========================================
- Hits        39454    39302     -152     
- Misses      18329    18479     +150     
- Partials     4559     4571      +12     
Impacted Files Coverage Δ
embed/config.go 55.62% <ø> (-1.83%) ⬇️
etcdserver/config.go 79.51% <ø> (ø)
embed/etcd.go 75.27% <9.09%> (-2.09%) ⬇️
etcdmain/config.go 84.27% <100.00%> (+0.06%) ⬆️
etcdserver/api/etcdhttp/metrics.go 75.00% <100.00%> (+1.25%) ⬆️
proxy/grpcproxy/register.go 0.00% <0.00%> (-82.50%) ⬇️
proxy/grpcproxy/cluster.go 36.73% <0.00%> (-37.76%) ⬇️
clientv3/naming/grpc.go 64.91% <0.00%> (-10.53%) ⬇️
clientv3/leasing/util.go 91.66% <0.00%> (-6.67%) ⬇️
clientv3/namespace/kv.go 68.75% <0.00%> (-6.25%) ⬇️
... and 24 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24724af...0a4e9f2. Read the comment docs.

embed/config.go Show resolved Hide resolved
Copy link
Contributor

@ptabor ptabor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add an integration test for this functionality.

Implementation lgtm

@ironcladlou
Copy link
Contributor Author

@ptabor Sure. Do you mean an e2e test? If you really do mean integration, could you point me to a good place to introduce one? The integration tests don't appear to work with the Etcd struct and exercise these paths. I might still be missing something.

If you did mean e2e, same question — any guidance on where this test would fix amongst existing tests? I couldn't find one that exercises things like the default metrics listener but I'm going to keep looking and playing around.

@ironcladlou ironcladlou force-pushed the insecure-health-endpoint branch 2 times, most recently from a4ded77 to 003da10 Compare October 9, 2020 13:49
@ironcladlou
Copy link
Contributor Author

@ptabor I added an e2e test covering the new endpoint and documented the field. Thanks for the feedback!

@ironcladlou ironcladlou force-pushed the insecure-health-endpoint branch from 003da10 to 92a8037 Compare October 9, 2020 14:04
@ptabor
Copy link
Contributor

ptabor commented Oct 9, 2020

@ironcladlou Thank you, looks awesome.

LGTM - as a recommendation to etcd maintainers.

@ironcladlou
Copy link
Contributor Author

The e2e fails in CI:

--- FAIL: TestInsecureHealthEndpointEnabled (1.04s)
    metrics_test.go:85: Get "http://localhost:20003/health": dial tcp 127.0.0.1:20003: connect: connection refused
FAIL
FAIL	go.etcd.io/etcd/tests/v3/e2e	320.513s
FAIL
FAIL: (code:1):
  % (cd tests && env go test -timeout=30m go.etcd.io/etcd/tests/v3/e2e)

ERROR: Tests for following packages failed:
   go.etcd.io/etcd/tests/v3/e2e
FAIL: 'e2e' failed at Fri Oct  9 14:15:54 UTC 2020
make: *** [docker-test] Error 1

I'll have to figure that out. Works locally, so something about what I'm doing here is not portable. Perhaps an environmental idiosyncrasy with localhost. I also need to add short timeouts to the client call.

@ptabor
Copy link
Contributor

ptabor commented Oct 9, 2020

I think majority of tests (but mostly integrational) are using unix://sockets for communication.

BTW: Please name your commit:
embed: Add insecure-health-endpoint flag to enable Kubernetes HTTP probes

to mitigate:

92a8037 Add `insecure-health-endpoint` flag to enable Kubernetes HTTP probes...
Expected commit title format '<package>{", "<package>}: <description>'
Got: 92a8037 Add `insecure-health-endpoint` flag to enable Kubernetes HTTP probes
FAIL: 'commit_title' failed at Fri Oct  9 14:46:23 UTC 2020

@ironcladlou
Copy link
Contributor Author

I was actually able to reproduce the e2e failure natively on darwin, so I'm proceeding with debugging...

@ironcladlou ironcladlou force-pushed the insecure-health-endpoint branch from 92a8037 to 79e8d82 Compare October 12, 2020 18:34
@ironcladlou ironcladlou changed the title Add insecure-health-endpoint flag to enable Kubernetes HTTP probes embed: Add insecure-health-endpoint flag to enable Kubernetes HTTP probes Oct 12, 2020
…probes

Before this patch, Kubernetes users deploying etcd who configure mTLS
authentication for client and metrics endpoints are unable to leverage etcd's
`/health` endpoint for liveness or readiness probes because Kubernetes HTTP
probes don't support mTLS. Users must work around this limitation with
strategies like weak health checks using custom `exec` probes. Bypassing the
`/health` endpoint for this purpose can lead to false positive probe results and
quorum loss if a member is falsely determined to be ready during a rollout.

This patch introduces an `--insecure-health-endpoint` flag which, if specified,
enables an insecure HTTP `/health` endpoint which is functionally equivalent to
other `/health` endpoints. This enables etcd pods to specify reliable HTTP GET
probes for readiness and liveness checking.

This approach stands as an alternative to introducing a parallel TLS
configuration and listener just for the `/health` endpoint. The typical use case
is probably a non-exposed port used only by a container orchestration component
(e.g. the Kubelet).

Fixes etcd-io#11993.
@ironcladlou ironcladlou force-pushed the insecure-health-endpoint branch from 79e8d82 to 0a4e9f2 Compare October 12, 2020 18:55
@ironcladlou
Copy link
Contributor Author

@ptabor fixed the test and the commit message.

The e2e problem was that v2 was sometimes being enabled in the test etcd server due to various tests mutating shared mutable fixtures (in this case, configTLS) causing non-deterministic behavior depending on the test set, order, etc.

For stability and as a matter of general good practice, I think all the tests should be audited and changed as necessary to work with copies of fixtures, but that's outside the scope of this PR.

@ptabor
Copy link
Contributor

ptabor commented Oct 13, 2020

LGTM. Thank you for root causing that.

Could you, please point me (or better: open an issue) with a link to exemplar configTLS modification in tests harness ?

@ironcladlou
Copy link
Contributor Author

@ptabor
Two quick examples from a static analysis of writes to one of the fixture struct fields:

cfg2 := configNoTLS
cfg2.dataDirPath = backupDir
cfg2.keepDataDir = true
cfg2.forceNewCluster = true
cfg2.enableV2 = true

cfg := configNoTLS
cfg.enableV2 = true

Since this is fresh on my brain, I'll go ahead and do a refactor to fix everything and open a new PR.

ironcladlou added a commit to ironcladlou/etcd that referenced this pull request Oct 13, 2020
The e2e tests can be flaky depending on due to various tests mutating shared
mutable fixtures, causing non-deterministic behavior depending on the test set,
order, etc.

For example, `configTLS` is mutated in at least two tests in such a way that the
config is potentially invalidated for any subsequent test running in the same
process (e.g. by setting the `enableV2` field`). This particular example caused
a substantial amount of confusion diagnosing the new test introduced for
etcd-io#12370.

Independent tests should not share mutable state unless deliberately. This patch
refactors the e2e test config fixtures to safeguard against these problems by
replacing the package variables (which cannot be made immutable) with functions
that return new instances.
@ironcladlou
Copy link
Contributor Author

Here's a PR which starts to fix the test state issues: #12392

ironcladlou added a commit to ironcladlou/etcd that referenced this pull request Oct 14, 2020
The e2e tests can be flaky due to various tests mutating shared mutable
fixtures, causing non-deterministic behavior depending on the test set, order,
etc.

For example, `configTLS` is mutated in at least two tests in such a way that the
config is potentially invalidated for any subsequent test running in the same
process (e.g. by setting the `enableV2` field). This particular example caused
a substantial amount of confusion diagnosing the new test introduced for
etcd-io#12370.

Independent tests should not share mutable state unless deliberately. This patch
refactors the e2e test config fixtures to safeguard against these problems by
replacing the package variables (which cannot easily be made immutable) with
functions that return new instances.
@ironcladlou
Copy link
Contributor Author

Somehow in all this we didn't realize that additional metrics endpoints (which would also serve health) can be specified without TLS info: https://github.com/openshift/cluster-etcd-operator/blob/e1dae76a52f2ae0aa31857c5d4249e78ad47c6e0/bindata/etcd/pod.yaml#L168

Assuming the user is okay with exposing the insecure /metrics endpoint in this way in addition to /health (which I think is true in the health check scenario) maybe there's no reason for #11993 or this PR after all. I'll test and report back.

@ironcladlou
Copy link
Contributor Author

Looks like #12370 (comment) was the answer all along. Thanks to everyone!

/close

@ironcladlou ironcladlou reopened this Nov 19, 2020
@ironcladlou
Copy link
Contributor Author

I'm reopening this per #11993 (comment).

@ironcladlou
Copy link
Contributor Author

@ptabor @gyuho please see #11993 (comment) — I think we still need this, otherwise it won't be possible to expose the health endpoint over HTTP without also exposing metrics insecurely, which is a constraint for us. If the approach is still agreeable I can rebase.

@stale
Copy link

stale bot commented Feb 17, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 17, 2021
@gyuho
Copy link
Contributor

gyuho commented Mar 9, 2021

@ironcladlou Could you rebase? We'd like this to be in our 3.5 release. /cc @ptabor

@stale stale bot removed the stale label Mar 9, 2021
@stale
Copy link

stale bot commented Aug 11, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@new-dream
Copy link
Contributor

@ironcladlou Could you rebase? We need this feature,thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging this pull request may close these issues.

The health endpoint should have its own listener TLS configuration
5 participants