embed: Add `insecure-health-endpoint` flag to enable Kubernetes HTTP probes #12370

ironcladlou · 2020-10-07T19:54:44Z

Before this patch, Kubernetes users deploying etcd who configure mTLS
authentication for client and metrics endpoints are unable to leverage etcd's
/health endpoint for liveness or readiness probes because Kubernetes HTTP
probes don't support mTLS. Users must work around this limitation with
strategies like weak health checks using custom exec probes. Bypassing the
/health endpoint for this purpose can lead to false positive probe results and
quorum loss if a member is falsely determined to be ready during a rollout.

This patch introduces an --insecure-health-endpoint flag which, if specified,
enables an insecure HTTP /health endpoint which is functionally equivalent to
other /health endpoints. This enables etcd pods to specify reliable HTTP GET
probes for readiness and liveness checking.

This approach stands as an alternative to introducing a parallel TLS
configuration and listener just for the /health endpoint. The typical use case
is probably a non-exposed port used only by a container orchestration component
(e.g. the Kubelet).

Fixes #11993.

codecov-io · 2020-10-07T20:51:46Z

Codecov Report

Merging #12370 into master will decrease coverage by 0.25%.
The diff coverage is 37.50%.

@@            Coverage Diff             @@
##           master   #12370      +/-   ##
==========================================
- Coverage   63.28%   63.03%   -0.26%     
==========================================
  Files         448      448              
  Lines       62342    62352      +10     
==========================================
- Hits        39454    39302     -152     
- Misses      18329    18479     +150     
- Partials     4559     4571      +12

Impacted Files	Coverage Δ
embed/config.go	`55.62% <ø> (-1.83%)`	⬇️
etcdserver/config.go	`79.51% <ø> (ø)`
embed/etcd.go	`75.27% <9.09%> (-2.09%)`	⬇️
etcdmain/config.go	`84.27% <100.00%> (+0.06%)`	⬆️
etcdserver/api/etcdhttp/metrics.go	`75.00% <100.00%> (+1.25%)`	⬆️
proxy/grpcproxy/register.go	`0.00% <0.00%> (-82.50%)`	⬇️
proxy/grpcproxy/cluster.go	`36.73% <0.00%> (-37.76%)`	⬇️
clientv3/naming/grpc.go	`64.91% <0.00%> (-10.53%)`	⬇️
clientv3/leasing/util.go	`91.66% <0.00%> (-6.67%)`	⬇️
clientv3/namespace/kv.go	`68.75% <0.00%> (-6.25%)`	⬇️
... and 24 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 24724af...0a4e9f2. Read the comment docs.

embed/config.go

ptabor

Please add an integration test for this functionality.

Implementation lgtm

ironcladlou · 2020-10-08T21:13:57Z

@ptabor Sure. Do you mean an e2e test? If you really do mean integration, could you point me to a good place to introduce one? The integration tests don't appear to work with the Etcd struct and exercise these paths. I might still be missing something.

If you did mean e2e, same question — any guidance on where this test would fix amongst existing tests? I couldn't find one that exercises things like the default metrics listener but I'm going to keep looking and playing around.

ironcladlou · 2020-10-09T13:49:41Z

@ptabor I added an e2e test covering the new endpoint and documented the field. Thanks for the feedback!

ptabor · 2020-10-09T14:06:00Z

@ironcladlou Thank you, looks awesome.

LGTM - as a recommendation to etcd maintainers.

ironcladlou · 2020-10-09T15:39:48Z

The e2e fails in CI:

--- FAIL: TestInsecureHealthEndpointEnabled (1.04s)
    metrics_test.go:85: Get "http://localhost:20003/health": dial tcp 127.0.0.1:20003: connect: connection refused
FAIL
FAIL	go.etcd.io/etcd/tests/v3/e2e	320.513s
FAIL
FAIL: (code:1):
  % (cd tests && env go test -timeout=30m go.etcd.io/etcd/tests/v3/e2e)

ERROR: Tests for following packages failed:
   go.etcd.io/etcd/tests/v3/e2e
FAIL: 'e2e' failed at Fri Oct  9 14:15:54 UTC 2020
make: *** [docker-test] Error 1

I'll have to figure that out. Works locally, so something about what I'm doing here is not portable. Perhaps an environmental idiosyncrasy with localhost. I also need to add short timeouts to the client call.

ptabor · 2020-10-09T16:11:05Z

I think majority of tests (but mostly integrational) are using unix://sockets for communication.

BTW: Please name your commit:
embed: Add insecure-health-endpoint flag to enable Kubernetes HTTP probes

to mitigate:

92a8037 Add `insecure-health-endpoint` flag to enable Kubernetes HTTP probes...
Expected commit title format '<package>{", "<package>}: <description>'
Got: 92a8037 Add `insecure-health-endpoint` flag to enable Kubernetes HTTP probes
FAIL: 'commit_title' failed at Fri Oct  9 14:46:23 UTC 2020

ironcladlou · 2020-10-12T14:51:46Z

I was actually able to reproduce the e2e failure natively on darwin, so I'm proceeding with debugging...

…probes Before this patch, Kubernetes users deploying etcd who configure mTLS authentication for client and metrics endpoints are unable to leverage etcd's `/health` endpoint for liveness or readiness probes because Kubernetes HTTP probes don't support mTLS. Users must work around this limitation with strategies like weak health checks using custom `exec` probes. Bypassing the `/health` endpoint for this purpose can lead to false positive probe results and quorum loss if a member is falsely determined to be ready during a rollout. This patch introduces an `--insecure-health-endpoint` flag which, if specified, enables an insecure HTTP `/health` endpoint which is functionally equivalent to other `/health` endpoints. This enables etcd pods to specify reliable HTTP GET probes for readiness and liveness checking. This approach stands as an alternative to introducing a parallel TLS configuration and listener just for the `/health` endpoint. The typical use case is probably a non-exposed port used only by a container orchestration component (e.g. the Kubelet). Fixes etcd-io#11993.

ironcladlou · 2020-10-12T19:00:03Z

@ptabor fixed the test and the commit message.

The e2e problem was that v2 was sometimes being enabled in the test etcd server due to various tests mutating shared mutable fixtures (in this case, configTLS) causing non-deterministic behavior depending on the test set, order, etc.

For stability and as a matter of general good practice, I think all the tests should be audited and changed as necessary to work with copies of fixtures, but that's outside the scope of this PR.

ptabor · 2020-10-13T20:48:26Z

LGTM. Thank you for root causing that.

Could you, please point me (or better: open an issue) with a link to exemplar configTLS modification in tests harness ?

ironcladlou · 2020-10-13T21:12:29Z

@ptabor
Two quick examples from a static analysis of writes to one of the fixture struct fields:

etcd/tests/e2e/ctl_v2_test.go

Lines 271 to 275 in 3f449a8

    
           cfg2 := configNoTLS 
        
           cfg2.dataDirPath = backupDir 
        
           cfg2.keepDataDir = true 
        
           cfg2.forceNewCluster = true 
        
           cfg2.enableV2 = true

etcd/tests/e2e/ctl_v3_migrate_test.go

Lines 31 to 32 in 96cce20

    
           cfg := configNoTLS 
        
           cfg.enableV2 = true

Since this is fresh on my brain, I'll go ahead and do a refactor to fix everything and open a new PR.

The e2e tests can be flaky depending on due to various tests mutating shared mutable fixtures, causing non-deterministic behavior depending on the test set, order, etc. For example, `configTLS` is mutated in at least two tests in such a way that the config is potentially invalidated for any subsequent test running in the same process (e.g. by setting the `enableV2` field`). This particular example caused a substantial amount of confusion diagnosing the new test introduced for etcd-io#12370. Independent tests should not share mutable state unless deliberately. This patch refactors the e2e test config fixtures to safeguard against these problems by replacing the package variables (which cannot be made immutable) with functions that return new instances.

ironcladlou · 2020-10-13T21:38:38Z

Here's a PR which starts to fix the test state issues: #12392

The e2e tests can be flaky due to various tests mutating shared mutable fixtures, causing non-deterministic behavior depending on the test set, order, etc. For example, `configTLS` is mutated in at least two tests in such a way that the config is potentially invalidated for any subsequent test running in the same process (e.g. by setting the `enableV2` field). This particular example caused a substantial amount of confusion diagnosing the new test introduced for etcd-io#12370. Independent tests should not share mutable state unless deliberately. This patch refactors the e2e test config fixtures to safeguard against these problems by replacing the package variables (which cannot easily be made immutable) with functions that return new instances.

ironcladlou · 2020-10-22T18:25:48Z

Somehow in all this we didn't realize that additional metrics endpoints (which would also serve health) can be specified without TLS info: https://github.com/openshift/cluster-etcd-operator/blob/e1dae76a52f2ae0aa31857c5d4249e78ad47c6e0/bindata/etcd/pod.yaml#L168

Assuming the user is okay with exposing the insecure /metrics endpoint in this way in addition to /health (which I think is true in the health check scenario) maybe there's no reason for #11993 or this PR after all. I'll test and report back.

ironcladlou · 2020-10-23T12:39:42Z

Looks like #12370 (comment) was the answer all along. Thanks to everyone!

/close

ironcladlou · 2020-11-19T15:45:55Z

I'm reopening this per #11993 (comment).

ironcladlou · 2020-11-19T15:48:08Z

@ptabor @gyuho please see #11993 (comment) — I think we still need this, otherwise it won't be possible to expose the health endpoint over HTTP without also exposing metrics insecurely, which is a constraint for us. If the approach is still agreeable I can rebase.

stale · 2021-02-17T19:58:31Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

gyuho · 2021-03-09T17:16:27Z

@ironcladlou Could you rebase? We'd like this to be in our 3.5 release. /cc @ptabor

stale · 2021-08-11T16:00:53Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

new-dream · 2022-11-24T04:01:58Z

@ironcladlou Could you rebase? We need this feature，thanks

ironcladlou mentioned this pull request Oct 7, 2020

The health endpoint should have its own listener TLS configuration #11993

Closed

ptabor reviewed Oct 8, 2020

View reviewed changes

embed/config.go Show resolved Hide resolved

ptabor suggested changes Oct 8, 2020

View reviewed changes

ironcladlou force-pushed the insecure-health-endpoint branch 2 times, most recently from a4ded77 to 003da10 Compare October 9, 2020 13:49

ironcladlou force-pushed the insecure-health-endpoint branch from 003da10 to 92a8037 Compare October 9, 2020 14:04

ironcladlou force-pushed the insecure-health-endpoint branch from 92a8037 to 79e8d82 Compare October 12, 2020 18:34

ironcladlou changed the title ~~Add insecure-health-endpoint flag to enable Kubernetes HTTP probes~~ embed: Add insecure-health-endpoint flag to enable Kubernetes HTTP probes Oct 12, 2020

ironcladlou force-pushed the insecure-health-endpoint branch from 79e8d82 to 0a4e9f2 Compare October 12, 2020 18:55

ironcladlou mentioned this pull request Oct 13, 2020

tests: prevent cross-test contamination via shared state #12392

Merged

ironcladlou closed this Nov 18, 2020

ironcladlou reopened this Nov 19, 2020

stale bot added the stale label Feb 17, 2021

stale bot removed the stale label Mar 9, 2021

stale bot added the stale label Aug 11, 2021

stale bot closed this Sep 2, 2021

timuthy mentioned this pull request Aug 5, 2022

Fix liveness and readiness probes gardener/etcd-druid#396

Merged

ptabor mentioned this pull request Mar 6, 2023

Fix watch stream starvation #15403

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

embed: Add `insecure-health-endpoint` flag to enable Kubernetes HTTP probes #12370

embed: Add `insecure-health-endpoint` flag to enable Kubernetes HTTP probes #12370

ironcladlou commented Oct 7, 2020

codecov-io commented Oct 7, 2020 •

edited

Loading

ptabor left a comment •

edited

Loading

ironcladlou commented Oct 8, 2020

ironcladlou commented Oct 9, 2020

ptabor commented Oct 9, 2020

ironcladlou commented Oct 9, 2020

ptabor commented Oct 9, 2020 •

edited

Loading

ironcladlou commented Oct 12, 2020

ironcladlou commented Oct 12, 2020

ptabor commented Oct 13, 2020 •

edited

Loading

ironcladlou commented Oct 13, 2020

ironcladlou commented Oct 13, 2020

ironcladlou commented Oct 22, 2020

ironcladlou commented Oct 23, 2020

ironcladlou commented Nov 19, 2020

ironcladlou commented Nov 19, 2020

stale bot commented Feb 17, 2021

gyuho commented Mar 9, 2021

stale bot commented Aug 11, 2021

new-dream commented Nov 24, 2022

embed: Add insecure-health-endpoint flag to enable Kubernetes HTTP probes #12370

embed: Add insecure-health-endpoint flag to enable Kubernetes HTTP probes #12370

Conversation

ironcladlou commented Oct 7, 2020

codecov-io commented Oct 7, 2020 • edited Loading

Codecov Report

ptabor left a comment • edited Loading

Choose a reason for hiding this comment

ironcladlou commented Oct 8, 2020

ironcladlou commented Oct 9, 2020

ptabor commented Oct 9, 2020

ironcladlou commented Oct 9, 2020

ptabor commented Oct 9, 2020 • edited Loading

ironcladlou commented Oct 12, 2020

ironcladlou commented Oct 12, 2020

ptabor commented Oct 13, 2020 • edited Loading

ironcladlou commented Oct 13, 2020

ironcladlou commented Oct 13, 2020

ironcladlou commented Oct 22, 2020

ironcladlou commented Oct 23, 2020

ironcladlou commented Nov 19, 2020

ironcladlou commented Nov 19, 2020

stale bot commented Feb 17, 2021

gyuho commented Mar 9, 2021

stale bot commented Aug 11, 2021

new-dream commented Nov 24, 2022

embed: Add `insecure-health-endpoint` flag to enable Kubernetes HTTP probes #12370

embed: Add `insecure-health-endpoint` flag to enable Kubernetes HTTP probes #12370

codecov-io commented Oct 7, 2020 •

edited

Loading

ptabor left a comment •

edited

Loading

ptabor commented Oct 9, 2020 •

edited

Loading

ptabor commented Oct 13, 2020 •

edited

Loading