Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Embed server should log errors (and not get stuck) #14116

Merged
merged 2 commits into from
Jun 15, 2022

Conversation

ptabor
Copy link
Contributor

@ptabor ptabor commented Jun 15, 2022

This PR adds:

  1. reacher logging to embed server, to show reasons why in particular tests got stuck with Bump OpenTelemetry dependencies to v1.7.0 #14104 applied
  2. fix the reason why they got actually stuck - i.e. sctx.serversC being not closed in case of earlier exit from the Serve function:

Examplar log with that change (assuming grpc updated to 1.48 like in #14104):

$ cd tests/integration/clientv3/snapshot
$ go test --run=TestSaveSnapshotFilePermissions -v -timeout=1m
=== RUN   TestSaveSnapshotFilePermissions
    logger.go:130: 2022-06-15T12:30:57.656+0200	INFO	default	configuring peer listeners	{"listen-peer-urls": ["unix://localhost:43638"]}
    logger.go:130: 2022-06-15T12:30:57.656+0200	INFO	default	configuring client listeners	{"listen-client-urls": ["unix://localhost:1345"]}
    logger.go:130: 2022-06-15T12:30:57.656+0200	INFO	default	starting an etcd server	{"etcd-version": "3.6.0-alpha.0", "git-sha": "Not provided (use ./build instead of go build)", "go-version": "go1.17.6", "go-os": "darwin", "go-arch": "amd64", "max-cpu-set": 8, "max-cpu-available": 8, "member-initialized": false, "name": "default", "data-dir": "/var/folders/t1/3m8z9xz93t9c3vpt7zyzjm6w00374n/T/TestSaveSnapshotFilePermissions548358693/001", "wal-dir": "", "wal-dir-dedicated": "", "member-dir": "/var/folders/t1/3m8z9xz93t9c3vpt7zyzjm6w00374n/T/TestSaveSnapshotFilePermissions548358693/001/member", "force-new-cluster": false, "heartbeat-interval": "100ms", "election-timeout": "1s", "wait-cluster-ready-timeout": "5s", "initial-election-tick-advance": true, "snapshot-count": 100000, "snapshot-catchup-entries": 5000, "initial-advertise-peer-urls": ["unix://localhost:43638"], "listen-peer-urls": ["unix://localhost:43638"], "advertise-client-urls": ["unix://localhost:1345"], "listen-client-urls": ["unix://localhost:1345"], "listen-metrics-urls": [], "cors": ["*"], "host-whitelist": ["*"], "initial-cluster": "default=unix://localhost:43638", "initial-cluster-state": "new", "initial-cluster-token": "etcd-cluster", "quota-size-bytes": 2147483648, "pre-vote": true, "initial-corrupt-check": false, "corrupt-check-time-interval": "0s", "auto-compaction-mode": "", "auto-compaction-retention": "0s", "auto-compaction-interval": "0s", "discovery-url": "", "discovery-proxy": "", "discovery-token": "", "discovery-endpoints": "", "discovery-dial-timeout": "2s", "discovery-request-timeout": "5s", "discovery-keepalive-time": "2s", "discovery-keepalive-timeout": "6s", "discovery-insecure-transport": false, "discovery-insecure-skip-tls-verify": false, "discovery-cert": "", "discovery-key": "", "discovery-cacert": "", "discovery-user": "", "downgrade-check-interval": "5s", "max-learners": 1}
    logger.go:130: 2022-06-15T12:30:57.656+0200	WARN	default	check file permission	{"error": "directory \"/var/folders/t1/3m8z9xz93t9c3vpt7zyzjm6w00374n/T/TestSaveSnapshotFilePermissions548358693/001\" exist, but the permission is \"drwxr-xr-x\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"}
    logger.go:130: 2022-06-15T12:30:57.685+0200	INFO	default	opened backend db	{"path": "/var/folders/t1/3m8z9xz93t9c3vpt7zyzjm6w00374n/T/TestSaveSnapshotFilePermissions548358693/001/member/snap/db", "took": "25.727226ms"}
    logger.go:130: 2022-06-15T12:30:57.757+0200	INFO	default	starting local member	{"local-member-id": "88ac02f4a2a23142", "cluster-id": "68c0d29cdbf7c875"}
    logger.go:130: 2022-06-15T12:30:57.757+0200	INFO	default.raft	88ac02f4a2a23142 switched to configuration voters=()
    logger.go:130: 2022-06-15T12:30:57.757+0200	INFO	default.raft	88ac02f4a2a23142 became follower at term 0
    logger.go:130: 2022-06-15T12:30:57.757+0200	INFO	default.raft	newRaft 88ac02f4a2a23142 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
    logger.go:130: 2022-06-15T12:30:57.757+0200	INFO	default.raft	88ac02f4a2a23142 became follower at term 1
    logger.go:130: 2022-06-15T12:30:57.757+0200	INFO	default.raft	88ac02f4a2a23142 switched to configuration voters=(9848249734876246338)
    logger.go:130: 2022-06-15T12:30:57.819+0200	WARN	default	simple token is not cryptographically signed
    logger.go:130: 2022-06-15T12:30:57.860+0200	INFO	default	kvstore restored	{"current-rev": 1}
    logger.go:130: 2022-06-15T12:30:57.940+0200	INFO	default	enabled backend quota with default value	{"quota-name": "v3-applier", "quota-size-bytes": 2147483648, "quota-size": "2.1 GB"}
    logger.go:130: 2022-06-15T12:30:57.940+0200	INFO	default	starting etcd server	{"local-member-id": "88ac02f4a2a23142", "local-server-version": "3.6.0-alpha.0", "cluster-version": "to_be_decided"}
    logger.go:130: 2022-06-15T12:30:57.940+0200	WARN	default	failed to get file descriptor usage	{"error": "cannot get FDUsage on darwin"}
    logger.go:130: 2022-06-15T12:30:57.941+0200	INFO	default	now serving peer/client/metrics	{"local-member-id": "88ac02f4a2a23142", "initial-advertise-peer-urls": ["unix://localhost:43638"], "listen-peer-urls": ["unix://localhost:43638"], "advertise-client-urls": ["unix://localhost:1345"], "listen-client-urls": ["unix://localhost:1345"], "listen-metrics-urls": []}
    logger.go:130: 2022-06-15T12:30:57.941+0200	INFO	default	serving peer traffic	{"address": "localhost:43638"}
    logger.go:130: 2022-06-15T12:30:57.941+0200	INFO	default	cmux::serve	{"address": "localhost:43638"}
    logger.go:130: 2022-06-15T12:30:57.941+0200	INFO	default	started as single-node; fast-forwarding election ticks	{"local-member-id": "88ac02f4a2a23142", "forward-ticks": 9, "forward-duration": "900ms", "election-ticks": 10, "election-timeout": "1s"}
    logger.go:130: 2022-06-15T12:30:57.942+0200	INFO	default.raft	88ac02f4a2a23142 switched to configuration voters=(9848249734876246338)
    logger.go:130: 2022-06-15T12:30:57.943+0200	INFO	default	added member	{"cluster-id": "68c0d29cdbf7c875", "local-member-id": "88ac02f4a2a23142", "added-peer-id": "88ac02f4a2a23142", "added-peer-peer-urls": ["unix://localhost:43638"], "added-peer-is-learner": false}
    logger.go:130: 2022-06-15T12:30:57.961+0200	INFO	default.raft	88ac02f4a2a23142 is starting a new election at term 1
    logger.go:130: 2022-06-15T12:30:57.961+0200	INFO	default.raft	88ac02f4a2a23142 became pre-candidate at term 1
    logger.go:130: 2022-06-15T12:30:57.961+0200	INFO	default.raft	88ac02f4a2a23142 received MsgPreVoteResp from 88ac02f4a2a23142 at term 1
    logger.go:130: 2022-06-15T12:30:57.961+0200	INFO	default.raft	88ac02f4a2a23142 became candidate at term 2
    logger.go:130: 2022-06-15T12:30:57.961+0200	INFO	default.raft	88ac02f4a2a23142 received MsgVoteResp from 88ac02f4a2a23142 at term 2
    logger.go:130: 2022-06-15T12:30:57.961+0200	INFO	default.raft	88ac02f4a2a23142 became leader at term 2
    logger.go:130: 2022-06-15T12:30:57.961+0200	INFO	default.raft	raft.node: 88ac02f4a2a23142 elected leader 88ac02f4a2a23142 at term 2
    logger.go:130: 2022-06-15T12:30:57.962+0200	INFO	default	published local member to cluster through raft	{"local-member-id": "88ac02f4a2a23142", "local-member-attributes": "{Name:default ClientURLs:[unix://localhost:1345]}", "cluster-id": "68c0d29cdbf7c875", "publish-timeout": "7s"}
    logger.go:130: 2022-06-15T12:30:57.962+0200	INFO	default	setting up initial cluster version using v3 API	{"cluster-version": "3.6"}
    logger.go:130: 2022-06-15T12:30:57.962+0200	INFO	default	ready to serve client requests
!!!    logger.go:130: 2022-06-15T12:30:57.966+0200	ERROR	default	registerGateway failed:	{"error": "failed to build !!! resolver: invalid (non-empty) authority: localhost:1345 {unix localhost:1345  {unix   localhost:1345   false   }}"}
    logger.go:130: 2022-06-15T12:30:57.966+0200	WARN	default	stopping grpc server due to error	{"error": "failed to build resolver: invalid (non-empty) authority: localhost:1345 {unix localhost:1345  {unix   localhost:1345   false   }}"}
    logger.go:130: 2022-06-15T12:30:57.982+0200	INFO	default	set initial cluster version	{"cluster-id": "68c0d29cdbf7c875", "local-member-id": "88ac02f4a2a23142", "cluster-version": "3.6"}
    logger.go:130: 2022-06-15T12:30:57.982+0200	INFO	default	enabled capabilities for version	{"cluster-version": "3.6"}
    logger.go:130: 2022-06-15T12:30:57.982+0200	INFO	default	cluster version is updated	{"cluster-version": "3.6"}
    logger.go:130: 2022-06-15T12:30:57.982+0200	WARN	default	Failed to detect schema version	{"error": "missing confstate information"}
    logger.go:130: 2022-06-15T12:30:57.983+0200	ERROR	default	failed to update storage version	{"cluster-version": "3.6.0", "error": "cannot detect storage schema version: missing confstate information"}
    logger.go:130: 2022-06-15T12:31:00.964+0200	WARN	client	retrying of unary invoker failed	{"target": "etcd-endpoints://0xc00000c3c0/localhost:1345", "method": "/etcdserverpb.KV/Put", "attempt": 0, "error": "rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial unix localhost:1345: connect: no such file or directory\""}
    v3_snapshot_test.go:118: context deadline exceeded
    logger.go:130: 2022-06-15T12:31:00.965+0200	INFO	default	closing etcd server	{"name": "default", "data-dir": "/var/folders/t1/3m8z9xz93t9c3vpt7zyzjm6w00374n/T/TestSaveSnapshotFilePermissions548358693/001", "advertise-peer-urls": ["unix://localhost:43638"], "advertise-client-urls": ["unix://localhost:1345"]}
    logger.go:130: 2022-06-15T12:31:00.965+0200	INFO	default	skipped leadership transfer for single voting member cluster	{"local-member-id": "88ac02f4a2a23142", "current-leader-member-id": "88ac02f4a2a23142"}
    logger.go:130: 2022-06-15T12:31:01.052+0200	INFO	default	stopping serving peer traffic	{"address": "localhost:43638"}
!!!    logger.go:130: 2022-06-15T12:31:01.053+0200	ERROR	default	setting up serving from embedded etcd failed.	{"error": "accept unix localhost:43638: use of closed network connection"}
    logger.go:130: 2022-06-15T12:31:01.055+0200	INFO	default	stopped serving peer traffic	{"address": "localhost:43638"}
    logger.go:130: 2022-06-15T12:31:01.055+0200	INFO	default	closed etcd server	{"name": "default", "data-dir": "/var/folders/t1/3m8z9xz93t9c3vpt7zyzjm6w00374n/T/TestSaveSnapshotFilePermissions548358693/001", "advertise-peer-urls": ["unix://localhost:43638"], "advertise-client-urls": ["unix://localhost:1345"]}
--- FAIL: TestSaveSnapshotFilePermissions (3.41s)
FAIL
exit status 1
FAIL	go.etcd.io/etcd/tests/v3/integration/clientv3/snapshot	4.336s
ptab-macbookpro:snapshot ptab$ git diff

@ptabor ptabor force-pushed the 20220613-embed-errors branch from 1d6b1fd to 42c4ee5 Compare June 15, 2022 10:57
server/embed/serve.go Outdated Show resolved Hide resolved
server/embed/serve.go Outdated Show resolved Hide resolved
Copy link
Member

@serathius serathius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small nits, feel free to ignore

@ptabor ptabor force-pushed the 20220613-embed-errors branch from 42c4ee5 to 0337e0d Compare June 15, 2022 11:49
ptabor added 2 commits June 15, 2022 13:50
… start.

So far the errors were directed to Etcd.Errc (channel) that is not being consumed in practice.

Signed-off-by: Piotr Tabor <[email protected]>
Before this change:

```
gwmux, err = sctx.registerGateway([]grpc.DialOption{grpc.WithInsecure()})
if err != nil { return err; }
```

caused the Etcd.Serve() method finish without sctx.serversC channel being closed.
The code closing Etcd got stuck waiting for all the servers to be running.

Signed-off-by: Piotr Tabor <[email protected]>
@ptabor ptabor force-pushed the 20220613-embed-errors branch from 0337e0d to 1c40158 Compare June 15, 2022 11:50
Copy link
Member

@spzala spzala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Thanks @ptabor

@ptabor
Copy link
Contributor Author

ptabor commented Jun 15, 2022

Thank you for review.

@ptabor ptabor merged commit a1fe0b8 into etcd-io:main Jun 15, 2022
@ptabor ptabor deleted the 20220613-embed-errors branch June 15, 2022 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants