Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange memory usage pattern after v3.1.3 -> v3.2.6 upgrade #8472

Closed
danielfm opened this issue Aug 31, 2017 · 50 comments
Closed

Strange memory usage pattern after v3.1.3 -> v3.2.6 upgrade #8472

danielfm opened this issue Aug 31, 2017 · 50 comments

Comments

@danielfm
Copy link

danielfm commented Aug 31, 2017

First reported this issue in kubernetes-retired/kube-aws#795.

I'm experiencing a strange memory usage pattern after upgrading etcd from v3.1.3 to v3.2.6:

image

We can see very clearly the memory consumption used to stay stable until the update, after which the memory consumption went through the roof. Despite the high memory usage, so far it seems the load is under control, and found no errors in the logs:

$ uptime
 17:44:23 up  5:32,  1 user,  load average: 0.22, 0.23, 0.20

$ free -m
             total       used       free     shared    buffers     cached
Mem:          7482       2706       4776         66        215       1288
-/+ buffers/cache:       1201       6280
Swap:            0          0          0

My environment:

Is this behavior expected?

@heyitsanthony
Copy link
Contributor

Possibly from GC. Did anything change around 15:00?

@gyuho does this happen with kubemark?

@gyuho
Copy link
Contributor

gyuho commented Aug 31, 2017

Never seen it in our kubemark tests http://dash.etcd.io/dashboard/db/test-etcd-kubernetes?orgId=1.

@danielfm
Copy link
Author

@heyitsanthony

Possibly from GC. Did anything change around 15:00?

No, the drop happened without any kind of intervention or change on my part.

@danielfm
Copy link
Author

Just so you know, I'm running Kubernetes v1.7.4.

Do you folks have some tips on how can I proceed to find out what's happening, and if it's indeed a problem? Right now I'm trying to take a heap dump from etcd, but I'm having trouble doing so due to the fact I'm running etcd from inside rkt.

@xiang90
Copy link
Contributor

xiang90 commented Aug 31, 2017

@danielfm Probably you can run etcd with go GC trace enable. I would suspect it is due to less aggressive GC.

@danielfm
Copy link
Author

@xiang90 @heyitsanthony I spent some time looking at our Prometheus to find out other metrics that could help explain that memory usage, and here goes some of them:

sum(go_memstats_alloc_bytes{job="etcd"}) by (instance)
image

sum(go_memstats_heap_idle_bytes{job="etcd"}) by (instance)
image

sum(go_memstats_heap_sys_bytes{job="etcd"}) by (instance)
image

sum(go_memstats_heap_inuse_bytes{job="etcd"}) by (instance)
image

sum(go_memstats_next_gc_bytes{job="etcd"}) by (instance)
image

sum(rate(go_gc_duration_seconds_count{job="etcd"}[2m])) by (instance) * 60
image

What do you think after looking at these graphs? Is the GC responsible for this behavior? If you need other metrics in order to answer this, please let me know and I'll get them for you.

@xiang90
Copy link
Contributor

xiang90 commented Aug 31, 2017

@danielfm Do you have any CPU usage and QPS data?

@danielfm
Copy link
Author

danielfm commented Aug 31, 2017

Not sure what you mean by QPS data, but I'm sending the number of RPC calls per second.

sum(rate(grpc_server_started_total{job="etcd", grpc_type="unary"}[5m])) by (instance)
image

Edit: sum(rate(grpc_server_started_total{job="etcd", grpc_type="unary"}[5m])) by (grpc_method)
image

sum(rate(process_cpu_seconds_total{job="etcd"}[2m])) by (instance) * 100
image

@danielfm
Copy link
Author

danielfm commented Sep 1, 2017

I managed to take a heap dumps from a running instance, and this is what happened when taking successive heap dumps:

$ go tool pprof ./etcd ./heap.out
File: etcd
Type: inuse_space
Time: Sep 1, 2017 at 1:45pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 216.48MB, 98.18% of 220.49MB total
Dropped 55 nodes (cum <= 1.10MB)
Showing top 10 nodes out of 36
      flat  flat%   sum%        cum   cum%
  191.07MB 86.66% 86.66%   191.07MB 86.66%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb.(*Entry).Unmarshal /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb/raft.pb.go
    9.86MB  4.47% 91.13%     9.86MB  4.47%  runtime.makechan /usr/local/go/src/runtime/chan.go
    5.43MB  2.46% 93.59%     5.43MB  2.46%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*MemoryStorage).Append /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/storage.go
    2.35MB  1.07% 94.65%     2.35MB  1.07%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb.(*Snapshot).Unmarshal /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb/raft.pb.go
    2.31MB  1.05% 95.70%     9.37MB  4.25%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp.(*streamReader).decodeLoop /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp/stream.go
    2.31MB  1.05% 96.75%     2.31MB  1.05%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp.(*streamWriter).run /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp/stream.go
       2MB  0.91% 97.66%        2MB  0.91%  runtime.rawstringtmp /usr/local/go/src/runtime/string.go
    1.16MB  0.52% 98.18%     1.16MB  0.52%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/wal.newFileEncoder /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/wal/encoder.go
         0     0% 98.18%        2MB  0.91%  encoding/json.(*decodeState).literal /usr/local/go/src/encoding/json/decode.go
         0     0% 98.18%        2MB  0.91%  encoding/json.(*decodeState).literalStore /usr/local/go/src/encoding/json/decode.go

...
(Some time later)

$ go tool pprof ./etcd ./heap.out
File: etcd
Type: inuse_space
Time: Sep 1, 2017 at 2:04pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 248.08MB, 98.15% of 252.76MB total
Dropped 75 nodes (cum <= 1.26MB)
Showing top 10 nodes out of 36
      flat  flat%   sum%        cum   cum%
  222.32MB 87.96% 87.96%   222.32MB 87.96%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb.(*Entry).Unmarshal /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb/raft.pb.go
    9.86MB  3.90% 91.86%     9.86MB  3.90%  runtime.makechan /usr/local/go/src/runtime/chan.go
    5.43MB  2.15% 94.01%     5.43MB  2.15%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*MemoryStorage).Append /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/storage.go
    2.35MB  0.93% 94.94%     2.35MB  0.93%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb.(*Snapshot).Unmarshal /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb/raft.pb.go
    2.31MB  0.92% 95.85%    40.62MB 16.07%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp.(*streamReader).decodeLoop /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp/stream.go
    2.31MB  0.92% 96.77%     2.31MB  0.92%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp.(*streamWriter).run /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp/stream.go
       2MB  0.79% 97.56%        2MB  0.79%  runtime.rawstringtmp /usr/local/go/src/runtime/string.go
    1.50MB  0.59% 98.15%     1.50MB  0.59%  runtime.mapassign /usr/local/go/src/runtime/hashmap.go
         0     0% 98.15%        2MB  0.79%  encoding/json.(*decodeState).literal /usr/local/go/src/encoding/json/decode.go
         0     0% 98.15%        2MB  0.79%  encoding/json.(*decodeState).literalStore /usr/local/go/src/encoding/json/decode.go

...
(Some time later)

$ go tool pprof ./etcd ./heap.out
File: etcd
Type: inuse_space
Time: Sep 1, 2017 at 2:22pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 280.22MB, 97.81% of 286.50MB total
Dropped 82 nodes (cum <= 1.43MB)
Showing top 10 nodes out of 57
      flat  flat%   sum%        cum   cum%
  251.54MB 87.80% 87.80%   251.54MB 87.80%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb.(*Entry).Unmarshal /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb/raft.pb.go
    9.86MB  3.44% 91.24%     9.86MB  3.44%  runtime.makechan /usr/local/go/src/runtime/chan.go
    6.79MB  2.37% 93.61%     6.79MB  2.37%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*MemoryStorage).Append /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/storage.go
    2.35MB  0.82% 94.43%     2.35MB  0.82%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb.(*Snapshot).Unmarshal /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb/raft.pb.go
    2.31MB  0.81% 95.24%    69.85MB 24.38%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp.(*streamReader).decodeLoop /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp/stream.go
    2.31MB  0.81% 96.04%     2.31MB  0.81%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp.(*streamWriter).run /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp/stream.go
       2MB   0.7% 96.74%        2MB   0.7%  runtime.rawstringtmp /usr/local/go/src/runtime/string.go
    1.56MB  0.54% 97.29%     2.08MB  0.73%  github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc.encode /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/rpc_util.go
    1.50MB  0.52% 97.81%     2.07MB  0.72%  runtime.mapassign /usr/local/go/src/runtime/hashmap.go
         0     0% 97.81%        2MB   0.7%  encoding/json.(*decodeState).literal /usr/local/go/src/encoding/json/decode.go

...

$ go tool pprof ./etcd ./heap.out
File: etcd
Type: inuse_space
Time: Sep 1, 2017 at 2:43pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 315.70MB, 98.72% of 319.79MB total
Dropped 85 nodes (cum <= 1.60MB)
Showing top 10 nodes out of 38
      flat  flat%   sum%        cum   cum%
  286.85MB 89.70% 89.70%   286.85MB 89.70%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb.(*Entry).Unmarshal /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb/raft.pb.go
    9.86MB  3.08% 92.78%     9.86MB  3.08%  runtime.makechan /usr/local/go/src/runtime/chan.go
    6.79MB  2.12% 94.90%     6.79MB  2.12%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*MemoryStorage).Append /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/storage.go
    2.35MB  0.73% 95.64%     2.35MB  0.73%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb.(*Snapshot).Unmarshal /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb/raft.pb.go
    2.31MB  0.72% 96.36%   105.15MB 32.88%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp.(*streamReader).decodeLoop /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp/stream.go
    2.31MB  0.72% 97.09%     2.31MB  0.72%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp.(*streamWriter).run /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp/stream.go
       2MB  0.63% 97.71%        2MB  0.63%  runtime.rawstringtmp /usr/local/go/src/runtime/string.go
    1.73MB  0.54% 98.25%     1.73MB  0.54%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/wal.newFileEncoder /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/wal/encoder.go
    1.50MB  0.47% 98.72%     2.07MB  0.65%  runtime.mapassign /usr/local/go/src/runtime/hashmap.go
         0     0% 98.72%        2MB  0.63%  encoding/json.(*decodeState).literal /usr/local/go/src/encoding/json/decode.go

...
(At this point, the heap dump shows what it looks like a GC, but the resident memory usage was still growing at this point)

$ go tool pprof ./etcd ./heap.out
File: etcd
Type: inuse_space
Time: Sep 1, 2017 at 2:45pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 39120.25kB, 90.48% of 43237.15kB total
Showing top 10 nodes out of 130
      flat  flat%   sum%        cum   cum%
12911.36kB 29.86% 29.86% 12911.36kB 29.86%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb.(*Entry).Unmarshal /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb/raft.pb.go
10092.32kB 23.34% 53.20% 10092.32kB 23.34%  runtime.makechan /usr/local/go/src/runtime/chan.go
 4736.46kB 10.95% 64.16%  4736.46kB 10.95%  bytes.makeSlice /usr/local/go/src/bytes/buffer.go
 2368.55kB  5.48% 69.64% 15279.91kB 35.34%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp.(*streamReader).decodeLoop /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp/stream.go
 2368.55kB  5.48% 75.11%  2368.55kB  5.48%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp.(*streamWriter).run /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp/stream.go
 2048.20kB  4.74% 79.85%  2048.20kB  4.74%  runtime.rawstringtmp /usr/local/go/src/runtime/string.go
 1767.28kB  4.09% 83.94%  1767.28kB  4.09%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/wal.newFileEncoder /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/wal/encoder.go
 1536.24kB  3.55% 87.49%  1536.24kB  3.55%  runtime.mapassign /usr/local/go/src/runtime/hashmap.go
  763.13kB  1.76% 89.26%   763.13kB  1.76%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*MemoryStorage).Append /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/storage.go
  528.17kB  1.22% 90.48%   528.17kB  1.22%  regexp.(*bitState).reset /usr/local/go/src/regexp/backtrack.go

...
(Shortly before the resident memory usage dropped)

$ go tool pprof ./etcd ./heap.out
File: etcd
Type: inuse_space
Time: Sep 1, 2017 at 2:51pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 52.15MB, 91.24% of 57.16MB total
Showing top 10 nodes out of 143
      flat  flat%   sum%        cum   cum%
   25.72MB 45.00% 45.00%    25.72MB 45.00%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb.(*Entry).Unmarshal /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb/raft.pb.go
    9.86MB 17.24% 62.24%     9.86MB 17.24%  runtime.makechan /usr/local/go/src/runtime/chan.go
    4.63MB  8.09% 70.33%     4.63MB  8.09%  bytes.makeSlice /usr/local/go/src/bytes/buffer.go
    2.31MB  4.05% 74.38%    28.03MB 49.04%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp.(*streamReader).decodeLoop /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp/stream.go
    2.31MB  4.05% 78.43%     2.31MB  4.05%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp.(*streamWriter).run /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp/stream.go
       2MB  3.50% 81.92%        2MB  3.50%  runtime.rawstringtmp /usr/local/go/src/runtime/string.go
    1.73MB  3.02% 84.94%     1.73MB  3.02%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/wal.newFileEncoder /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/wal/encoder.go
    1.56MB  2.73% 87.67%     1.56MB  2.73%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb.(*SnapshotResponse).Marshal /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/etcdserverpb/rpc.pb.go
    1.04MB  1.82% 89.49%     2.60MB  4.55%  github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc.encode /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/google.golang.org/grpc/rpc_util.go
       1MB  1.75% 91.24%     1.57MB  2.75%  runtime.mapassign /usr/local/go/src/runtime/hashmap.go

...
(Shortly after the resident memory usage dropped)

$ go tool pprof ./etcd ./heap.out
File: etcd
Type: inuse_space
Time: Sep 1, 2017 at 2:53pm (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 53.50MB, 94.69% of 56.51MB total
Showing top 10 nodes out of 116
      flat  flat%   sum%        cum   cum%
   28.24MB 49.98% 49.98%    28.24MB 49.98%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb.(*Entry).Unmarshal /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raftpb/raft.pb.go
    9.86MB 17.44% 67.42%     9.86MB 17.44%  runtime.makechan /usr/local/go/src/runtime/chan.go
    4.63MB  8.19% 75.60%     4.63MB  8.19%  bytes.makeSlice /usr/local/go/src/bytes/buffer.go
    2.31MB  4.09% 79.70%    30.55MB 54.07%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp.(*streamReader).decodeLoop /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp/stream.go
    2.31MB  4.09% 83.79%     2.31MB  4.09%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp.(*streamWriter).run /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/rafthttp/stream.go
       2MB  3.54% 87.33%        2MB  3.54%  runtime.rawstringtmp /usr/local/go/src/runtime/string.go
    1.73MB  3.05% 90.38%     1.73MB  3.05%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/wal.newFileEncoder /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/wal/encoder.go
       1MB  1.77% 92.15%        1MB  1.77%  runtime.mapassign /usr/local/go/src/runtime/hashmap.go
    0.91MB  1.62% 93.77%     0.91MB  1.62%  github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*MemoryStorage).Append /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/storage.go
    0.52MB  0.91% 94.69%     0.52MB  0.91%  regexp.(*bitState).reset /usr/local/go/src/regexp/backtrack.go

In summary, what I observed was that the heap usage went consistently up (with just one entry being responsible for >80% of the used heap space, if I interpreted the outputs correctly), but when those objects got reclaimed, the resident memory usage took 5-10 minutes to drop, so those events look related although I expected the resident memory usage to drop at the same time. (This delay might have something to do with how/when Go releases memory back to the operating system, I'm not a Go expert).

Is this information useful at all?

@danielfm
Copy link
Author

danielfm commented Sep 1, 2017

I also enabled the GC tracing, this is the log from the last 2h (later I'll try to capture logs around the moment the RSS drops again):

Sep 01 18:07:11 etcd-wrapper[15450]: gc 1 @0.031s 3%: 0.026+1.3+1.2 ms clock, 0.053+0/0/2.3+2.5 ms cpu, 4->4->1 MB, 5 MB goal, 2 P
Sep 01 18:07:11 etcd-wrapper[15450]: gc 2 @0.338s 1%: 0.014+1.3+2.2 ms clock, 0.028+0/1.2/1.3+4.4 ms cpu, 4->4->1 MB, 5 MB goal, 2 P
Sep 01 18:07:11 etcd-wrapper[15450]: gc 3 @0.429s 1%: 0.016+1.4+1.2 ms clock, 0.032+0.006/1.1/1.1+2.5 ms cpu, 4->4->1 MB, 5 MB goal, 2 P
Sep 01 18:07:11 etcd-wrapper[15450]: gc 1 @0.027s 19%: 4.7+2.0+2.3 ms clock, 9.4+0/0/6.5+4.7 ms cpu, 4->4->3 MB, 5 MB goal, 2 P
Sep 01 18:07:11 etcd-wrapper[15450]: gc 2 @0.042s 23%: 3.1+4.7+1.9 ms clock, 6.2+0/0.53/4.3+3.8 ms cpu, 7->7->5 MB, 8 MB goal, 2 P
Sep 01 18:07:11 etcd-wrapper[15450]: gc 3 @0.118s 13%: 0.054+15+2.7 ms clock, 0.10+0.27/5.2/0+5.5 ms cpu, 9->9->5 MB, 11 MB goal, 2 P
Sep 01 18:07:11 etcd-wrapper[15450]: gc 4 @0.164s 13%: 0.016+5.9+3.8 ms clock, 0.033+2.9/0.49/4.1+7.7 ms cpu, 10->11->8 MB, 11 MB goal, 2 P
Sep 01 18:07:11 etcd-wrapper[15450]: gc 5 @0.180s 16%: 0.009+5.6+4.1 ms clock, 0.019+4.7/3.7/0+8.2 ms cpu, 15->16->9 MB, 17 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 6 @0.196s 20%: 0.034+7.8+4.8 ms clock, 0.068+6.4/3.7/0+9.7 ms cpu, 17->18->12 MB, 19 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 7 @0.215s 23%: 0.036+7.6+5.3 ms clock, 0.073+2.7/7.6/0+10 ms cpu, 20->21->15 MB, 23 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 8 @0.236s 24%: 0.46+4.8+3.1 ms clock, 0.92+4.2/4.4/0+6.2 ms cpu, 17->17->7 MB, 19 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 9 @0.250s 25%: 2.1+3.5+3.0 ms clock, 4.3+2.4/0.13/5.6+6.1 ms cpu, 11->12->7 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 10 @0.263s 27%: 0.036+3.5+3.0 ms clock, 0.073+2.5/3.5/0+6.1 ms cpu, 11->12->7 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 11 @0.274s 28%: 0.016+3.5+3.0 ms clock, 0.033+2.5/3.4/0+6.0 ms cpu, 11->12->7 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 12 @0.284s 29%: 0.010+4.9+3.1 ms clock, 0.020+0.99/4.7/0+6.2 ms cpu, 11->12->7 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 13 @0.295s 29%: 0.013+3.3+3.0 ms clock, 0.027+2.3/0/3.2+6.0 ms cpu, 11->12->7 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 14 @0.305s 30%: 0.011+3.4+3.0 ms clock, 0.023+2.4/0.19/3.2+6.0 ms cpu, 11->12->7 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 15 @0.315s 30%: 0.012+3.7+2.9 ms clock, 0.025+2.5/0/3.2+5.9 ms cpu, 11->12->7 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 16 @0.325s 30%: 0.012+3.7+3.0 ms clock, 0.024+2.3/0/3.7+6.0 ms cpu, 11->12->7 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 17 @0.334s 31%: 0.012+3.6+3.0 ms clock, 0.025+2.4/1.8/1.7+6.0 ms cpu, 11->12->7 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 18 @0.344s 31%: 0.012+3.5+2.9 ms clock, 0.025+2.2/0.23/3.1+5.9 ms cpu, 10->12->7 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 19 @0.353s 32%: 0.011+3.7+3.0 ms clock, 0.023+2.2/0/3.7+6.0 ms cpu, 10->12->7 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 20 @0.362s 32%: 0.036+3.9+3.0 ms clock, 0.073+1.9/3.8/0.005+6.0 ms cpu, 10->11->7 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 21 @0.372s 33%: 0.012+3.8+2.9 ms clock, 0.025+2.2/0/3.7+5.9 ms cpu, 10->11->7 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 22 @0.381s 33%: 0.012+4.0+3.6 ms clock, 0.025+2.2/0/4.0+7.3 ms cpu, 10->11->7 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 23 @0.392s 33%: 0.013+3.9+2.9 ms clock, 0.026+2.2/0/3.9+5.9 ms cpu, 10->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 24 @0.401s 34%: 0.013+4.0+3.0 ms clock, 0.027+2.2/0.020/3.9+6.0 ms cpu, 10->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 25 @0.411s 34%: 0.012+4.5+3.0 ms clock, 0.025+2.9/1.3/0.89+6.0 ms cpu, 10->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 26 @0.420s 34%: 0.037+4.4+2.9 ms clock, 0.074+2.6/2.1/0+5.9 ms cpu, 10->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 27 @0.430s 35%: 0.012+3.9+2.9 ms clock, 0.025+2.1/0.009/3.9+5.9 ms cpu, 9->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 28 @0.439s 35%: 0.011+4.0+2.9 ms clock, 0.023+2.0/0/3.9+5.9 ms cpu, 9->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 29 @0.449s 35%: 0.012+4.2+2.9 ms clock, 0.024+2.3/1.4/2.4+5.9 ms cpu, 10->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 30 @0.458s 36%: 0.037+4.0+3.0 ms clock, 0.074+2.0/3.3/0.44+6.0 ms cpu, 9->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 31 @0.468s 36%: 0.011+4.0+2.9 ms clock, 0.023+2.0/0.32/3.6+5.9 ms cpu, 9->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 32 @0.477s 36%: 0.012+4.0+2.9 ms clock, 0.025+2.1/1.7/2.3+5.9 ms cpu, 9->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 33 @0.486s 36%: 0.012+4.0+2.9 ms clock, 0.024+2.1/0/3.9+5.9 ms cpu, 9->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 34 @0.495s 36%: 0.012+4.0+3.0 ms clock, 0.024+2.0/0/3.9+6.0 ms cpu, 9->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 35 @0.504s 37%: 0.012+4.0+3.0 ms clock, 0.025+2.1/0.082/3.9+6.0 ms cpu, 9->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 36 @0.514s 37%: 0.013+4.1+2.9 ms clock, 0.026+2.1/1.7/2.3+5.9 ms cpu, 9->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 37 @0.523s 37%: 0.013+4.2+2.9 ms clock, 0.026+2.1/0/4.1+5.9 ms cpu, 10->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 38 @0.533s 37%: 0.12+4.5+2.9 ms clock, 0.25+2.5/4.3/0.001+5.9 ms cpu, 9->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 39 @0.542s 38%: 0.011+4.1+2.9 ms clock, 0.023+1.8/0.82/3.4+5.9 ms cpu, 9->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 40 @0.551s 38%: 0.012+4.1+3.0 ms clock, 0.025+2.1/0/4.0+6.0 ms cpu, 9->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 41 @0.560s 38%: 0.012+4.1+3.2 ms clock, 0.024+2.0/2.0/2.0+6.4 ms cpu, 9->11->8 MB, 13 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 42 @0.570s 38%: 0.012+4.4+3.1 ms clock, 0.025+2.0/0.21/4.2+6.2 ms cpu, 11->13->10 MB, 15 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 43 @0.584s 38%: 0.010+5.4+3.2 ms clock, 0.021+1.5/4.8/0+6.5 ms cpu, 13->16->12 MB, 18 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 44 @0.597s 38%: 0.013+5.0+3.5 ms clock, 0.027+1.7/2.2/2.7+7.0 ms cpu, 17->20->15 MB, 22 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 45 @0.611s 38%: 0.014+5.7+3.5 ms clock, 0.028+1.4/0.32/5.2+7.0 ms cpu, 22->24->18 MB, 27 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 46 @0.630s 38%: 0.015+6.0+4.8 ms clock, 0.030+1.9/4.9/0.99+9.6 ms cpu, 28->31->23 MB, 34 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 47 @0.653s 38%: 0.017+6.2+3.8 ms clock, 0.035+1.9/0.55/5.6+7.6 ms cpu, 37->39->28 MB, 43 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 48 @0.679s 37%: 0.021+7.1+4.0 ms clock, 0.042+1.6/2.0/5.0+8.0 ms cpu, 47->50->36 MB, 54 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 49 @0.711s 36%: 0.022+7.2+4.4 ms clock, 0.044+1.8/5.0/2.4+8.9 ms cpu, 62->65->46 MB, 69 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 50 @0.758s 36%: 0.11+9.5+8.1 ms clock, 0.23+5.1/9.4/0+16 ms cpu, 81->84->55 MB, 88 MB goal, 2 P
Sep 01 18:07:12 etcd-wrapper[15450]: gc 51 @0.827s 35%: 0.057+19+12 ms clock, 0.11+11/12/0+24 ms cpu, 102->107->69 MB, 110 MB goal, 2 P
Sep 01 18:07:22 etcd-wrapper[15450]: gc 52 @10.248s 3%: 0.043+14+13 ms clock, 0.086+1.2/10/15+27 ms cpu, 116->117->66 MB, 128 MB goal, 2 P
Sep 01 18:07:43 etcd-wrapper[15450]: gc 53 @31.995s 1%: 0.046+30+14 ms clock, 0.092+4.0/15/10+28 ms cpu, 126->128->68 MB, 132 MB goal, 2 P
Sep 01 18:07:44 etcd-wrapper[15450]: gc 54 @32.561s 1%: 0.045+30+14 ms clock, 0.090+5.3/16/6.4+29 ms cpu, 130->131->68 MB, 134 MB goal, 2 P
Sep 01 18:08:18 etcd-wrapper[15450]: gc 55 @66.385s 0%: 0.14+13+13 ms clock, 0.29+0.39/10/13+27 ms cpu, 131->131->67 MB, 134 MB goal, 2 P
Sep 01 18:08:53 etcd-wrapper[15450]: gc 56 @101.883s 0%: 0.017+25+14 ms clock, 0.034+9.7/12/6.0+29 ms cpu, 132->133->70 MB, 135 MB goal, 2 P
Sep 01 18:08:54 etcd-wrapper[15450]: gc 57 @102.455s 0%: 0.032+28+14 ms clock, 0.064+5.2/14/10+28 ms cpu, 134->136->70 MB, 138 MB goal, 2 P
Sep 01 18:09:21 etcd-wrapper[15450]: gc 58 @129.507s 0%: 0.019+13+14 ms clock, 0.039+0.62/11/13+28 ms cpu, 134->134->69 MB, 138 MB goal, 2 P
Sep 01 18:09:41 etcd-wrapper[15450]: scvg0: inuse: 108, idle: 57, sys: 166, released: 0, consumed: 166 (MB)
Sep 01 18:10:03 etcd-wrapper[15450]: gc 59 @171.924s 0%: 0.061+24+14 ms clock, 0.12+5.4/14/7.2+29 ms cpu, 135->137->72 MB, 139 MB goal, 2 P
Sep 01 18:10:04 etcd-wrapper[15450]: gc 60 @172.505s 0%: 2.3+27+14 ms clock, 4.7+2.9/17/6.0+28 ms cpu, 138->140->72 MB, 141 MB goal, 2 P
Sep 01 18:10:33 etcd-wrapper[15450]: gc 61 @201.557s 0%: 0.021+13+14 ms clock, 0.042+0.28/7.1/17+29 ms cpu, 138->138->71 MB, 142 MB goal, 2 P
Sep 01 18:11:12 etcd-wrapper[15450]: gc 62 @240.968s 0%: 0.021+26+15 ms clock, 0.043+5.6/13/11+30 ms cpu, 139->140->74 MB, 143 MB goal, 2 P
Sep 01 18:11:13 etcd-wrapper[15450]: gc 63 @241.568s 0%: 0.059+24+14 ms clock, 0.11+4.3/13/15+28 ms cpu, 142->143->73 MB, 145 MB goal, 2 P
Sep 01 18:11:58 etcd-wrapper[15450]: gc 64 @286.641s 0%: 0.046+13+13 ms clock, 0.093+0.59/13/12+27 ms cpu, 142->142->73 MB, 145 MB goal, 2 P
Sep 01 18:12:11 etcd-wrapper[15450]: scvg1: inuse: 100, idle: 69, sys: 170, released: 0, consumed: 170 (MB)
Sep 01 18:12:22 etcd-wrapper[15450]: gc 65 @311.087s 0%: 0.044+24+14 ms clock, 0.088+11/12/1.0+29 ms cpu, 144->145->76 MB, 147 MB goal, 2 P
Sep 01 18:12:23 etcd-wrapper[15450]: gc 66 @311.704s 0%: 0.045+30+14 ms clock, 0.091+7.6/17/10+29 ms cpu, 146->147->76 MB, 149 MB goal, 2 P
Sep 01 18:13:18 etcd-wrapper[15450]: gc 67 @366.841s 0%: 0.027+17+14 ms clock, 0.055+0/13/13+28 ms cpu, 146->146->76 MB, 149 MB goal, 2 P
Sep 01 18:13:33 etcd-wrapper[15450]: gc 68 @381.222s 0%: 0.25+29+14 ms clock, 0.51+5.9/20/3.6+29 ms cpu, 148->150->78 MB, 152 MB goal, 2 P
Sep 01 18:13:42 etcd-wrapper[15450]: gc 69 @390.609s 0%: 0.53+16+14 ms clock, 1.0+0.51/5.6/15+28 ms cpu, 149->149->77 MB, 153 MB goal, 2 P
Sep 01 18:14:41 etcd-wrapper[15450]: scvg2: inuse: 151, idle: 21, sys: 173, released: 0, consumed: 173 (MB)
Sep 01 18:14:42 etcd-wrapper[15450]: gc 70 @450.744s 0%: 0.062+22+15 ms clock, 0.12+4.7/10/11+30 ms cpu, 150->151->79 MB, 154 MB goal, 2 P
Sep 01 18:14:43 etcd-wrapper[15450]: gc 71 @451.380s 0%: 2.3+25+15 ms clock, 4.6+6.0/15/9.3+30 ms cpu, 153->155->80 MB, 157 MB goal, 2 P
Sep 01 18:15:17 etcd-wrapper[15450]: gc 72 @485.685s 0%: 0.13+13+14 ms clock, 0.26+0.59/12/11+28 ms cpu, 154->154->79 MB, 158 MB goal, 2 P
Sep 01 18:15:52 etcd-wrapper[15450]: gc 73 @521.048s 0%: 0.045+26+15 ms clock, 0.091+6.2/18/4.6+30 ms cpu, 155->156->82 MB, 159 MB goal, 2 P
Sep 01 18:15:53 etcd-wrapper[15450]: gc 74 @521.715s 0%: 0.016+36+14 ms clock, 0.032+0.56/17/9.4+29 ms cpu, 157->159->83 MB, 161 MB goal, 2 P
Sep 01 18:17:02 etcd-wrapper[15450]: gc 75 @590.767s 0%: 0.059+23+15 ms clock, 0.11+9.3/11/12+31 ms cpu, 157->158->83 MB, 161 MB goal, 2 P
Sep 01 18:17:03 etcd-wrapper[15450]: gc 76 @591.439s 0%: 0.043+28+14 ms clock, 0.087+3.0/16/7.3+29 ms cpu, 161->162->84 MB, 165 MB goal, 2 P
Sep 01 18:17:11 etcd-wrapper[15450]: scvg3: inuse: 140, idle: 50, sys: 190, released: 0, consumed: 190 (MB)
Sep 01 18:17:36 etcd-wrapper[15450]: gc 77 @624.847s 0%: 0.019+13+14 ms clock, 0.039+1.2/11/13+29 ms cpu, 161->161->83 MB, 165 MB goal, 2 P
Sep 01 18:18:13 etcd-wrapper[15450]: gc 78 @662.031s 0%: 0.047+25+15 ms clock, 0.095+9.4/20/4.1+30 ms cpu, 162->164->86 MB, 166 MB goal, 2 P
Sep 01 18:18:14 etcd-wrapper[15450]: gc 79 @662.713s 0%: 0.043+25+15 ms clock, 0.087+4.8/15/5.8+30 ms cpu, 165->166->85 MB, 169 MB goal, 2 P
Sep 01 18:19:20 etcd-wrapper[15450]: gc 80 @728.539s 0%: 0.021+15+14 ms clock, 0.043+0.55/13/16+29 ms cpu, 165->165->86 MB, 169 MB goal, 2 P
Sep 01 18:19:22 etcd-wrapper[15450]: gc 81 @730.345s 0%: 0.052+28+15 ms clock, 0.10+3.2/13/12+30 ms cpu, 168->169->87 MB, 172 MB goal, 2 P
Sep 01 18:19:41 etcd-wrapper[15450]: scvg4: inuse: 155, idle: 40, sys: 196, released: 0, consumed: 196 (MB)
Sep 01 18:20:01 etcd-wrapper[15450]: gc 82 @769.443s 0%: 0.045+14+15 ms clock, 0.090+1.0/13/12+31 ms cpu, 168->168->87 MB, 173 MB goal, 2 P
Sep 01 18:20:32 etcd-wrapper[15450]: gc 83 @800.152s 0%: 0.017+26+15 ms clock, 0.034+5.5/14/11+31 ms cpu, 170->171->89 MB, 174 MB goal, 2 P
Sep 01 18:20:47 etcd-wrapper[15450]: gc 84 @815.781s 0%: 0.020+14+14 ms clock, 0.041+0.27/8.5/16+28 ms cpu, 172->172->88 MB, 177 MB goal, 2 P
Sep 01 18:21:40 etcd-wrapper[15450]: gc 85 @869.000s 0%: 0.045+26+15 ms clock, 0.091+12/13/7.0+31 ms cpu, 173->174->91 MB, 177 MB goal, 2 P
Sep 01 18:21:41 etcd-wrapper[15450]: gc 86 @869.741s 0%: 0.17+23+15 ms clock, 0.35+1.0/12/14+30 ms cpu, 176->177->91 MB, 181 MB goal, 2 P
Sep 01 18:22:11 etcd-wrapper[15450]: scvg5: inuse: 133, idle: 72, sys: 205, released: 0, consumed: 205 (MB)
Sep 01 18:22:49 etcd-wrapper[15450]: gc 87 @937.372s 0%: 0.026+13+15 ms clock, 0.052+0.33/13/13+30 ms cpu, 176->176->92 MB, 180 MB goal, 2 P
Sep 01 18:22:54 etcd-wrapper[15450]: gc 88 @942.409s 0%: 0.069+32+15 ms clock, 0.13+5.1/16/8.3+31 ms cpu, 180->182->94 MB, 185 MB goal, 2 P
Sep 01 18:23:38 etcd-wrapper[15450]: gc 89 @986.638s 0%: 0.14+14+15 ms clock, 0.29+0.33/13/14+31 ms cpu, 181->181->93 MB, 186 MB goal, 2 P
Sep 01 18:24:04 etcd-wrapper[15450]: gc 90 @1012.237s 0%: 0.057+25+15 ms clock, 0.11+3.5/13/13+31 ms cpu, 183->184->95 MB, 187 MB goal, 2 P
Sep 01 18:24:25 etcd-wrapper[15450]: gc 91 @1033.578s 0%: 0.021+13+15 ms clock, 0.042+0.21/13/13+30 ms cpu, 185->185->95 MB, 189 MB goal, 2 P
Sep 01 18:24:41 etcd-wrapper[15450]: scvg6: inuse: 125, idle: 85, sys: 211, released: 0, consumed: 211 (MB)
Sep 01 18:25:13 etcd-wrapper[15450]: gc 92 @1082.055s 0%: 0.76+27+16 ms clock, 1.5+7.7/14/10+32 ms cpu, 186->187->98 MB, 190 MB goal, 2 P
Sep 01 18:25:18 etcd-wrapper[15450]: gc 93 @1086.930s 0%: 0.60+14+15 ms clock, 1.2+0.76/10/16+31 ms cpu, 189->189->97 MB, 194 MB goal, 2 P
Sep 01 18:26:23 etcd-wrapper[15450]: gc 94 @1151.934s 0%: 0.017+24+16 ms clock, 0.034+3.3/14/13+33 ms cpu, 189->190->99 MB, 193 MB goal, 2 P
Sep 01 18:26:24 etcd-wrapper[15450]: gc 95 @1152.727s 0%: 0.015+36+16 ms clock, 0.030+1.3/18/11+32 ms cpu, 193->195->101 MB, 197 MB goal, 2 P
Sep 01 18:27:11 etcd-wrapper[15450]: scvg7: inuse: 158, idle: 68, sys: 226, released: 0, consumed: 226 (MB)
Sep 01 18:27:33 etcd-wrapper[15450]: gc 96 @1221.904s 0%: 2.1+26+16 ms clock, 4.2+3.2/15/9.8+32 ms cpu, 193->194->102 MB, 198 MB goal, 2 P
Sep 01 18:27:34 etcd-wrapper[15450]: gc 97 @1222.719s 0%: 0.033+38+16 ms clock, 0.066+10/25/7.1+33 ms cpu, 196->199->103 MB, 201 MB goal, 2 P
Sep 01 18:28:43 etcd-wrapper[15450]: gc 98 @1291.789s 0%: 0.69+24+16 ms clock, 1.3+11/13/4.2+33 ms cpu, 197->198->103 MB, 202 MB goal, 2 P
Sep 01 18:28:44 etcd-wrapper[15450]: gc 99 @1292.619s 0%: 0.043+32+16 ms clock, 0.086+2.0/16/9.8+32 ms cpu, 200->202->104 MB, 205 MB goal, 2 P
Sep 01 18:29:41 etcd-wrapper[15450]: scvg8: inuse: 184, idle: 53, sys: 237, released: 0, consumed: 237 (MB)
Sep 01 18:29:53 etcd-wrapper[15450]: gc 100 @1361.783s 0%: 2.8+24+16 ms clock, 5.7+4.0/15/9.3+33 ms cpu, 200->201->105 MB, 205 MB goal, 2 P
Sep 01 18:29:54 etcd-wrapper[15450]: gc 101 @1362.628s 0%: 1.5+39+16 ms clock, 3.1+5.7/24/8.2+32 ms cpu, 204->206->107 MB, 209 MB goal, 2 P
Sep 01 18:31:03 etcd-wrapper[15450]: gc 102 @1431.878s 0%: 0.045+26+16 ms clock, 0.090+9.2/15/6.8+33 ms cpu, 204->205->107 MB, 209 MB goal, 2 P
Sep 01 18:31:05 etcd-wrapper[15450]: gc 103 @1434.009s 0%: 0.14+20+34 ms clock, 0.29+0.63/14/21+69 ms cpu, 208->208->106 MB, 213 MB goal, 2 P
Sep 01 18:32:11 etcd-wrapper[15450]: scvg9: inuse: 184, idle: 59, sys: 243, released: 0, consumed: 243 (MB)
Sep 01 18:32:13 etcd-wrapper[15450]: gc 104 @1501.966s 0%: 0.050+28+17 ms clock, 0.10+11/16/3.5+35 ms cpu, 207->209->110 MB, 213 MB goal, 2 P
Sep 01 18:32:22 etcd-wrapper[15450]: gc 105 @1510.189s 0%: 0.019+15+16 ms clock, 0.038+0.40/12/16+32 ms cpu, 212->212->109 MB, 218 MB goal, 2 P
Sep 01 18:33:23 etcd-wrapper[15450]: gc 106 @1571.964s 0%: 0.024+29+17 ms clock, 0.048+12/17/5.6+34 ms cpu, 212->214->112 MB, 218 MB goal, 2 P
Sep 01 18:33:37 etcd-wrapper[15450]: gc 107 @1585.313s 0%: 0.12+16+16 ms clock, 0.24+0.42/16/12+33 ms cpu, 216->216->111 MB, 222 MB goal, 2 P
Sep 01 18:34:33 etcd-wrapper[15450]: gc 108 @1642.093s 0%: 0.15+29+17 ms clock, 0.30+14/14/9.8+35 ms cpu, 216->218->114 MB, 222 MB goal, 2 P
Sep 01 18:34:41 etcd-wrapper[15450]: scvg10: inuse: 212, idle: 32, sys: 245, released: 0, consumed: 245 (MB)
Sep 01 18:35:04 etcd-wrapper[15450]: gc 109 @1672.782s 0%: 0.039+17+17 ms clock, 0.079+0.20/14/11+34 ms cpu, 220->220->113 MB, 225 MB goal, 2 P
Sep 01 18:35:44 etcd-wrapper[15450]: gc 110 @1712.365s 0%: 0.017+25+17 ms clock, 0.035+0.33/14/20+35 ms cpu, 221->222->115 MB, 227 MB goal, 2 P
Sep 01 18:36:43 etcd-wrapper[15450]: gc 111 @1771.215s 0%: 0.020+20+17 ms clock, 0.041+0.19/13/18+35 ms cpu, 223->223->116 MB, 229 MB goal, 2 P
Sep 01 18:36:54 etcd-wrapper[15450]: gc 112 @1782.571s 0%: 0.023+36+17 ms clock, 0.047+4.4/23/11+34 ms cpu, 226->229->119 MB, 232 MB goal, 2 P
Sep 01 18:37:11 etcd-wrapper[15450]: scvg11: inuse: 161, idle: 93, sys: 255, released: 0, consumed: 255 (MB)
Sep 01 18:38:03 etcd-wrapper[15450]: gc 113 @1851.910s 0%: 0.045+29+18 ms clock, 0.090+13/14/9.2+36 ms cpu, 227->229->120 MB, 233 MB goal, 2 P
Sep 01 18:38:12 etcd-wrapper[15450]: gc 114 @1860.498s 0%: 0.10+16+17 ms clock, 0.20+0.82/10/17+35 ms cpu, 231->231->118 MB, 237 MB goal, 2 P
Sep 01 18:39:13 etcd-wrapper[15450]: gc 115 @1922.054s 0%: 1.8+27+17 ms clock, 3.6+4.3/15/14+35 ms cpu, 231->232->122 MB, 237 MB goal, 2 P
Sep 01 18:39:41 etcd-wrapper[15450]: gc 116 @1950.029s 0%: 0.018+15+17 ms clock, 0.037+0.22/13/15+34 ms cpu, 235->235->121 MB, 241 MB goal, 2 P
Sep 01 18:39:41 etcd-wrapper[15450]: scvg12: inuse: 138, idle: 127, sys: 265, released: 0, consumed: 265 (MB)
Sep 01 18:40:24 etcd-wrapper[15450]: gc 117 @1992.243s 0%: 2.4+31+17 ms clock, 4.9+6.4/18/5.0+34 ms cpu, 236->238->124 MB, 242 MB goal, 2 P
Sep 01 18:41:24 etcd-wrapper[15450]: gc 118 @2052.422s 0%: 0.022+19+17 ms clock, 0.044+0.27/12/16+34 ms cpu, 239->239->124 MB, 245 MB goal, 2 P
Sep 01 18:41:33 etcd-wrapper[15450]: gc 119 @2061.614s 0%: 0.72+41+17 ms clock, 1.4+4.2/20/9.3+34 ms cpu, 241->244->127 MB, 247 MB goal, 2 P
Sep 01 18:42:11 etcd-wrapper[15450]: scvg13: inuse: 175, idle: 100, sys: 276, released: 0, consumed: 276 (MB)
Sep 01 18:42:42 etcd-wrapper[15450]: gc 120 @2131.000s 0%: 0.046+29+17 ms clock, 0.092+5.6/18/9.9+35 ms cpu, 242->244->127 MB, 249 MB goal, 2 P
Sep 01 18:43:14 etcd-wrapper[15450]: gc 121 @2162.543s 0%: 0.019+17+17 ms clock, 0.038+0.35/3.9/22+34 ms cpu, 246->246->127 MB, 252 MB goal, 2 P
Sep 01 18:43:52 etcd-wrapper[15450]: gc 122 @2200.348s 0%: 0.045+27+18 ms clock, 0.091+9.2/16/7.3+36 ms cpu, 247->249->129 MB, 254 MB goal, 2 P
Sep 01 18:44:41 etcd-wrapper[15450]: scvg14: inuse: 236, idle: 39, sys: 276, released: 0, consumed: 276 (MB)
Sep 01 18:44:59 etcd-wrapper[15450]: gc 123 @2267.716s 0%: 0.10+15+18 ms clock, 0.21+0.90/15/14+36 ms cpu, 250->250->130 MB, 256 MB goal, 2 P
Sep 01 18:45:04 etcd-wrapper[15450]: gc 124 @2272.683s 0%: 0.061+20+38 ms clock, 0.12+1.8/10/27+76 ms cpu, 254->254->130 MB, 261 MB goal, 2 P
Sep 01 18:46:13 etcd-wrapper[15450]: gc 125 @2341.175s 0%: 0.034+33+18 ms clock, 0.069+6.3/17/14+36 ms cpu, 254->256->134 MB, 261 MB goal, 2 P
Sep 01 18:47:11 etcd-wrapper[15450]: scvg15: inuse: 261, idle: 24, sys: 286, released: 0, consumed: 286 (MB)
Sep 01 18:47:15 etcd-wrapper[15450]: gc 126 @2403.211s 0%: 0.097+17+18 ms clock, 0.19+1.7/10/18+36 ms cpu, 259->259->134 MB, 265 MB goal, 2 P
Sep 01 18:47:23 etcd-wrapper[15450]: gc 127 @2411.702s 0%: 0.081+32+18 ms clock, 0.16+5.4/16/10+36 ms cpu, 262->263->136 MB, 268 MB goal, 2 P
Sep 01 18:48:32 etcd-wrapper[15450]: gc 128 @2481.056s 0%: 0.019+27+19 ms clock, 0.038+9.8/15/12+38 ms cpu, 262->264->137 MB, 269 MB goal, 2 P
Sep 01 18:49:21 etcd-wrapper[15450]: gc 129 @2529.931s 0%: 0.034+18+19 ms clock, 0.069+0.53/3.4/23+38 ms cpu, 266->266->138 MB, 273 MB goal, 2 P
Sep 01 18:49:41 etcd-wrapper[15450]: scvg16: inuse: 172, idle: 120, sys: 293, released: 0, consumed: 293 (MB)
Sep 01 18:49:43 etcd-wrapper[15450]: gc 130 @2551.600s 0%: 0.020+43+18 ms clock, 0.040+7.0/20/9.9+36 ms cpu, 268->271->141 MB, 275 MB goal, 2 P
Sep 01 18:50:51 etcd-wrapper[15450]: gc 131 @2620.082s 0%: 0.036+33+19 ms clock, 0.073+10/19/9.1+38 ms cpu, 270->272->142 MB, 277 MB goal, 2 P
Sep 01 18:51:53 etcd-wrapper[15450]: gc 132 @2681.635s 0%: 0.17+15+18 ms clock, 0.34+0.38/14/16+36 ms cpu, 274->274->142 MB, 281 MB goal, 2 P
Sep 01 18:52:11 etcd-wrapper[15450]: gc 133 @2700.094s 0%: 0.019+15+18 ms clock, 0.039+0.26/8.8/21+36 ms cpu, 277->277->142 MB, 284 MB goal, 2 P
Sep 01 18:52:12 etcd-wrapper[15450]: scvg17: inuse: 160, idle: 140, sys: 301, released: 0, consumed: 301 (MB)
Sep 01 18:53:13 etcd-wrapper[15450]: gc 134 @2761.259s 0%: 0.018+35+18 ms clock, 0.037+14/22/5.5+37 ms cpu, 278->280->146 MB, 285 MB goal, 2 P
Sep 01 18:54:16 etcd-wrapper[15450]: gc 135 @2825.082s 0%: 0.043+17+19 ms clock, 0.086+1.4/17/14+38 ms cpu, 281->282->146 MB, 289 MB goal, 2 P
Sep 01 18:54:32 etcd-wrapper[15450]: gc 136 @2840.793s 0%: 0.017+18+18 ms clock, 0.035+2.0/12/17+37 ms cpu, 285->285->146 MB, 292 MB goal, 2 P
Sep 01 18:54:42 etcd-wrapper[15450]: scvg18: inuse: 172, idle: 137, sys: 309, released: 0, consumed: 309 (MB)
Sep 01 18:55:29 etcd-wrapper[15450]: gc 137 @2897.340s 0%: 0.055+38+20 ms clock, 0.11+12/24/4.0+40 ms cpu, 286->288->150 MB, 293 MB goal, 2 P
Sep 01 18:56:33 etcd-wrapper[15450]: gc 138 @2962.005s 0%: 0.023+30+19 ms clock, 0.046+3.2/15/14+39 ms cpu, 289->290->151 MB, 296 MB goal, 2 P
Sep 01 18:57:12 etcd-wrapper[15450]: scvg19: inuse: 279, idle: 29, sys: 309, released: 0, consumed: 309 (MB)
Sep 01 18:57:26 etcd-wrapper[15450]: gc 139 @3014.747s 0%: 0.021+17+18 ms clock, 0.043+1.1/7.9/22+37 ms cpu, 292->292->151 MB, 300 MB goal, 2 P
Sep 01 18:57:47 etcd-wrapper[15450]: gc 140 @3035.194s 0%: 0.019+18+18 ms clock, 0.038+0.46/4.8/26+37 ms cpu, 295->295->152 MB, 302 MB goal, 2 P
Sep 01 18:58:54 etcd-wrapper[15450]: gc 141 @3102.279s 0%: 0.046+36+19 ms clock, 0.092+5.0/23/8.5+38 ms cpu, 296->298->155 MB, 304 MB goal, 2 P
Sep 01 18:59:42 etcd-wrapper[15450]: scvg20: inuse: 269, idle: 52, sys: 322, released: 0, consumed: 322 (MB)
Sep 01 19:00:03 etcd-wrapper[15450]: gc 142 @3171.941s 0%: 0.049+33+20 ms clock, 0.099+8.1/16/12+41 ms cpu, 300->301->158 MB, 307 MB goal, 2 P
Sep 01 19:00:44 etcd-wrapper[15450]: gc 143 @3212.367s 0%: 0.031+19+19 ms clock, 0.063+0.21/15/15+38 ms cpu, 305->305->157 MB, 313 MB goal, 2 P
Sep 01 19:01:15 etcd-wrapper[15450]: gc 144 @3243.351s 0%: 0.040+36+42 ms clock, 0.081+1.1/22/32+84 ms cpu, 307->307->158 MB, 315 MB goal, 2 P
Sep 01 19:02:12 etcd-wrapper[15450]: scvg21: inuse: 226, idle: 100, sys: 327, released: 0, consumed: 327 (MB)
Sep 01 19:02:24 etcd-wrapper[15450]: gc 145 @3312.296s 0%: 0.024+36+20 ms clock, 0.049+5.4/19/10+41 ms cpu, 309->311->162 MB, 317 MB goal, 2 P
Sep 01 19:03:33 etcd-wrapper[15450]: gc 146 @3381.948s 0%: 0.017+37+21 ms clock, 0.035+5.9/22/9.7+42 ms cpu, 313->315->164 MB, 321 MB goal, 2 P
Sep 01 19:04:20 etcd-wrapper[15450]: gc 147 @3428.538s 0%: 0.15+17+20 ms clock, 0.31+0.37/17/15+41 ms cpu, 317->317->163 MB, 325 MB goal, 2 P
Sep 01 19:04:42 etcd-wrapper[15450]: scvg22: inuse: 199, idle: 132, sys: 331, released: 0, consumed: 331 (MB)
Sep 01 19:04:54 etcd-wrapper[15450]: gc 148 @3462.395s 0%: 0.020+18+19 ms clock, 0.041+0.36/17/13+39 ms cpu, 319->319->164 MB, 327 MB goal, 2 P
Sep 01 19:05:54 etcd-wrapper[15450]: gc 149 @3522.423s 0%: 0.021+27+20 ms clock, 0.043+7.8/13/15+41 ms cpu, 321->322->167 MB, 329 MB goal, 2 P
Sep 01 19:07:04 etcd-wrapper[15450]: gc 150 @3592.173s 0%: 0.019+31+21 ms clock, 0.039+0.45/16/18+43 ms cpu, 325->326->170 MB, 333 MB goal, 2 P
Sep 01 19:07:12 etcd-wrapper[15450]: scvg23: inuse: 260, idle: 77, sys: 337, released: 0, consumed: 337 (MB)
Sep 01 19:08:12 etcd-wrapper[15450]: gc 151 @3660.900s 0%: 0.070+39+21 ms clock, 0.14+13/28/6.8+43 ms cpu, 329->331->172 MB, 337 MB goal, 2 P
Sep 01 19:08:57 etcd-wrapper[15450]: gc 152 @3705.887s 0%: 0.088+18+20 ms clock, 0.17+0.44/10/23+41 ms cpu, 332->332->171 MB, 341 MB goal, 2 P
Sep 01 19:09:33 etcd-wrapper[15450]: gc 153 @3741.403s 0%: 0.020+19+20 ms clock, 0.041+0.42/17/12+40 ms cpu, 334->334->172 MB, 343 MB goal, 2 P
Sep 01 19:09:42 etcd-wrapper[15450]: scvg24: inuse: 198, idle: 154, sys: 352, released: 0, consumed: 352 (MB)
Sep 01 19:10:33 etcd-wrapper[15450]: gc 154 @3801.531s 0%: 0.061+32+21 ms clock, 0.12+6.1/20/10+42 ms cpu, 336->338->176 MB, 345 MB goal, 2 P
Sep 01 19:11:43 etcd-wrapper[15450]: gc 155 @3871.324s 0%: 0.16+28+21 ms clock, 0.32+6.2/15/20+43 ms cpu, 340->341->177 MB, 349 MB goal, 2 P
Sep 01 19:12:12 etcd-wrapper[15450]: scvg25: inuse: 269, idle: 82, sys: 352, released: 0, consumed: 352 (MB)

@heyitsanthony
Copy link
Contributor

@danielfm what does go tool pprof -pdf ./etcd ./heap.out give? does this cluster have any v2 data?

@danielfm
Copy link
Author

danielfm commented Sep 1, 2017

@heyitsanthony here is the PDF.

I don't think this cluster has any v2 data, but is there a way I can check? Some command perhaps?

Edit: As far as I could check, this cluster was first created with etcd 3.1.3, with --storage-backend=etcd3 flag passed to the apiserver, so my guess is that this cluster does not have any v2 data.

@danielfm
Copy link
Author

danielfm commented Sep 1, 2017

More information:

I have a systemd unit that regularly takes a snapshot of the etcd data and uploads it to S3 every 60s or so. I tried disabling it for a moment, and the GC finally were able to reclaim some memory back to the OS, although not too much (~60MB):

Sep 01 19:14:42 etcd-wrapper[15450]: scvg26: inuse: 352, idle: 7, sys: 360, released: 0, consumed: 360 (MB)
Sep 01 19:14:47 etcd-wrapper[15450]: gc 158 @4055.793s 0%: 0.021+20+21 ms clock, 0.043+0.44/10/21+42 ms cpu, 352->352->181 MB, 361 MB goal, 2 P
Sep 01 19:15:36 etcd-wrapper[15450]: gc 159 @4104.985s 0%: 0.021+23+21 ms clock, 0.043+0.37/12/22+42 ms cpu, 353->354->182 MB, 363 MB goal, 2 P
Sep 01 19:16:29 etcd-wrapper[15450]: gc 160 @4157.539s 0%: 0.025+19+21 ms clock, 0.050+0.64/13/20+42 ms cpu, 356->356->184 MB, 365 MB goal, 2 P
Sep 01 19:17:12 etcd-wrapper[15450]: scvg27: inuse: 238, idle: 134, sys: 372, released: 0, consumed: 372 (MB)
Sep 01 19:17:18 etcd-wrapper[15450]: 2017-09-01 19:17:18.073462 I | mvcc: store.index: compact 40565276
Sep 01 19:17:18 etcd-wrapper[15450]: 2017-09-01 19:17:18.080707 I | mvcc: finished scheduled compaction at 40565276 (took 5.269415ms)
Sep 01 19:17:31 etcd-wrapper[15450]: gc 161 @4219.597s 0%: 0.79+42+23 ms clock, 1.5+2.8/21/13+46 ms cpu, 359->362->189 MB, 368 MB goal, 2 P
Sep 01 19:18:41 etcd-wrapper[15450]: gc 162 @4289.451s 0%: 0.029+44+22 ms clock, 0.058+10/24/17+44 ms cpu, 365->367->191 MB, 374 MB goal, 2 P
Sep 01 19:19:42 etcd-wrapper[15450]: scvg28: inuse: 291, idle: 84, sys: 376, released: 0, consumed: 376 (MB)
Sep 01 19:19:51 etcd-wrapper[15450]: gc 163 @4359.371s 0%: 0.049+32+22 ms clock, 0.098+3.8/20/12+44 ms cpu, 368->370->192 MB, 378 MB goal, 2 P
Sep 01 19:21:01 etcd-wrapper[15450]: gc 164 @4429.243s 0%: 0.15+35+23 ms clock, 0.30+7.9/21/9.5+46 ms cpu, 372->374->194 MB, 382 MB goal, 2 P
Sep 01 19:22:11 etcd-wrapper[15450]: gc 165 @4499.251s 0%: 0.022+42+23 ms clock, 0.044+7.4/21/9.9+47 ms cpu, 376->378->197 MB, 386 MB goal, 2 P
Sep 01 19:22:12 etcd-wrapper[15450]: scvg29: inuse: 279, idle: 105, sys: 384, released: 0, consumed: 384 (MB)
Sep 01 19:22:18 etcd-wrapper[15450]: 2017-09-01 19:22:18.084381 I | mvcc: store.index: compact 40567301
Sep 01 19:22:18 etcd-wrapper[15450]: 2017-09-01 19:22:18.091577 I | mvcc: finished scheduled compaction at 40567301 (took 5.227856ms)
Sep 01 19:23:18 etcd-wrapper[15450]: gc 166 @4566.828s 0%: 1.0+39+23 ms clock, 2.1+2.7/24/15+46 ms cpu, 380->382->199 MB, 390 MB goal, 2 P
Sep 01 19:24:28 etcd-wrapper[15450]: gc 167 @4636.908s 0%: 2.2+43+23 ms clock, 4.4+6.9/29/10+46 ms cpu, 384->387->201 MB, 394 MB goal, 2 P
Sep 01 19:24:42 etcd-wrapper[15450]: scvg30: inuse: 324, idle: 68, sys: 392, released: 0, consumed: 392 (MB)
Sep 01 19:25:38 etcd-wrapper[15450]: gc 168 @4706.929s 0%: 0.043+43+24 ms clock, 0.086+3.4/26/11+48 ms cpu, 388->390->203 MB, 398 MB goal, 2 P
Sep 01 19:26:48 etcd-wrapper[15450]: gc 169 @4776.897s 0%: 0.018+43+23 ms clock, 0.036+9.4/22/2.9+46 ms cpu, 391->394->205 MB, 401 MB goal, 2 P
Sep 01 19:27:12 etcd-wrapper[15450]: scvg31: inuse: 334, idle: 65, sys: 399, released: 0, consumed: 399 (MB)
Sep 01 19:27:18 etcd-wrapper[15450]: 2017-09-01 19:27:18.095349 I | mvcc: store.index: compact 40569347
Sep 01 19:27:18 etcd-wrapper[15450]: 2017-09-01 19:27:18.102561 I | mvcc: finished scheduled compaction at 40569347 (took 5.235353ms)
Sep 01 19:27:53 etcd-wrapper[15450]: gc 170 @4841.855s 0%: 0.043+47+23 ms clock, 0.086+5.1/24/10+47 ms cpu, 395->398->207 MB, 405 MB goal, 2 P
Sep 01 19:29:03 etcd-wrapper[15450]: gc 171 @4911.833s 0%: 0.028+39+24 ms clock, 0.057+3.7/27/9.7+48 ms cpu, 399->401->208 MB, 409 MB goal, 2 P
Sep 01 19:29:42 etcd-wrapper[15450]: scvg32: inuse: 369, idle: 40, sys: 409, released: 0, consumed: 409 (MB)
Sep 01 19:30:12 etcd-wrapper[15450]: gc 172 @4980.846s 0%: 0.018+34+25 ms clock, 0.037+0.55/17/21+50 ms cpu, 402->404->210 MB, 413 MB goal, 2 P
Sep 01 19:31:33 etcd-wrapper[15450]: gc 173 @5061.662s 0%: 0.64+21+23 ms clock, 1.2+0.43/11/20+47 ms cpu, 406->406->210 MB, 417 MB goal, 2 P
Sep 01 19:32:12 etcd-wrapper[15450]: scvg33: inuse: 262, idle: 150, sys: 412, released: 0, consumed: 412 (MB)
Sep 01 19:32:18 etcd-wrapper[15450]: 2017-09-01 19:32:18.108720 I | mvcc: store.index: compact 40571380
Sep 01 19:32:18 etcd-wrapper[15450]: 2017-09-01 19:32:18.115940 I | mvcc: finished scheduled compaction at 40571380 (took 5.241974ms)
Sep 01 19:33:33 etcd-wrapper[15450]: GC forced
Sep 01 19:33:33 etcd-wrapper[15450]: gc 174 @5181.714s 0%: 0.033+43+48 ms clock, 0.067+0/20/51+97 ms cpu, 354->354->214 MB, 421 MB goal, 2 P
Sep 01 19:34:42 etcd-wrapper[15450]: scvg34: inuse: 302, idle: 111, sys: 413, released: 0, consumed: 413 (MB)
Sep 01 19:35:33 etcd-wrapper[15450]: GC forced
Sep 01 19:35:33 etcd-wrapper[15450]: gc 175 @5301.812s 0%: 0.021+19+23 ms clock, 0.042+0/19/19+46 ms cpu, 350->350->217 MB, 428 MB goal, 2 P
Sep 01 19:37:12 etcd-wrapper[15450]: scvg35: 51 MB released
Sep 01 19:37:12 etcd-wrapper[15450]: scvg35: inuse: 321, idle: 92, sys: 413, released: 51, consumed: 362 (MB)
Sep 01 19:37:18 etcd-wrapper[15450]: 2017-09-01 19:37:18.116918 I | mvcc: store.index: compact 40573523
Sep 01 19:37:18 etcd-wrapper[15450]: 2017-09-01 19:37:18.124327 I | mvcc: finished scheduled compaction at 40573523 (took 5.349669ms)
Sep 01 19:37:33 etcd-wrapper[15450]: GC forced
Sep 01 19:37:33 etcd-wrapper[15450]: gc 176 @5421.871s 0%: 0.022+22+23 ms clock, 0.044+0.18/12/24+46 ms cpu, 343->344->221 MB, 435 MB goal, 2 P
Sep 01 19:39:33 etcd-wrapper[15450]: GC forced
Sep 01 19:39:33 etcd-wrapper[15450]: gc 177 @5541.931s 0%: 0.019+21+24 ms clock, 0.038+0/10/25+49 ms cpu, 367->367->224 MB, 441 MB goal, 2 P
Sep 01 19:39:42 etcd-wrapper[15450]: scvg36: 4 MB released
Sep 01 19:39:42 etcd-wrapper[15450]: scvg36: inuse: 258, idle: 155, sys: 413, released: 40, consumed: 373 (MB)
Sep 01 19:41:33 etcd-wrapper[15450]: GC forced
Sep 01 19:41:33 etcd-wrapper[15450]: gc 178 @5661.986s 0%: 0.022+22+25 ms clock, 0.045+0.22/18/21+50 ms cpu, 362->362->228 MB, 448 MB goal, 2 P
Sep 01 19:42:12 etcd-wrapper[15450]: scvg37: inuse: 285, idle: 133, sys: 419, released: 40, consumed: 378 (MB)
Sep 01 19:42:18 etcd-wrapper[15450]: 2017-09-01 19:42:18.128040 I | mvcc: store.index: compact 40575516
Sep 01 19:42:18 etcd-wrapper[15450]: 2017-09-01 19:42:18.135603 I | mvcc: finished scheduled compaction at 40575516 (took 5.213181ms)
Sep 01 19:42:52 etcd-wrapper[15450]: 2017-09-01 19:42:52.107614 I | wal: segmented wal file /var/lib/etcd/member/wal/0000000000000c38-0000000003474126.wal is created
Sep 01 19:43:12 etcd-wrapper[15450]: 2017-09-01 19:43:12.691504 I | pkg/fileutil: purged file /var/lib/etcd/member/wal/0000000000000c33-000000000345a8cb.wal successfully
Sep 01 19:43:33 etcd-wrapper[15450]: GC forced
Sep 01 19:43:33 etcd-wrapper[15450]: gc 179 @5782.057s 0%: 0.59+24+25 ms clock, 1.1+0.079/11/25+50 ms cpu, 370->370->231 MB, 457 MB goal, 2 P
Sep 01 19:44:42 etcd-wrapper[15450]: scvg38: 0 MB released
Sep 01 19:44:42 etcd-wrapper[15450]: scvg38: inuse: 319, idle: 99, sys: 419, released: 40, consumed: 379 (MB)
Sep 01 19:45:33 etcd-wrapper[15450]: GC forced
Sep 01 19:45:33 etcd-wrapper[15450]: gc 180 @5902.112s 0%: 0.56+22+24 ms clock, 1.1+0.75/10/28+48 ms cpu, 365->365->235 MB, 463 MB goal, 2 P
Sep 01 19:47:12 etcd-wrapper[15450]: scvg39: inuse: 343, idle: 75, sys: 419, released: 40, consumed: 379 (MB)
Sep 01 19:47:18 etcd-wrapper[15450]: 2017-09-01 19:47:18.142319 I | mvcc: store.index: compact 40577528
Sep 01 19:47:18 etcd-wrapper[15450]: 2017-09-01 19:47:18.149404 I | mvcc: finished scheduled compaction at 40577528 (took 5.100714ms)
Sep 01 19:47:33 etcd-wrapper[15450]: GC forced
Sep 01 19:47:34 etcd-wrapper[15450]: gc 181 @6022.166s 0%: 0.025+22+26 ms clock, 0.050+0/11/26+53 ms cpu, 367->367->238 MB, 470 MB goal, 2 P
Sep 01 19:49:34 etcd-wrapper[15450]: GC forced
Sep 01 19:49:34 etcd-wrapper[15450]: gc 182 @6142.231s 0%: 0.57+20+25 ms clock, 1.1+0.082/12/28+51 ms cpu, 388->388->241 MB, 477 MB goal, 2 P
Sep 01 19:49:42 etcd-wrapper[15450]: scvg40: 4 MB released
Sep 01 19:49:42 etcd-wrapper[15450]: scvg40: inuse: 276, idle: 142, sys: 419, released: 25, consumed: 393 (MB)
Sep 01 19:51:01 etcd-wrapper[15450]: gc 183 @6229.889s 0%: 1.0+43+27 ms clock, 2.0+4.0/30/9.3+55 ms cpu, 471->474->246 MB, 483 MB goal, 2 P
Sep 01 19:52:11 etcd-wrapper[15450]: gc 184 @6300.106s 0%: 0.047+40+27 ms clock, 0.094+8.2/29/9.6+55 ms cpu, 476->478->248 MB, 489 MB goal, 2 P
Sep 01 19:52:12 etcd-wrapper[15450]: scvg41: inuse: 302, idle: 189, sys: 492, released: 0, consumed: 492 (MB)
Sep 01 19:52:18 etcd-wrapper[15450]: 2017-09-01 19:52:18.152851 I | mvcc: store.index: compact 40579467
Sep 01 19:52:18 etcd-wrapper[15450]: 2017-09-01 19:52:18.161827 I | mvcc: finished scheduled compaction at 40579467 (took 6.003383ms)
Sep 01 19:53:22 etcd-wrapper[15450]: gc 185 @6370.496s 0%: 4.5+52+26 ms clock, 9.1+3.5/31/15+53 ms cpu, 480->484->251 MB, 493 MB goal, 2 P

One thing I noticed is that I started to see GC forced entries in the logs, which I didn't see before.

Then, after trying to re-enabling the systemd unit again, I noticed the RSS spiked again (around 19:50):

image

So, I disabled it again and I'll let it run for a few hours to see how the memory usage progresses without the snapshot.

/cc @mumoshu @redbaron

@heyitsanthony
Copy link
Contributor

You may want to try GOGC=50; the GC goal is 2x the size of the in-use heap. The snapshot operation seems to be generating a lot of garbage while encoding messages.

@danielfm
Copy link
Author

danielfm commented Sep 1, 2017

@heyitsanthony I'll try that next week

But the ultimate question is: should I worry about this, or is this expected?

@xiang90
Copy link
Contributor

xiang90 commented Sep 1, 2017

But the ultimate question is: should I worry about this, or is this expected?

it seems to me that it is GC related, and you do not really have to worry about it. But, at the mean time, we probably should start to understand this behavior better.

@danielfm
Copy link
Author

danielfm commented Sep 4, 2017

As requested, I let etcd run with GOGC=50 over the weekend, and this is what happened:

image

The variable was set around Sep 1st 8:34PM, after which the RSS peaks went down a little, but the "shape" did not change. Also, the snapshot systemd unit I mentioned earlier was not running during this test, so apparently this behavior does not seem to be caused by it.

I've just re-enabled the snapshot systemd unit; let's see if I notice anything now that I have set GOGC=50.

@redbaron
Copy link

redbaron commented Sep 4, 2017

and you do not really have to worry about it.

@xiang90 , real memory usage increased 2x, it is hardly "nothing to worry about"

@xiang90
Copy link
Contributor

xiang90 commented Sep 4, 2017

@redbaron

I do not think you read my full sentence or probably you misunderstood what I was saying.

To be clear, what I said is that an operator, do not worry about this issue that much since he/she cannot take any immediate action to solve it. And it is not critical enough to kill the cluster. This issue seems more like a GC problem/patter change. We need to understand the problem first before we can give any suggestions or fixes.

@redbaron
Copy link

redbaron commented Sep 4, 2017

@danielfm would you be able to gradually switch versions from 3.2.1 to 3.2.6 without touching anything else? given that all 3 nodes show this behaviour, it probably would be enough to do it on one node by editing 40-version.conf, then systemd daemon-reload, then restarting etcd-member service. We can narrow down to a version where this pattern start to show

@redbaron
Copy link

redbaron commented Sep 4, 2017

Oops, it is 3.1.3 where you are coming from :( furthest you can go is 3.2.0, if you are lucky, there we'll see previous memory usage pattern,then we can narrow it to exact version, if not, then at least we know it happened between 3.1.3 and 3.2.0

@danielfm
Copy link
Author

danielfm commented Sep 4, 2017

Can I downgrade a patch version just by changing the etcd version like that? If so, I'm willing to test it, although I'm not sure we'll see any difference.

(On the other hand, if I could get a Golang v1.7-linked etcd 3.2.6 instead of the Golang v1.8-linked etcd 3.2.6 I'm currently using, I think we could at least rule out changes in the GC itself as being the culprit in this case.)

@redbaron
Copy link

redbaron commented Sep 4, 2017

@danielfm , seems that you can clone etcd repo, checkout tag, change Dockerfile

FROM golang

to

FROM golang:1.7

and then docker build -t etcd-3.2.6-golang17 .

@danielfm
Copy link
Author

danielfm commented Sep 4, 2017

Don't rkt make some sort of verification on the image, meaning that this image would need to be signed by some sort of key?

@redbaron
Copy link

redbaron commented Sep 4, 2017

adding --insecure-options=image should be enough

@danielfm
Copy link
Author

danielfm commented Sep 4, 2017

@redbaron tried doing that, but etcd 3.2.6 does not compile with Go 1.7.

# github.com/coreos/etcd/pkg/transport
src/github.com/coreos/etcd/pkg/transport/listener.go:180: cfg.GetClientCertificate undefined (type *tls.Config has no field or method GetClientCertificate)
src/github.com/coreos/etcd/pkg/transport/listener.go:180: undefined: tls.CertificateRequestInfo
src/github.com/coreos/etcd/pkg/transport/listener_tls.go:178: undefined: net.DefaultResolver in net.DefaultResolver.LookupAddr
# github.com/coreos/etcd/pkg/netutil
src/github.com/coreos/etcd/pkg/netutil/netutil.go:45: undefined: net.DefaultResolver in net.DefaultResolver.LookupPort
src/github.com/coreos/etcd/pkg/netutil/netutil.go:55: undefined: net.DefaultResolver in net.DefaultResolver.LookupIPAddr
# github.com/coreos/etcd/pkg/debugutil
src/github.com/coreos/etcd/pkg/debugutil/pprof.go:28: undefined: runtime.SetMutexProfileFraction
src/github.com/coreos/etcd/pkg/debugutil/pprof.go:30: undefined: runtime.SetMutexProfileFraction
The command '/bin/sh -c go install github.com/coreos/etcd' returned a non-zero code: 2

I want to avoid changing the code just to make it compile. 😅

@redbaron
Copy link

redbaron commented Sep 4, 2017

Well, at least try it with 1.9, maybe there'll be a visible change

@danielfm
Copy link
Author

danielfm commented Sep 4, 2017

@redbaron Did what you suggested:

# Taken from the running binary via `rkt enter <pod>`
$ etcd Version: 3.2.6
Git SHA: Not provided (use ./build instead of go build)
Go Version: go1.9
Go OS/Arch: linux/amd64

The Golang 1.9-based image is available here.

I kept the GODEBUG=gctrace=2 and the --enable-pprof flag, and let the snapshot systemd unit running just like before.

So far, I didn't notice any changes in the memory usage pattern. I'll let it run for a while more just to be certain.

Edit: Nothing changed. At this time, I even tried running the stress utility to keep malloc'ing/freeing memory to see if some memory pressure on the system would change things, but nothing happened.

@danielfm
Copy link
Author

danielfm commented Sep 4, 2017

Well, I spent more time on this than I should have. If anyone has any other suggestions, please let me know and I'll try to provide the requested information as soon as I can.

@matlockx
Copy link

matlockx commented Oct 8, 2017

any news on that? we also had this issue in our pre prod environment. 3.1.5 hadn‘t had this issue. 3.2.6 went up to more than 1gbyte memory consumption.

@gyuho gyuho added this to the v3.3.0-maybe milestone Nov 10, 2017
@fanminshi
Copy link
Member

@danielfm any updates on this? are you still seeing the same issue?

@danielfm
Copy link
Author

@fanminshi I'm now running etcd v3.2.9, but the behavior remains the same.

@fanminshi
Copy link
Member

@danielfm could you show me how to reproduce this issue ? it will be nice if i can reproduce this myself so that I can help you on to debug this.

@danielfm
Copy link
Author

danielfm commented Nov 10, 2017

@fanminshi Sure.

As described previously, my etcd cluster was created by kube-aws, a tool originally authored by CoreOS.

This is the cluster.yaml file (slightly redacted to remove some things specific to my environment, like Security Group IDs, DNS names, and CIDRs) that I'm currently using to create the Kubernetes + etcd cluster that exhibits this behavior:

https://gist.github.com/danielfm/c04e808616a55de74aec2f007dc21ef4

@fanminshi
Copy link
Member

@danielfm from previous providedRPC calls per second. graph, I saw that there were around 80 range requests per second. Those are from API sever right? and How many nodes are you running in the kubenetes cluster?

@danielfm
Copy link
Author

danielfm commented Nov 10, 2017

At the time, the number was between 10-20 worker nodes (m4.large), for 2 controllers (m3.large) and 3 etcd instances (m3.large).

Currently, I'm running 58 nodes (m4.xlarge), for 4 controllers (m4.xlarge), and 3 etcd instances (m4.large), but the memory usage pattern is the same.

Edit: I can send updated graphs if that helps.

@fanminshi
Copy link
Member

@danielfm yea, if you can you send the updated graphs. it will be great. specifically, i want to see memory usage pattern and gc rate.

@danielfm
Copy link
Author

danielfm commented Nov 11, 2017

Edit: Data from the last 12h from all three instances.

rate(go_gc_duration_seconds_count{job="etcd"}[2m])
image

sum(process_resident_memory_bytes{job="etcd"}) by (instance)
image

@fanminshi
Copy link
Member

@danielfm thanks for the data. could you elaborate go_gc_duration_seconds_count. I don't quite understand the meaning of y axis. is it the number of gc calls per two minutes?

@danielfm
Copy link
Author

@fanminshi I updated the graphs, the first one is now the GC rate per second.

@redbaron
Copy link

AFAIK Go GC triggers at 2x heap size, which pretty much matches what we see on graphs. Question is why it wasn't like that before

@xiang90
Copy link
Contributor

xiang90 commented Nov 14, 2017

@redbaron

The memory usage is the total use of the machine (RSS), not just etcd. There is the baseline usage of the system itself.

Thus, I suspect etcd memory usage grows from 3.1.x to 3.2.x, so that its spike is more visible.

But I am not exactly sure. @fanminshi is looking into this.

@fanminshi
Copy link
Member

@danielfm could you explain the y-axis of the graphsum(process_resident_memory_bytes{job="etcd"}) by (instance) ? is it RSS of the entire system like xiang said or it is the RSS of etcd process?

@xiang90
Copy link
Contributor

xiang90 commented Nov 14, 2017

@fanminshi

just to be clear, I am talking about the 1st graph in the thread. some other graphs have the etcd tag on it.

@danielfm
Copy link
Author

danielfm commented Nov 14, 2017

Yes, I believe the graph plots the RSS of the entire system.

Querying only Golang metrics (roughly the same period as previous graphs):

go_memstats_heap_alloc_bytes{job="etcd"}
image

@fanminshi
Copy link
Member

@danielfm thanks for the clarification. @xiang90 you assumption is correct.

let's focus on etcd consumption only.
Could you also plot graphs for HeapReleased, Mallocs, Frees, and Sys as described here https://golang.org/src/runtime/mstats.go?s=6236:15651#L149?

@danielfm
Copy link
Author

go_memstats_heap_released_bytes_total{job="etcd"}
image

rate(go_memstats_mallocs_total{job="etcd"}[1m])
image

rate(go_memstats_frees_total{job="etcd"}[1m])
image

go_memstats_sys_bytes{job="etcd"}
image

I'm not sure if I got the right data, let me know if you need anything else.

@fanminshi
Copy link
Member

fanminshi commented Nov 21, 2017

Problem

There is a spiky memory RSS pattern when upgrading from etcd 3.1.x to 3.2.x.
We saw a similar pattern from etcd's own kubemark test where the kubemark simulates kubernete workload against etcd.

kubetmark 001

The first third shows RSS pattern for etcd Version: 3.1.10 compiled with Go Version: go1.8.5.

The middle section shows RSS pattern for etcd etcd Version: 3.2.9 compiled with Go Version: go1.8.4.

As seen from above, etcd 3.1.10 shows a flatter RSS Pattern than that 3.2.9 and also consumes less memory.

Background

etcd stores all key changes to an in-memory log, and this log is periodically truncated when it reaches to a certain size so that log doesn't increase forever.
etcd leader sends latest entries from the its in-memory log to any follower that's behind.

Analysis

etcd 3.1.x Memory Pattern

By default, 3.1.x truncates the log when 10k new entries are added starting at log idx 0.
For example:
when log idx increases from 0 -> 10000, log[0] -> log[5000] are discarded (5k discarded entries).
when log idx increases from 10000 -> 20000, log[5000] -> log[15000] are discarded (10k discarded entries).

Therefore, etcd periodically discards 5k or 10k oldest entries when the new 10k are added.

We can see the same behavior from the etcd log gathered from kubemark:

Nov 19 23:37:53 kubemark-etcd-1 etcd[19256]: etcd Version: 3.1.10+git
Nov 19 23:37:53 kubemark-etcd-1 etcd[19256]: Go Version: go1.8.5
Nov 19 23:37:53 kubemark-etcd-1 etcd[19256]: snapshot count = 10000
...
Nov 19 23:37:53 kubemark-etcd-1 etcd[19256]: gc 1 @0.045s 4%: 0.16+4.7+0.25 ms clock, 0.33+1.8/2.1/2.
4+0.50 ms cpu, 6->6->5 MB, 7 MB goal, 2 P
...
Nov 19 23:50:33 kubemark-etcd-1 etcd[19256]: gc 19 @759.899s 0%: 3.2+11+0.11 ms clock, 6.5+0.59/10/15+0.23 ms cpu, 61->61->32 MB, 62 MB goal, 2 P
Nov 19 23:50:58 kubemark-etcd-1 etcd[19256]: compacted raft log at 5001
Nov 19 23:51:28 kubemark-etcd-1 etcd[19256]: gc 20 @815.221s 0%: 0.017+16+0.088 ms clock, 0.035+0.45/13/1.8+0.17 ms cpu, 63->63->28 MB, 65 MB goal, 2 P
...
Nov 19 23:59:20 kubemark-etcd-1 etcd[19256]: gc 31 @1287.242s 0%: 1.5+8.7+0.085 ms clock, 3.0+0.15/0.12/15+0.17 ms cpu, 75->75->39 MB, 76 MB goal, 2 P
Nov 20 00:00:13 kubemark-etcd-1 etcd[19256]: compacted raft log at 15002
Nov 20 00:00:13 kubemark-etcd-1 etcd[19256]: gc 32 @1340.524s 0%: 0.033+7.1+0.093 ms clock, 0.067+0.22/6.2/6.7+0.18 ms cpu, 77->77->28 MB, 79 MB goal, 2 P

etcd increases its live heap usage from gc 1: 5 MB to gc 19: 32 MB in about 13 minutes when etcd log grows. When new 10k entries mark is reached, then etcd compacted raft log at 5001 which discards 5k entries;
the live heap decreases from gc 19: 32 MB to gc 20: 28 MB. The next log truncation happened at compacted raft log at 15002.
The live heap stays at gc 32: 28 MB which is same as previous truncation.

Because the etcd keeps its log small and discards entries often, the memory size never grows too much nor decreases too much; this reflects the RSS usage pattern of 3.1.0 from above.

etcd 3.2.x Memory Pattern

In 3.2.x, etcd truncates the log when to 100k new entries are added; pr #7160.
This means that the etcd needs to save about 100k entries in the memory and discards them when new 100k entries are added.
for example:
when log idx increases from 0 -> 100000, log[0] -> log[95000] are discarded (95k discarded entries).
when log idx increases from 100000 -> 200000, log[9501] -> log[195000] are discarded (100k discarded entries).

We can see the behavior from the etcd log gathered from kubemark:

Nov 20 19:16:31 kubemark-etcd-1 etcd[29944]: etcd Version: 3.2.9
Nov 20 19:16:31 kubemark-etcd-1 etcd[29944]: Go Version: go1.8.4
Nov 20 19:16:31 kubemark-etcd-1 etcd[29944]: snapshot count = 100000
Nov 20 19:16:31 kubemark-etcd-1 etcd[29944]: gc 1 @0.032s 1%: 0.042+1.3+0.062 ms clock, 0.084+0.068/1.0/1.1+0.12 ms cpu, 4->4->3 MB, 5 MB goal, 2 P
...
Nov 20 19:43:48 kubemark-etcd-1 etcd[29944]: gc 62 @1637.122s 0%: 0.024+41+0.14 ms clock, 0.048+0.22/21/44+0.29 ms cpu, 286->286->149 MB, 293 MB goal, 2 P
Nov 20 19:43:54 kubemark-etcd-1 etcd[29944]: compacted raft log at 95001
Nov 20 19:44:21 kubemark-etcd-1 etcd[29944]: gc 63 @1669.978s 0%: 0.016+20+0.076 ms clock, 0.033+2.5/11/24+0.15 ms cpu, 291->291->36 MB, 298 MB goal, 2 P
Nov 20 19:44:29 kubemark-etcd-1 etcd[29944]: gc 64 @1678.401s 0%: 0.030+20+0.10 ms clock, 0.061+1.0/10/29+0.20 ms cpu, 71->71->37 MB, 73 MB goal, 2 P
...
Nov 20 19:51:31 kubemark-etcd-1 etcd[29944]: scvg13: 98 MB released
Nov 20 19:51:31 kubemark-etcd-1 etcd[29944]: scvg13: inuse: 114, idle: 195, sys: 309, released: 98, consumed: 211 (MB)
...
Nov 20 20:02:42 kubemark-etcd-1 etcd[29944]: gc 118 @2770.875s 0%: 0.036+61+0.065 ms clock, 0.073+0.15/30/60+0.13 ms cpu, 284->285->150 MB, 291 MB goal, 2 P
Nov 20 20:02:51 kubemark-etcd-1 etcd[29944]: compacted raft log at 195002
Nov 20 20:03:06 kubemark-etcd-1 etcd[29944]: gc 119 @2795.794s 0%: 0.018+45+0.066 ms clock, 0.037+0.16/25/39+0.13 ms cpu, 293->293->43 MB, 300 MB goal, 2 P
Nov 20 20:03:14 kubemark-etcd-1 etcd[29944]: gc 120 @2803.327s 0%: 0.017+30+0.077 ms clock, 0.035+0.23/22/37+0.15 ms cpu, 85->85->45 MB, 87 MB goal, 2 P

etcd increases its live heap usage from gc 1: 3 MB to gc 62: 149 as etcd log increases. When new 100k entries mark is reached, etcd
compacted raft log at 95001 which discards 95k entries, the live head usage drop to gc 63: 36 MB.
In addition, go periodically releases memory back to os as indicated by scvg13: 98 MB released.

Since etcd 3.2 keeps its log large and discard entries less frequently, the memory size grows to a large size and drop heavily when large chunk of entries are being discarded;
This reflects the spiky memory usage pattern of 3.1.10 shown above.

etcd 3.2.x with 10k snapshot count

We did another experiment which changes the snapshot count of 3.2.9 to the same as the one in 3.1.x (10k). We
observe that memory usage pattern is similar to the one in 3.1.x.

the etcd log gathered from kubemark:

Nov 20 20:50:05 kubemark-etcd-1 etcd[30864]: etcd Version: 3.2.9
Nov 20 20:50:05 kubemark-etcd-1 etcd[30864]: Go Version: go1.8.4
Nov 20 20:50:05 kubemark-etcd-1 etcd[30864]: snapshot count = 10000
Nov 20 20:50:05 kubemark-etcd-1 etcd[30864]: gc 1 @0.029s 1%: 0.017+1.1+0.035 ms clock, 0.035+0.052/0.97/1.0+0.071 ms cpu, 4->4->3 MB, 5 MB goal, 2 P
...
Nov 20 20:57:43 kubemark-etcd-1 etcd[30864]: gc 20 @457.872s 0%: 0.024+10+0.053 ms clock, 0.048+0.31/10/8.1+0.10 ms cpu, 61->61->33 MB, 63 MB goal, 2 P
Nov 20 20:57:48 kubemark-etcd-1 etcd[30864]: compacted raft log at 5001
Nov 20 20:57:50 kubemark-etcd-1 etcd[30864]: gc 21 @464.775s 0%: 0.033+7.9+0.059 ms clock, 0.066+0.57/5.8/7.9+0.11 ms cpu, 64->64->28 MB, 66 MB goal, 2 P
...
Nov 20 20:58:56 kubemark-etcd-1 etcd[30864]: gc 30 @530.662s 0%: 0.053+18+0.053 ms clock, 0.10+0.30/10/9.3+0.10 ms cpu, 77->77->41 MB, 79 MB goal, 2 P
Nov 20 20:59:01 kubemark-etcd-1 etcd[30864]: compacted raft log at 15002
Nov 20 20:59:05 kubemark-etcd-1 etcd[30864]: gc 31 @539.384s 0%: 0.013+8.5+0.078 ms clock, 0.026+0.15/8.4/8.2+0.15 ms cpu, 80->80->30 MB, 82 MB goal, 2 P

from above, the live heap keeps around range of 33 MB to 41 MB and is similar to the range 28 MB to 39 MB that of 3.1.x.

The third section of Kubermark RSS graph shows this pattern.

Conclusion

The default snapshot count is change from etcd 3.1.x to etcd 3.2.x which introduces the difference in memory usage pattern.
etce 3.1.x has a small snapshot count which means that it retains a smaller in-memory log and truncates often; hence, it results low in memory growth and low memory decrease.
etcd 3.2.x has a large snapshot count which means that it retains a large in-memory log and truncates less frequently; hence, it results large memory growth and large memory decrease.
the memory pattern of etcd 3.2.x is behaving as expected; users should not worry about this as some kind of regression introduced 3.2.x.
If user wants to keep similar memory usage pattern as in 3.1.x, one can simply decrease the snapshot count to 10k via etcd --snapshot-count=10000 .

@fanminshi
Copy link
Member

Closing this since the behavior is expected. Please reopen if you have any questions!

@danielfm
Copy link
Author

@fanminshi Thank you for the detailed analysis.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

7 participants