memory leak #464
Thanks for the report @JeromeGe. I'm not 100% sure I understand the example config you have here, but I'll see if I can come up with a more manageable reproduction.
@JeromeGe I'm getting started on this but you could probably cut out a lot of debugging time here if you could answer a couple questions:
Because I don't have your application code I can't reproduce your scenario exactly. You're reporting a problem with ContainerPilot itself allegedly leaking memory, not the overall container, so it shouldn't matter what the underlying applications are anyway. But I'm trying to come up with a more minimal reproduction, and this is what I've done so far (without success). I wrote a quickie hello world golang server and built it for Linux:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "Hello, world")
	})
	log.Println("listening on 8080...")
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Dropped it into this Dockerfile:
And used this ContainerPilot config:

```json5
{
consul: "consul:8500",
jobs: [
{
name: "pjob",
port: 8080,
exec: "/bin/pjob start",
restarts: "unlimited",
health: {
exec: "curl --fail -s http://127.0.0.1:8080",
interval: 2,
ttl: 10,
timeout: "10s"
},
},
{
name: "regen-file",
exec: [
"consul-template",
"-once",
"-consul-addr",
"consul:8500",
"-template",
"/etc/containerpilot/servicesTemplate.ctmpl:/etc/containerpilot/services.json"
],
when: {
source: "watch.aaa",
each: "changed"
}
},
{
name: "send-sighup-a",
exec: "pkill -SIGHUP pjob",
when: {
source: "regen-file",
each: "exitSuccess"
}
},
],
watches: [
{
name: "aaa",
interval: 1
}
]
}
```

Then I checked the processes in the container:

```
/ # ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 13:25 ? 00:00:00 /bin/containerpilot -config=/etc/containerpilot.json
root 10 1 0 13:25 ? 00:00:01 /bin/containerpilot -config=/etc/containerpilot.json
root 111 0 0 13:27 pts/0 00:00:00 /bin/sh
root 336 10 0 13:33 ? 00:00:00 /bin/pjob start
root 350 111 0 13:33 pts/0 00:00:00 ps -ef
# pmap the supervisor
/ # pmap 1
1: /bin/containerpilot -config=/etc/containerpilot.json5
0000000000400000 3788K r-xp /bin/containerpilot
00000000007b3000 2868K r--p /bin/containerpilot
0000000000a80000 216K rw-p /bin/containerpilot
0000000000ab6000 140K rw-p [ anon ]
000000c000000000 4K rw-p [ anon ]
000000c41fff0000 64K rw-p [ anon ]
000000c420000000 2048K rw-p [ anon ]
00007f60e676a000 640K rw-p [ anon ]
00007ffc5f40d000 132K rw-p [stack]
00007ffc5f5e7000 8K r--p [vvar]
00007ffc5f5e9000 8K r-xp [vdso]
ffffffffff600000 4K r-xp [vsyscall]
mapped: 9920K
# pmap the worker
/ # pmap 10
10: /bin/containerpilot -config=/etc/containerpilot.json5
0000000000400000 3788K r-xp /bin/containerpilot
00000000007b3000 2868K r--p /bin/containerpilot
0000000000a80000 216K rw-p /bin/containerpilot
0000000000ab6000 140K rw-p [ anon ]
000000c000000000 8K rw-p [ anon ]
000000c41ffd0000 6336K rw-p [ anon ]
00007fc7efffe000 2176K rw-p [ anon ]
00007fffc763b000 132K rw-p [stack]
00007fffc77dc000 8K r--p [vvar]
00007fffc77de000 8K r-xp [vdso]
ffffffffff600000 4K r-xp [vsyscall]
mapped: 15684K
/ # free -m
total used free shared buffers cached
Mem: 1998 610 1388 163 51 391
-/+ buffers/cache: 167 1831
Swap: 964 0 964
```

The pmap doesn't change over time, nor does the free memory. I've …
Edit: I can't seem to make it happen except that once …
Just FYI @JeromeGe, if you're in an environment where there's a Consul agent running on the underlying host, as suggested by your …

Edit: worse than that, nothing else in the container is even talking to that Consul agent, so it's not doing anything for you here.
We have a Prometheus metrics endpoint and that's giving me less help than I'd expect. I wrote a script to SIGHUP the …
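The script itself isn't preserved in this thread, but the idea is just to hammer the process with SIGHUPs on a fixed interval while watching the metrics. A minimal sketch of that kind of loop, illustrative only and assuming the target PID is passed as an argument (not the actual test script):

```go
package main

import (
	"log"
	"os"
	"strconv"
	"syscall"
	"time"
)

func main() {
	// Stand-in for whatever `pkill -SIGHUP pjob` would match: the target
	// PID is passed as the only argument.
	pid, err := strconv.Atoi(os.Args[1])
	if err != nil {
		log.Fatal(err)
	}
	proc, err := os.FindProcess(pid)
	if err != nil {
		log.Fatal(err)
	}
	// Send SIGHUP once a second for 100 seconds.
	for i := 0; i < 100; i++ {
		if err := proc.Signal(syscall.SIGHUP); err != nil {
			log.Fatal(err)
		}
		time.Sleep(1 * time.Second)
	}
}
```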
The pmap is unchanged after running this for 100 seconds. We do see …
@JeromeGe I've rebuilt my test on the phusion base, made sure I had overlapping SIGHUPs, added items to the Consul catalog and removed them to make sure everything is getting triggered, etc., and I'm still not seeing any memory leak. Can you provide some detailed instructions to reproduce? Ideally with a Dockerfile or Docker image? Also, whatever you can provide in terms of logs, the output of …
Sorry for the delay. @JeromeGe is taking a vacation right now, so I'm arranging for another colleague to provide a reproduction demo; we'll submit it today. Thanks for your help :)
Still trying to reproduce this with a clean setup; it seems the leak has something to do with our environment. We are running on Kubernetes with a Consul cluster and dozens of RPC service instances registered and discovered by ContainerPilot, so it might take some time.
@inetfuture understood on that. Please let me know if there's anything I can do to help. Even if you have the output of …
@tgross This is the memory state of ContainerPilot after running for 7 days.
@JeromeGe what I'm trying to determine is whether that is increasing over time and in what areas. I'm looking for data from 2 or more snapshots over time. It looks like there are a lot of these 1MB heap allocations, but if that isn't increasing over time, it could very well just be that we have a bunch of goroutines with large amounts of data associated with them.
I'm running a soak test on this now, with the golang heap profiler running and a test script pulling a profile every 60 seconds. I patched ContainerPilot in …

```diff
+ "net/http"
+ _ "net/http/pprof" // DEBUG profiler
```

And then at the top of the …

```diff
+ go func() {
+     log.Println(http.ListenAndServe("0.0.0.0:6060", nil))
+ }()
```

After building that with …

```
$ ./test.sh build
Sending build context to Docker daemon 17.37MB
Step 1/2 : FROM alpine:latest
---> 7328f6f8b418
Step 2/2 : RUN apk --no-cache add curl bash
---> Running in a699bccca185
fetch http://dl-cdn.alpinelinux.org/alpine/v3.6/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.6/community/x86_64/APKINDEX.tar.gz
(1/9) Installing ncurses-terminfo-base (6.0-r8)
(2/9) Installing ncurses-terminfo (6.0-r8)
(3/9) Installing ncurses-libs (6.0-r8)
(4/9) Installing readline (6.3.008-r5)
(5/9) Installing bash (4.3.48-r1)
Executing bash-4.3.48-r1.post-install
(6/9) Installing ca-certificates (20161130-r2)
(7/9) Installing libssh2 (1.8.0-r1)
(8/9) Installing libcurl (7.55.0-r0)
(9/9) Installing curl (7.55.0-r0)
Executing busybox-1.26.2-r5.trigger
Executing ca-certificates-20161130-r2.trigger
OK: 14 MiB in 20 packages
---> 35d4f724a50b
Removing intermediate container a699bccca185
Successfully built 35d4f724a50b
Successfully tagged leaky:latest
$ ./test.run
0371d552dae2bac125e796fa620be918d7481b9bfb8b95f91bb4f452c68b07a9
d72de5d57934b3c754aa4b902c8afa4a488a4cecdedbc2a867de6667ff70170c
60715b3df4561e03075db0952dc2f72cbb1992358ae60267068866e51014f578
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
60715b3df456 leaky "/test.sh do-profile" 2 seconds ago Up 1 second leaky_profiler
d72de5d57934 leaky "/bin/containerpil..." 3 seconds ago Up 1 second 0.0.0.0:6060->6060/tcp leaky
0371d552dae2 consul:latest "docker-entrypoint..." 3 seconds ago Up 2 seconds 8300-8302/tcp, 8500/tcp, 8301-8302/udp, 8600/tcp, 8600/udp leaky_consul
```

Important items of note:
So far I'm only seeing periodic increases of heap allocations followed by them being GC'd, but given that the leak might be really slow, we'll need to do some careful examination over a long period of time. I'll let this soak all day and see what we come up with.
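As an aside for anyone reproducing this setup in their own Go binary: the patch above is just the standard net/http/pprof pattern. A self-contained sketch of the same idea (not ContainerPilot's actual code):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
	"time"
)

func main() {
	// Serve the profiler on a side port, just like the debug patch does.
	go func() {
		log.Println(http.ListenAndServe("0.0.0.0:6060", nil))
	}()

	// Stand-in for the real application's work.
	for {
		time.Sleep(10 * time.Second)
	}
}
```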
I've been soaking the above test all day and I'm not seeing any evidence of a memory leak at this point. Here's a gist with the raw profile: https://gist.github.com/tgross/4492d9b83ae5f34a544026466502ef55

The only live objects:
The summary at the bottom:
There's a good chunk of …

```go
// HeapIdle minus HeapReleased estimates the amount of memory
// that could be returned to the OS, but is being retained by
// the runtime so it can grow the heap without requesting more
// memory from the OS. If this difference is significantly
// larger than the heap size, it indicates there was a recent
// transient spike in live heap size.
```
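That comment is quoted from the runtime.MemStats documentation for HeapIdle. For anyone who wants to watch those numbers directly rather than through pprof, a small illustrative sketch (not part of the test harness used here):

```go
package main

import (
	"log"
	"runtime"
	"time"
)

func main() {
	var m runtime.MemStats
	for {
		runtime.ReadMemStats(&m)
		// HeapIdle - HeapReleased is the memory retained by the runtime that
		// could be returned to the OS, per the doc comment quoted above.
		log.Printf("HeapInuse=%dKB HeapIdle=%dKB HeapReleased=%dKB retained=%dKB",
			m.HeapInuse/1024, m.HeapIdle/1024, m.HeapReleased/1024,
			(m.HeapIdle-m.HeapReleased)/1024)
		time.Sleep(60 * time.Second)
	}
}
```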
@JeromeGe what I'm going to do next is provide you with a build that includes the profiler I'm using above. Then, if you can run that build in your existing setup, I can get you to pull the heap profile after letting it soak for a while, and hopefully that'll tell me more about what's going on.
@JeromeGe use the build I've attached. The version should read as follows:

```
$ /containerpilot -version
Version: 3.3.0-debug-profiler
GitHash: e7fda21
```

What I'd like you to do is give ContainerPilot a couple of minutes to "warm up" (i.e. allocate as much as we expect it to for the purposes of running all the jobs you've given it), and then run:

I'm not sure what you'll need for …

Then let the container soak for a few hours at least and do that again. Drop both profile files into gists and link them here. They won't contain any information private to your organization.
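The exact commands aren't preserved above, but with the debug-profiler build the heap profile is served at /debug/pprof/heap on the port the patch opened (6060). As a sketch of one way to snapshot it to a file for later comparison with go tool pprof (a hypothetical helper, not the instructions originally attached):

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	// Pull the heap profile from the debug build's pprof endpoint.
	resp, err := http.Get("http://localhost:6060/debug/pprof/heap")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Write it to a timestamped file so snapshots taken hours apart can be
	// compared with `go tool pprof`.
	name := "heap-" + time.Now().Format("20060102-150405") + ".pprof"
	f, err := os.Create(name)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := io.Copy(f, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote", name)
}
```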
@tgross It has been deployed; I'll follow up with the results.
We have been able to reproduce this with a clean setup: https://github.com/suninuni/containerpilot-test.git. Start it and wait for about 17 hours; the memory usage of ContainerPilot will double, from 15 MB to 33 MB.
@tgross This is the result.
I've unzipped that file to a gist: https://gist.github.com/tgross/dcecad62e0848e6f6a87cdb9a6e4cd17

These are the only live heap objects when this profile was taken:
And the summary at the bottom is only showing ~7MB of heap allocations.
Next step for me is to take the demo at https://github.com/suninuni/containerpilot-test.git and verify the profiles there.
I've been running this all day and taking a heap profile once a minute. The live objects in the heap profile are not increasing, but the RSS size as reported by …

I've verified that we're not leaking goroutines. The number here isn't increasing over time: https://gist.github.com/tgross/efaff7ca57743af1f59baffa02165594. The number associated with …

I've taken all the heap profiles and processed them into a CSV with a little Python script, and then collected that data into the chart below. The meanings of each of these elements is in the …

But whatever is being leaked is not showing up in the profiles. The profiler is a sampling profiler, so it's possible that the objects are small enough that they're not being hit. Here are the next few things we're going to try:
This article from Intel has been a great help, by the way, if anyone following along is interested in digging into problems like this on their own.
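As a lighter-weight complement to pulling the full goroutine profile, the total count can be polled from the same pprof endpoint (here assumed to be the debug build on :6060). A small sketch, illustrative only and not the harness used in this thread:

```go
package main

import (
	"bufio"
	"log"
	"net/http"
	"time"
)

func main() {
	for {
		// The first line of the text-format goroutine profile reads
		// "goroutine profile: total N", which is enough to spot a leak.
		resp, err := http.Get("http://localhost:6060/debug/pprof/goroutine?debug=1")
		if err != nil {
			log.Fatal(err)
		}
		line, err := bufio.NewReader(resp.Body).ReadString('\n')
		resp.Body.Close()
		if err != nil {
			log.Fatal(err)
		}
		log.Print(line)
		time.Sleep(60 * time.Second)
	}
}
```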
Update: while we had some success in getting heap dumps via …

So we're not quite at an impasse, because we have valuable data; we just need to figure out how to interpret it.
Maybe we should try to pin down the exact config option causing this leak, then just look at the related code?
That's feasible if we can narrow down the affected area, which is why I was trying to get a minimal reproduction. From the repro given to us, we know there are a couple of areas of the code that aren't involved:
That still leaves a lot of surface area. The next thing I'm going to try is to wrap https://github.com/golang/tools/tree/master/cmd/heapview/internal/core into a minimal CLI and then try to use it to load the heap dumps. That tool is designed to read these heap dumps but is just incomplete (as in: it never calls the code to load the heap dump!)
Well, this turns out to be much more incomplete than I thought. So I think that's going to be a lot more work than we want to try right now.
I'm running a test with 6 different configurations described in this gist: https://gist.github.com/tgross/1832f5bfdd4ff1f3d0e022809257b70c. I'm able to reproduce the leak in the most minimal configuration, which is a single job:

```json5
{
consul: "consul:8500",
logging: { level: "INFO" },
jobs: [
{
name: "job1",
exec: "tail -F"
}
]
}
```

When I check the heap profiles, I get a profile like the following:
Which shows that we're not allocating anything the GC is finding to collect. What's really interesting is that the …
Here's the goroutine profile for that instance: https://gist.github.com/tgross/31390d2cf2ff03566a60bede486a864f

It's unchanged over several minutes, so there are no goroutines getting cycled out here (as we'd expect). I'm pretty sure everything is blocking on getting a semaphore or waiting for I/O at this point, but if I'm missing something, that's probably the source of our leak.
Never mind, I got ahead of myself on this one -- there's no GC happening in that example because we're not hitting the …
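Related to that: the runtime exposes the next GC trigger point and a way to force a collection, which is handy when a test workload allocates too little to trigger GC on its own. A small illustrative sketch (not part of this test setup):

```go
package main

import (
	"log"
	"runtime"
)

func main() {
	var m runtime.MemStats

	runtime.ReadMemStats(&m)
	log.Printf("before: HeapAlloc=%dKB NextGC=%dKB NumGC=%d",
		m.HeapAlloc/1024, m.NextGC/1024, m.NumGC)

	// Force a collection even if HeapAlloc never reaches NextGC on its own.
	runtime.GC()

	runtime.ReadMemStats(&m)
	log.Printf("after:  HeapAlloc=%dKB NextGC=%dKB NumGC=%d",
		m.HeapAlloc/1024, m.NextGC/1024, m.NumGC)
}
```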
I've been running these profiles for about 5 hours now, and I'm starting to see a signal take shape. Given this:
I'm going to continue to soak these tests for the next 5-6 hours at least, and hopefully have some signal on those remaining 2 tests. In the meantime, closely examining the potential target areas seems like a productive direction.
Ok, with that added direction, I've found at least one point where there's a problem. When we create a new timer in …

I'll take a pass through the rest of the code base looking for any similar constructs, and then push up a PR with the fix and run that code through the same kind of soak test.
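The description above is cut off (the actual change is in #488), but the general shape of this kind of Go bug is creating a new timer on every pass through a loop and never stopping it, so the runtime keeps each timer alive until it fires. A generic illustration of the leaky pattern and the usual fix (not ContainerPilot's actual code):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// leaky allocates a brand-new timer via time.After on every iteration; each
// one stays registered with the runtime until it fires an hour later, so
// memory grows with the number of iterations.
func leaky(work <-chan struct{}) {
	for {
		select {
		case <-work:
			// handle work
		case <-time.After(time.Hour):
			return
		}
	}
}

// fixed reuses a single timer, stopping and resetting it per iteration, so
// there is only ever one live timer no matter how long the loop runs.
func fixed(work <-chan struct{}) {
	timer := time.NewTimer(time.Hour)
	defer timer.Stop()
	for {
		select {
		case <-work:
			// handle work, then push the deadline out again
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(time.Hour)
		case <-timer.C:
			return
		}
	}
}

func main() {
	work := make(chan struct{})
	go leaky(work) // swap in fixed(work) to compare

	for i := 0; i < 100000; i++ {
		work <- struct{}{}
	}

	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("HeapAlloc after 100000 iterations: %d KB\n", m.HeapAlloc/1024)
}
```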
I've got a build with the fix in #488 attached below, if anyone watching this thread wants to try that out.

containerpilot-dev-c2377fa.tar.gz
I soaked the tests overnight and got some good data. First, we can see that the …

This is the same for the …

The …

Likewise with the new code and the …
Out of curiosity, I dug into the runtime code that implements …
With this gift of hindsight, we can take a look at the core we got earlier in delve and dig down to where this is.
The …

Trying to use this toolset was a bit like finding a needle in a haystack. I imagine the ongoing efforts for …
One last note: the ever-so-slight upwards trendline in the new code in #464 (comment) goes away if we disable the …
We can confirm this bug has been fixed; no obvious leak after 16 hours of running. Thanks!
I was using ContainerPilot v3.3.0 and found a memory leak in the ContainerPilot process.
Linux 192-168-161-100 4.4.0-79-generic #100-Ubuntu SMP Wed May 17 19:58:14 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Ubuntu 16.04
Docker image: phusion/baseimage:0.9.18
ContainerPilot configuration:
pmap -d ${containerpilot PID}
Dumping and viewing the ContainerPilot memory shows that it contains a large number of duplicates of the following: