OOM killer causing the CI EOFError on linux recently #11553
Although I wrote it in the title, I can't guarantee that this is the issue behind those failures. Also, is there a reason
Interesting. I've seen in gdb that there is a SIGFPE thrown during normal operation in the numbers test, but I think it's supposed to get caught and handled as a Julia exception.
Possibly related to #11351? (a normally caught write error to read-only memory, except when run from
If it's repeatable with a patch, can you try bisecting, applying the patch as needed at each step?
Sorry, I didn't notice that the signal handler was installed before my loop. Moving the loop to the front suppresses the printing of
The reason it is failing on my VPS is actually the OOM killer, which also sounds like a possible candidate for the CI failure. What's the configuration of the Linux CI, and is there a memory log / dmesg for it?
This is what I see in the output of dmesg.
It happens on a machine with
We have had situations where the cause of the CI failure was probably hitting the OOM killer, but we were never absolutely positive about it. The resolution of some common-seeming failures at the time was to find particular tests that allocated a large amount of memory and comment them out (we should find those and move them to
Any suggestions for how to find out whether Travis is hitting an OOM killer, and/or how to make Julia give a better backtrace when it does? I suspect the VMs there don't have much memory, but I don't know enough details to determine how to debug it. We could open an issue at travis-ci/travis-ci and ask for help.
Well, the OOM killer should leave a trace in the kernel log and, IIRC, the return code of the process also indicates whether the process was killed.
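As a rough illustration of those two checks (this sketch is mine, not from the thread; it assumes a Linux host, the 0.4-era process API with spawn/readall, and a placeholder command line):

# Hypothetical sketch: run a test worker and see whether it was killed by a signal.
proc = spawn(`./julia test/runtests.jl all`)   # placeholder command
wait(proc)
if proc.termsignal == 9       # SIGKILL, which is what the OOM killer sends
    println("worker was SIGKILLed -- possibly the OOM killer")
    # The kernel log is the authoritative record: look for "Out of memory: Kill process ...".
    println(readall(`dmesg`))
end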
Well, it looks like instead of propagating the return code of child processes we're ending up with
Or if the full output of
It should print something like
OOM ahoy: https://travis-ci.org/JuliaLang/julia/jobs/65200677 Time to bust out
Apparently julia is not the only victim either. :P
So what's the roadmap for fixing (or working around) this issue?
And by fix I mean something like clearing the global variables and letting the GC free the memory.
We could get slightly more resources by switching to Travis' docker-based workers. But those don't allow you to use sudo, so we could no longer use the juliadeps PPA. We'd have to either download a generic nightly binary and extract the dependency .so's from there, or build the deps from source and cache them (which Travis lets you do on a Docker worker). I can help with this if someone wants to start experimenting in this direction.
Worth trying. Would be a bit annoying to make Travis take twice as long as it does now, but generally we have to wait for the AppVeyor queue anyway so it might not be that bad.
This is probably the best option for now, assuming we can find which tests are allocating the most. There might be some egregious outliers; dunno, I haven't profiled memory use in a while. You could print out the memory remaining at the end of each test file as a coarse way of finding out which test is allocating the most. There might also be an underlying bug or regression in memory consumption while running the tests, or maybe some newly added test[s] push this over the edge to being killed more often.
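For example, a coarse helper along these lines could be called at the end of runtests in test/testdefs.jl (my sketch, not code from the thread; it assumes Linux's /proc/meminfo and the 0.4-era readall). A fuller RSS-based patch appears later in this thread.

# Hypothetical helper: print how much system memory is left after each test file,
# so the test that drags the worker closest to the OOM threshold stands out.
function print_free_mem(name)
    for line in split(readall("/proc/meminfo"), '\n')
        if startswith(line, "MemFree:")   # value is reported in kB
            println("after $name: $line")
            break
        end
    end
end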
I'll probably play with the memory consumption of the tests sometime later, unless someone gets annoyed enough to fix it first, which is extremely likely :). I'm a little surprised that the
I'll add
Sorry, I have no idea for OS X... Someone else probably knows it better (or could try it). A quick Google search seems to suggest it at least exists. Edit: although I don't know whether there's an OOM killer on OS X or how it works.
this may help at least identify OOM-killed failures, ref #11553 [av skip]
Apparently it needs
Thanks for editing the top post, by the way.
Is this http://docs.travis-ci.com/user/speeding-up-the-build/ something to look into? Specifically, splitting the test suite into 4 sets and running them concurrently on 4 Travis nodes, considering that the test suite will only grow over time.
I just did a simple test by printing out the resident set size (RSS) after each test finishes (patch below). Clearing the globals does help a little, but with 4 workers the memory consumption of each one still inevitably grows to ~1GB by the time the tests finish. Is there an easy way to figure out what is taking up that space? (e.g. is the GC not freeing some objects? Is it the generated code? Or is the GC not reusing space efficiently enough?)
diff --git a/test/testdefs.jl b/test/testdefs.jl
index e7ddcda..1cb9f11 100644
--- a/test/testdefs.jl
+++ b/test/testdefs.jl
@@ -2,10 +2,38 @@
 using Base.Test
 
+function isnotconst(m, name)
+    # taken from inference.jl
+    isdefined(m, name) && (ccall(:jl_is_const, Int32, (Any, Any), m, name) == 0)
+end
+
+function clear_globals()
+    m = current_module()
+    for name in names(m)
+        if isnotconst(m, name)
+            try
+                m.eval(:($name = nothing))
+            end
+        end
+    end
+end
+
+function print_mem_size()
+    pid = getpid()
+    open("/proc/$pid/statm") do fd
+        statm = split(readall(fd))
+        rss = parse(Int, statm[2]) * 4
+        println("Ram ($pid): $rss KB")
+    end
+end
+
 function runtests(name)
     @printf(" \033[1m*\033[0m \033[31m%-20s\033[0m", name)
     tt = @elapsed Core.include(abspath("$name.jl"))
     @printf(" in %6.2f seconds\n", tt)
+    # clear_globals()
+    # gc()
+    print_mem_size()
     nothing
 end
@amitmurthy Yeah, we might want to try that, if we don't mind our queue time going up accordingly. That would also be a valid approach for trying to bring back OS X Travis and hopefully keeping it under the time limit. The downside of that approach is that compiling the C code and sysimg of Julia from source takes up a significant portion of the runtime on CI, and I think that would end up being repeated across each matrix build unless we tried some really clever caching logic.
If the 1GB per worker is due to compiled code, it may be worth trying out something like
Where each test_set is a subset of the current large number of tests.
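As a hedged sketch of what the Julia side of that could look like (the TEST_SET variable name and the groupings here are invented for illustration), each Travis job would export a different TEST_SET and the driver would run only that slice:

# Hypothetical driver snippet: pick a subset of the test files based on an env var.
test_sets = Dict(
    "1" => ["core", "numbers", "strings"],
    "2" => ["linalg", "sparse"],
    "3" => ["parallel", "file", "random"],
)
set = get(ENV, "TEST_SET", "")
if haskey(test_sets, set)
    for t in test_sets[set]
        runtests(t)   # runtests as defined in test/testdefs.jl
    end
end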
@amitmurthy Yeah. I guess that should fix any kind of "memory leak" as well. It should be fine to do that for the CI, but I'm also interested in whether there are other problems that need to be fixed.
Any chance of using the "containers" infrastructure (http://docs.travis-ci.com/user/workers/container-based-infrastructure/)? More memory is available (4GB rather than 3GB). Presumably our usage of
Because this is an urgent issue (lots of PRs now have failing builds because of this), my feeling is that we should just set
Not easily. We need to do a lot of apt-get work from the juliadeps PPA, which we would need to get whitelisted first, and the set of i386 packages that we install to make 32-bit Linux builds work would likely be a challenge, since many of them conflict with the normal 64-bit packages and I don't think Travis lets you do a matrix build with the apt-get addon (which you need to use for package installation on the container setup). In short, I haven't tried this and haven't had time to look into it in detail.
Okay, I merged |
It started happening again... https://travis-ci.org/JuliaLang/julia/jobs/84749848 😢
Re-adding the priority label since we are starting to see continual failures again... We can set
If JC can spare some cash, would it be worth considering a paid plan? (I note that more memory does not seem to be a feature offered in the plans here, but I imagine one could ask...) There's been enough time lost to this issue that ponying up might be expedient and cost-effective. |
JC can fund this if Travis offers it and the price is reasonable. |
I asked the Travis support people, and got the following response:
may help with #11553? [av skip]
Update libunwind7 to libunwind8, build openlibm from source
Closing since this should be fixed by #13577 now. Adding the benchmark label since I think we should also have a benchmark to keep track of memory usage. (Feel free to remove the label if there's already one for this purpose.)
Real reason is the OOM killer (see the updated title and comment below).

Original post:

With the patch below (the /dev/tty part might not be necessary) I've got the following backtrace on a machine that can reproduce the EOFError issue (or what looks like it).