OOM killer causing the CI EOFError on linux recently #11553
Although I wrote it in the title, I can't guarantee that this is the issue behind those failures. Also, is there a reason
Interesting. I've seen in gdb that there is a SIGFPE thrown during normal operation in the numbers test, but I think it's supposed to get caught and handled as a Julia exception.
Possibly related to #11351? (a normally caught write error to read-only memory, except when run from
If it's repeatable with a patch, can you try bisecting, applying the patch as needed at each step?
Sorry, I didn't notice that the signal handler was installed before my loop. Moving the loop to the front suppresses the printing of
The reason it is failing on my VPS is actually the OOM killer, which also sounds like a possible candidate for the CI failure. What's the configuration of the Linux CI, and is there a memory log / dmesg for it?
This is what I see in the output of dmesg.
It happens on a machine with
We have had situations where the cause of the CI failure was probably hitting the OOM killer, but we were never absolutely positive about it. The resolution of some common-seeming failures at the time was to find particular tests that allocated a large amount of memory and comment them out (we should find those and move them to
Any suggestions for how to find out whether Travis is hitting an OOM killer, and/or how to make Julia give a better backtrace when it does? I suspect the VMs there don't have much memory, but I don't know enough details to determine how to debug it. We could open an issue at travis-ci/travis-ci and ask for help.
Well, the OOM killer should leave a trace in the kernel log and, IIRC, the return code of the process also indicates whether the process was killed.
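As a rough illustration of those two checks (this sketch is mine, not from the thread; it assumes a Linux host, the 0.4-era process API with spawn/readall, and a placeholder command line):

# Hypothetical sketch: run a test worker and see whether it was killed by a signal.
proc = spawn(`./julia test/runtests.jl all`)   # placeholder command
wait(proc)
if proc.termsignal == 9       # SIGKILL, which is what the OOM killer sends
    println("worker was SIGKILLed -- possibly the OOM killer")
    # The kernel log is the authoritative record: look for "Out of memory: Kill process ...".
    println(readall(`dmesg`))
end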
Well, it looks like instead of propagating the return code of child processes we're ending up with
Or if the full output of
It should print something like
OOM ahoy: https://travis-ci.org/JuliaLang/julia/jobs/65200677 Time to bust out
Apparently julia is not the only victim either. :P
So what's the roadmap for fixing (or working around) this issue?
And by fix I mean something like clearing the global variables and letting the GC free the memory.
We could get slightly more resources by switching to Travis' docker-based workers. But those don't allow you to use sudo, so we could no longer use the juliadeps PPA. We'd have to either download a generic nightly binary and extract the dependency .so's from there, or build the deps from source and cache them (which Travis lets you do on a Docker worker). I can help with this if someone wants to start experimenting in this direction.
Worth trying. Would be a bit annoying to make Travis take twice as long as it does now, but generally we have to wait for the AppVeyor queue anyway so it might not be that bad.
This is probably the best option for now, assuming we can find which tests are allocating the most. There might be some egregious outliers; dunno, I haven't profiled memory use in a while. You could print out the memory remaining at the end of each test file as a coarse way of finding out which test is allocating the most. There might also be an underlying bug or regression in memory consumption while running the tests, or maybe some newly added test[s] push this over the edge to being killed more often.
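For example, a coarse helper along these lines could be called at the end of runtests in test/testdefs.jl (my sketch, not code from the thread; it assumes Linux's /proc/meminfo and the 0.4-era readall). A fuller RSS-based patch appears later in this thread.

# Hypothetical helper: print how much system memory is left after each test file,
# so the test that drags the worker closest to the OOM threshold stands out.
function print_free_mem(name)
    for line in split(readall("/proc/meminfo"), '\n')
        if startswith(line, "MemFree:")   # value is reported in kB
            println("after $name: $line")
            break
        end
    end
end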
I'll probably play with the memory consumption of the tests sometime later, unless someone gets annoyed enough to fix it first, which is extremely likely :). I'm a little surprised that the
I'll add
Sorry, I have no idea for OS X... Someone else probably knows it better (or could try it). A quick Google search seems to suggest it at least exists. Edit: although I don't know whether there's an OOM killer on OS X or how it works.
this may help at least identify OOM-killed failures, ref #11553 [av skip]
Apparently it needs
Thanks for editing the top post, by the way.
Is this http://docs.travis-ci.com/user/speeding-up-the-build/ something to look into? Specifically, splitting the test suite into 4 sets and running them concurrently on 4 Travis nodes, considering that the test suite will only grow over time.
I just did a simple test by printing out the resident set size (RSS) after each test finishes (patch below). Clearing the globals does help a little, but with 4 workers the memory consumption of each one still inevitably grows to ~1GB by the time the tests finish. Is there an easy way to figure out what is taking up that space? (e.g. is the GC not freeing some objects? Is it the generated code? Or is the GC not reusing space efficiently enough?)
diff --git a/test/testdefs.jl b/test/testdefs.jl
index e7ddcda..1cb9f11 100644
--- a/test/testdefs.jl
+++ b/test/testdefs.jl
@@ -2,10 +2,38 @@
 using Base.Test
 
+function isnotconst(m, name)
+    # taken from inference.jl
+    isdefined(m, name) && (ccall(:jl_is_const, Int32, (Any, Any), m, name) == 0)
+end
+
+function clear_globals()
+    m = current_module()
+    for name in names(m)
+        if isnotconst(m, name)
+            try
+                m.eval(:($name = nothing))
+            end
+        end
+    end
+end
+
+function print_mem_size()
+    pid = getpid()
+    open("/proc/$pid/statm") do fd
+        statm = split(readall(fd))
+        rss = parse(Int, statm[2]) * 4
+        println("Ram ($pid): $rss KB")
+    end
+end
+
 function runtests(name)
     @printf(" \033[1m*\033[0m \033[31m%-20s\033[0m", name)
     tt = @elapsed Core.include(abspath("$name.jl"))
     @printf(" in %6.2f seconds\n", tt)
+    # clear_globals()
+    # gc()
+    print_mem_size()
     nothing
 end
@amitmurthy Yeah, we might want to try that, if we don't mind our queue time going up accordingly. That would also be a valid approach for trying to bring back OS X Travis and hopefully keeping it under the time limit. The downside of that approach is that compiling the C code and sysimg of Julia from source takes up a significant portion of the runtime on CI, and I think that would end up being repeated across each matrix build unless we tried some really clever caching logic.
If the 1GB per worker is due to compiled code, it may be worth trying out something like
Where each test_set is a subset of the current large number of tests.
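As a hedged sketch of what the Julia side of that could look like (the TEST_SET variable name and the groupings here are invented for illustration), each Travis job would export a different TEST_SET and the driver would run only that slice:

# Hypothetical driver snippet: pick a subset of the test files based on an env var.
test_sets = Dict(
    "1" => ["core", "numbers", "strings"],
    "2" => ["linalg", "sparse"],
    "3" => ["parallel", "file", "random"],
)
set = get(ENV, "TEST_SET", "")
if haskey(test_sets, set)
    for t in test_sets[set]
        runtests(t)   # runtests as defined in test/testdefs.jl
    end
end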
@amitmurthy Yeah. I guess that should fix any kind of "memory leak" as well. It should be fine to do that for the CI, but I'm also interested in whether there are other problems that need to be fixed.
Any chance of using the "containers" infrastructure (http://docs.travis-ci.com/user/workers/container-based-infrastructure/)? More memory is available (4GB rather than 3GB). Presumably our usage of
Because this is an urgent issue (lots of PRs now have failing builds because of this), my feeling is that we should just set
Not easily. We need to do a lot of apt-get work from the juliadeps PPA, which we would need to get whitelisted first, and the set of i386 packages that we install to make 32-bit Linux builds work would likely be a challenge, since many of them conflict with the normal 64-bit packages and I don't think Travis lets you do a matrix build with the apt-get addon (which you need to use for package installation on the container setup). In short, I haven't tried this and haven't had time to look into it in detail.
Okay, I merged |
It started happening again... https://travis-ci.org/JuliaLang/julia/jobs/84749848 😢
Re-adding the priority label since we are starting to see continual failures again... We can set
If JC can spare some cash, would it be worth considering a paid plan? (I note that more memory does not seem to be a feature offered in the plans here, but I imagine one could ask...) There's been enough time lost to this issue that ponying up might be expedient and cost-effective. |
JC can fund this if Travis offers it and the price is reasonable. |
I asked the Travis support people, and got the following response:
may help with #11553? [av skip]
Update libunwind7 to libunwind8, build openlibm from source
Closing since this should be fixed by #13577 now. Adding the benchmark label since I think we should also have a benchmark to keep track of memory usage. (Feel free to remove the label if there's already one for this purpose.)
Real reason is the OOM killer (see the updated title and comment below).

Original post:

With the patch below (the /dev/tty part might not be necessary) I've got the following backtrace on a machine that can reproduce the EOFError issue (or what looks like it).