Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix #12853: kill detached compilation process on an exception #12855

Merged
merged 2 commits into from
Sep 1, 2015

Conversation

stevengj
Copy link
Member

This fixes #12853 by the simple expedient of wrapping the compilation in a try block and making sure that things are cleaned up on exit (julia subprocess is killed and file is closed).

@stevengj stevengj added the compiler:precompilation Precompilation of modules label Aug 28, 2015
@stevengj
Copy link
Member Author

Seems to have some Travis problems... restarting Travis to see if that helps.

@stevengj
Copy link
Member Author

Weird, I'm getting There are newer queued builds for this pull request, failing early. from Travis. Maybe I'll try rebasing and force-pushing to start a new check.

@stevengj
Copy link
Member Author

I'm getting Worker 4 terminated. ERROR (unhandled task failure): EOFError: read end of file From worker 4: * examples on Linux x86_64, which is weird....

@yuyichao
Copy link
Contributor

That's the typical output for OOM error.

@stevengj
Copy link
Member Author

But why would I be having an OOM here? Grrr.

@yuyichao
Copy link
Contributor

Because we add too many tests recently?

@stevengj
Copy link
Member Author

@yuyichao, but then how do I fix it?

@kshyatt
Copy link
Contributor

kshyatt commented Aug 31, 2015

@stevengj Take away my commit access?

@stevengj
Copy link
Member Author

@kshyatt, I think we'd have to do that retroactively...

@kshyatt
Copy link
Contributor

kshyatt commented Aug 31, 2015

More seriously, it might be worth running the test suite through the allocation profiler and see where we're using a lot of memory.

@yuyichao
Copy link
Contributor

More seriously, it might be worth running the test suite through the allocation profiler and see where we're using a lot of memory.

Hint, focusing on llvm as a start.

In my original issue about Travis OOM @amitmurthy brought up the possibility to more aggressively restart worker processes. I'm not really familiar with the parallel part but it might worth doing.

@stevengj
Copy link
Member Author

In the short term, how are people getting PRs through? Should I just keep force-pushing and hope that it randomly works?

@kshyatt
Copy link
Contributor

kshyatt commented Aug 31, 2015

I sign in to Travis and restart the build.

@stevengj
Copy link
Member Author

Restarting the build from Travis yields There are newer queued builds for this pull request, failing early.

@stevengj
Copy link
Member Author

Trying rebase and push -f.

@yuyichao
Copy link
Contributor

@tkelman Is the fact that travis doesn't assign a new build number if you restart a build conflict with some logic in the travis fail fast script?

@stevengj
Copy link
Member Author

Well, the good news is that the OOM killer is predictable. The bad news is that it is predictably (3 times in a row) killing (only) the x86_64-Linux build of this branch.

@stevengj
Copy link
Member Author

Just pushed a commit to this branch that reduces the number of worker processes in runtests.jl, let me see if that works around the problem.

@stevengj
Copy link
Member Author

Oh, that won't help, the .travis.yml file already exports JULIA_CPU_CORES=4

@stevengj
Copy link
Member Author

Okay, reducing the number of workers fixed it, so it seems to definitely be the old OOM error.

Can we merge this, and increase the number of Travis cores later when a better solution to the OOM problem is found?

@stevengj
Copy link
Member Author

stevengj commented Sep 1, 2015

Going to merge this later today if there are no objections.

stevengj added a commit that referenced this pull request Sep 1, 2015
fix #12853: kill detached compilation process on an exception
@stevengj stevengj merged commit 0568edb into JuliaLang:master Sep 1, 2015
@stevengj stevengj deleted the precompile_catch branch September 1, 2015 17:02
@sbromberger
Copy link
Contributor

Thank you all for fixing this issue and working through the Travis failures!

@ScottPJones
Copy link
Contributor

What do I need to do to get my 5 code coverage tests PRs restarted now that this (hopefully) has been fixed? ( #12842 #12880 #12882 #12884 #12898 )

@sjkelly
Copy link
Contributor

sjkelly commented Sep 1, 2015

@ScottPJones I believe a rebase (probably just pull --rebase) and force push (push -f) should add the PR to the Travis queue.

@ScottPJones
Copy link
Contributor

OK, thanks!

@stevengj
Copy link
Member Author

stevengj commented Sep 1, 2015

It's not really fixed, just worked around (temporarily, until our tests get so memory-hungry that even two processes is not enough) at the cost of increased Travis latency. Please take further discussions to #11553.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:precompilation Precompilation of modules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

interrupting precompile with ^C and exiting repl doesn't stop precompilation
6 participants