Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Backports for 0.6.2 #24519

Merged
merged 44 commits into from
Nov 29, 2017
Merged

WIP: Backports for 0.6.2 #24519

merged 44 commits into from
Nov 29, 2017

Conversation

ararslan
Copy link
Member

@ararslan ararslan commented Nov 7, 2017

No description provided.

@ararslan ararslan added this to the 0.6.x milestone Nov 7, 2017
@ararslan
Copy link
Member Author

ararslan commented Nov 8, 2017

Looks like the builds are freezing, not sure why yet.

@fredrikekre
Copy link
Member

The following testsets does not have any output on any of the Travis bots: "core", "inference", "strings/basic", "show", "compile", "distributed"

@phaverty
Copy link
Contributor

phaverty commented Nov 8, 2017

Has 943c8e5 been considered for the list? (Perhaps 6.3?) This commit fixes a issue that requires some unfortunate work-arounds in JuliaCall and rjulia.

@ararslan
Copy link
Member Author

ararslan commented Nov 8, 2017

@phaverty, according to Yichao that commit is irrelevant. Did you mean a different one?

@ararslan
Copy link
Member Author

ararslan commented Nov 8, 2017

Looks like core stalls on https://github.com/JuliaLang/julia/blob/aa/backports-0.6.2/test/core.jl#L3286 (or at least that's what was running when I gave it a ctrl+c). Interrupting didn't even stop the process; the Julia process had to be forcibly killed.

@ararslan ararslan force-pushed the aa/backports-0.6.2 branch 2 times, most recently from 3b9ab8d to 0bcb332 Compare November 8, 2017 23:43
@phaverty
Copy link
Contributor

phaverty commented Nov 8, 2017

Wrong commit, sorry for the noise. I meant to reference this:
#23525
which is commit 5cc9898
Thanks for considering this.

@ararslan
Copy link
Member Author

ararslan commented Nov 9, 2017

Now all tests are passing on AppVeyor and Travis OS X, but the socket tests are failing on Travis Linux. Oddly enough, the socket tests pass on Nanosoldier (x86-64 Linux).

@dhoegh
Copy link
Contributor

dhoegh commented Nov 9, 2017

Could #24503 be included as it fixes a severe regression on windows?

@ararslan
Copy link
Member Author

ararslan commented Nov 9, 2017

Yes it will be.

@ararslan
Copy link
Member Author

Does anyone have any ideas about the socket tests on Travis Linux?

@ViralBShah
Copy link
Member

Seems like it should be easy to pull in #24540.

@ararslan
Copy link
Member Author

That one isn't relevant for backporting since FFTW isn't deprecated on 0.6.

@ararslan
Copy link
Member Author

There are a couple of commits marked for backport that I've requested the original authors backport since I wasn't able to determine how to do it correctly, but the current state of the PR contains nearly all of the relevant commits for 0.6.2.

@KristofferC
Copy link
Member

#24530 would be good to have in my opinion, I think it should be good to merge.

@ararslan
Copy link
Member Author

Sure. Just needs to be merged and I can backport it. 😉

@ararslan
Copy link
Member Author

Waiting on a couple more commits but might as well check performance in the meantime.

@nanosoldier runbenchmarks(ALL, vs=":release-0.6")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@ararslan
Copy link
Member Author

Hmmm. @nanosoldier runbenchmarks("array", vs=":release-0.6")

@ararslan
Copy link
Member Author

The SIGABRT on Travis Linux when building the documentation is very concerning but I can't reproduce it.

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@KristofferC
Copy link
Member

Should #24112 be included?

@maleadt
Copy link
Member

maleadt commented Nov 22, 2017

The latest state of this branch now results in additional test failures for CuArrays, CUFFT and CUBLAS. The last two seem slightly unreliable, so @MikeInnes could you look into the CuArrays test failure?

@ararslan
Copy link
Member Author

@amitmurthy Should I revert #21818 in 0.6.2 or is there a better solution?

@amitmurthy
Copy link
Contributor

amitmurthy commented Nov 24, 2017

Not the entirety of #21818. I'll provide a patch here to only revert to again only listen on a port starting from 9009. I would like a confirmation at #24722 if this was indeed the cause though.

@amitmurthy
Copy link
Contributor

amitmurthy commented Nov 24, 2017

Here it is

diff --git a/NEWS.md b/NEWS.md
index 62e2983c8e..13d26cdb4d 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -259,7 +259,8 @@ This section lists changes that do not have deprecation warnings.
     rather than from environment variables ([#19636]).
 
   * Workers now listen on an ephemeral port assigned by the OS. Previously workers would
-    listen on the first free port available from 9009 ([#21818]).
+    listen on the first free port available from 9009 ([#21818]). Version 0.6.1 only.
+    Reverted in 0.6.2
 
 
 Library improvements
diff --git a/base/distributed/cluster.jl b/base/distributed/cluster.jl
index c12f8ed847..1d1f74d7f2 100644
--- a/base/distributed/cluster.jl
+++ b/base/distributed/cluster.jl
@@ -153,7 +153,7 @@ function start_worker(out::IO, cookie::AbstractString)
     init_worker(cookie)
     interface = IPv4(LPROC.bind_addr)
     if LPROC.bind_port == 0
-        (port, sock) = listenany(interface, UInt16(0))
+        (port, sock) = listenany(interface, UInt16(9009))
         LPROC.bind_port = port
     else
         sock = listen(interface, LPROC.bind_port)
diff --git a/doc/src/manual/parallel-computing.md b/doc/src/manual/parallel-computing.md
index 8b3c15dc37..4d2fb1e97d 100644
--- a/doc/src/manual/parallel-computing.md
+++ b/doc/src/manual/parallel-computing.md
@@ -1231,8 +1231,8 @@ as local laptops, departmental clusters, or even the cloud. This section covers
 requirements for the inbuilt `LocalManager` and `SSHManager`:
 
   * The master process does not listen on any port. It only connects out to the workers.
-  * Each worker binds to only one of the local interfaces and listens on an ephemeral port number
-    assigned by the OS.
+  * Each worker binds to only one of the local interfaces and listens on the first free port starting
+    from `9009`.
   * `LocalManager`, used by `addprocs(N)`, by default binds only to the loopback interface. This means
     that workers started later on remote hosts (or by anyone with malicious intentions) are unable
     to connect to the cluster. An `addprocs(4)` followed by an `addprocs(["remote_host"])` will fail.
@@ -1250,9 +1250,8 @@ requirements for the inbuilt `LocalManager` and `SSHManager`:
     authenticated via public key infrastructure (PKI). Authentication credentials can be supplied
     via `sshflags`, for example ```sshflags=`-e <keyfile>` ```.
 
-    In an all-to-all topology (the default), all workers connect to each other via plain TCP sockets.
-    The security policy on the cluster nodes must thus ensure free connectivity between workers for
-    the ephemeral port range (varies by OS).
+    Note that worker-worker connections are still plain TCP and the local security policy on the remote
+    cluster must allow for free connections between worker nodes, at least for ports 9009 and above.
 
     Securing and encrypting all worker-worker traffic (via SSH) or encrypting individual messages
     can be done via a custom ClusterManager.

@ararslan
Copy link
Member Author

Great, thanks @amitmurthy!

@MikeInnes, any word on CuArrays? I don't have the right system to be able to test myself so I'll need some help ensuring that this PR isn't breaking for the GPU ecosystem.

@ararslan
Copy link
Member Author

@amitmurthy It looks like the tests may also have to be adjusted to accommodate that change?

@ararslan
Copy link
Member Author

I think I figured out how to fix the test: revert the changes made to test/socket.jl in #21818.

@amitmurthy
Copy link
Contributor

Hmm, no that should be a problem. What is the error you are seeing?

@ararslan
Copy link
Member Author

ararslan commented Nov 25, 2017

I think it timed out listening for a connection on port 0 rather than 9009. It was in the previous Travis macOS log, which has since been overwritten.

@ararslan
Copy link
Member Author

Looks like my fix did not fix it: https://travis-ci.org/JuliaLang/julia/jobs/306991362#L5678

@amitmurthy
Copy link
Contributor

The socket tests need to be left as is. It tests the bugfix in #21818 for listenany() with port 0.

Let me build and test this branch locally. Will take a while to build all dependencies.

@ararslan
Copy link
Member Author

Sounds good, thanks. In the meantime I've reverted the change to the tests.

@amitmurthy
Copy link
Contributor

amitmurthy commented Nov 25, 2017

Changing listenany(0) -> listenany(9009) also caused a problem with client port reuse on OSX. I have removed support for client port reuse on OSX on this branch. Still enabled for Linux.

Pushed directly to this branch.

@ViralBShah
Copy link
Member

@ararslan Can you ask @Keno for access to a GPU machine we have at JC?

@amitmurthy
Copy link
Contributor

amitmurthy commented Nov 25, 2017 via email

@MikeInnes
Copy link
Member

CuArrays passes tests for me, although with quite a few warnings about invalid debug info being generated.

I don't know if it's just a local issue, but on my usual GPU machine I can't compile this branch (it's asking me what files I want to apply some LLVM patches to).

@maleadt
Copy link
Member

maleadt commented Nov 27, 2017

CuArrays passes tests for me, although with quite a few warnings about invalid debug info being generated.

That's a change to CUDAnative which'll disappear when the version string matches 0.6.2.

I don't know if it's just a local issue, but on my usual GPU machine I can't compile this branch (it's asking me what files I want to apply some LLVM patches to).

Probably a dirty LLVM source tree from moving between eg. master and this branch, wiping deps/srccache/llvm-3.9.1 should fix that.

Partially reverts the backport of #21818 in 0.6.1. Fixes #24722.

Revert client_socket_reuse
@amitmurthy
Copy link
Contributor

@ararslan , the run was successful on Anubis. However, taking a conservative approach I reverted to both 1) listening on 9009 onwards and 2) disabling client socket reuse on all platforms.

@ararslan
Copy link
Member Author

Might as well @nanosoldier runbenchmarks(ALL, vs=":release-0.6")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@ararslan
Copy link
Member Author

Latest PkgEval results: https://gist.github.com/ararslan/0b1112ea62d9f37ac10aac6c4b624572

Package Related
DiffEqParamEstim No (Can't reproduce)
IdentityRanges Yes (Expected)
NIfTI No
Sched No (Can't reproduce)
WignerSymbols No (Can't reproduce)

Looks like this branch is now non-breaking.

@ararslan
Copy link
Member Author

Unless there are any objections or someone beats me to it, I'm going to merge this tomorrow.

@ararslan ararslan merged commit 0e505c2 into release-0.6 Nov 29, 2017
@ararslan ararslan deleted the aa/backports-0.6.2 branch November 29, 2017 19:38
@ViralBShah
Copy link
Member

That was quite some patient work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.