-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support managers external to K8s clusters #108
Conversation
After doing a self-review I think there's something I can do to greatly simplify the code Update: There was some room for improvement but not quite as much as I originally thought. We now no longer use a fixed port for workers and determining the pod IP address is a much simpler operation. |
Codecov Report
@@ Coverage Diff @@
## main #108 +/- ##
===========================================
+ Coverage 65.85% 88.44% +22.58%
===========================================
Files 4 4
Lines 164 199 +35
===========================================
+ Hits 108 176 +68
+ Misses 56 23 -33
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Attempted to remove `--bind-to` entirely and discovered that the port number does not need to be fixed. It turns out that the issue I originally ran into while experimenting with this is that the Julia was only listening to the external interface while `kubectl port-forward` tries is forwarding from the pod's localhost interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I generally understand whats going on! a few clarification questions but otherwise I'm good.
bind_addr, port = if !isk8s() | ||
# When the manager running outside of the K8s cluster we need to establish | ||
# port-forward connections from the manager to the workers. | ||
pf = open(`$(kubectl()) port-forward --address localhost pod/$pod_name :$intra_port`, "r") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the kubectl docs say that this doesn't return. Do we need to somehow keep track of the status of this process and i.e. restart it if it fails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this process fails then the connection between the manager is broken and that worker will be dropped. As the port we assign on the localhost is random we couldn't just restart the same process and have things work.
I will say though as this is the connect
function if Distributed.jl did try to reconnect to the worker then this would automatically result in the port forward process becoming recreated. Unfortunately, that's not the world we live in as once this function runs once then the config.connect_at
is set and all further connections are deemed worker-to-worker connections (another thing I don't like about the Distributed.jl interface)
# Retain a reference to the port forward | ||
config.userdata.port_forward[] = pf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AH I see now, that's what this is doing...
# Stripped down and modified version of: | ||
# https://github.com/JuliaLang/julia/blob/844c20dd63870aa5b369b85038f0523d7d79308a/stdlib/Distributed/src/managers.jl#L567-L632 | ||
function Distributed.connect(manager::K8sClusterManager, pid::Int, config::WorkerConfig) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just to check my own understanding, we need to provide a specialized method for this because of the need to setup port forwarding in the case of a local-to-cluster connection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct. Specifically, we could use config.connect_at
to specify that the manager connect to the workers using the local port forwarding. However, doing that results in the workers also trying to use those ephemeral addresses which fails any worker-to-worker connections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. This is actually supported but it's not well documented.
src/native_driver.jl
Outdated
function Distributed.connect(manager::K8sClusterManager, pid::Int, config::WorkerConfig) | ||
if config.connect_at !== nothing | ||
# this is a worker-to-worker setup call. | ||
return Distributed.connect_w2w(pid, config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hwo annoying would it be to add a test for this? need to launch two pods from local manager?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to launch two pods from local manager?
I'm going to delete this line as currently we're never using this. That may seem strange but by default a worker ends up calling Distributed.connect(::DefaultClusterManager, ...)
unless we overwrite init_worker
.
Replaces #94. Adds support for having the Julia cluster manager running externally from the workers running inside a cluster while retaining support for in-cluster managers.
Julia's Distributed package isn't well suited for the heterogeneous network connections we're using here so I was forced to extend the
Distributed.connect
function and modify it's behaviour for this particular cluster manager. In particular manager-to-worker connections need to use the special port-forwarded addresses while workers can continue to use intra cluster network connections.