
Support managers external to K8s clusters #108

Merged
merged 12 commits
Apr 13, 2023
Conversation

omus
Member

@omus omus commented Apr 13, 2023

Replaces #94. Adds support for running the Julia cluster manager outside the cluster that hosts the workers, while retaining support for in-cluster managers.

Julia's Distributed package isn't well suited for the heterogeneous network connections we're using here, so I was forced to extend the Distributed.connect function and modify its behaviour for this particular cluster manager. In particular, manager-to-worker connections need to use the special port-forwarded addresses, while workers can continue to use intra-cluster network connections.
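As a rough pseudocode sketch (simplified, not the PR's exact diff), the override can branch on whether `config.connect_at` is already populated:

```julia
using Distributed

# Simplified sketch of the idea: the first `connect` call for a worker is
# the manager-to-worker connection, which must go through the
# port-forwarded localhost address; once `connect_at` is populated,
# subsequent calls are worker-to-worker and can use in-cluster addresses.
function Distributed.connect(manager::K8sClusterManager, pid::Int, config::WorkerConfig)
    if config.connect_at !== nothing
        # Worker-to-worker setup: plain intra-cluster connection.
        return Distributed.connect_w2w(pid, config)
    end

    # Manager-to-worker: outside the cluster, dial a `kubectl port-forward`
    # localhost endpoint; inside the cluster, use the pod IP directly.
    # (Details elided; see the PR diff for the real implementation.)
end
```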


@omus
Member Author

omus commented Apr 13, 2023

After doing a self-review I think there's something I can do to greatly simplify the code.

Update: there was some room for improvement, but not quite as much as I originally thought. We now no longer use a fixed port for workers, and determining the pod IP address is a much simpler operation.
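For context only (not necessarily what this PR does): a common Kubernetes pattern for determining a pod's IP is the downward API, which can inject `status.podIP` into the container environment. The env var name below is an assumption for illustration.

```julia
# Hypothetical illustration: if the pod spec exposes the pod IP via the
# Kubernetes downward API, e.g.
#
#   env:
#     - name: MY_POD_IP
#       valueFrom:
#         fieldRef:
#           fieldPath: status.podIP
#
# then a worker can read its own pod IP directly from the environment:
pod_ip = get(ENV, "MY_POD_IP", nothing)
```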

@codecov

codecov bot commented Apr 13, 2023

Codecov Report

Merging #108 (aeec2e3) into main (3fc33a8) will increase coverage by 22.58%.
The diff coverage is 79.06%.

@@             Coverage Diff             @@
##             main     #108       +/-   ##
===========================================
+ Coverage   65.85%   88.44%   +22.58%     
===========================================
  Files           4        4               
  Lines         164      199       +35     
===========================================
+ Hits          108      176       +68     
+ Misses         56       23       -33     
Impacted Files Coverage Δ
src/K8sClusterManagers.jl 66.66% <ø> (ø)
src/native_driver.jl 78.94% <79.06%> (+67.28%) ⬆️


Attempted to remove `--bind-to` entirely and discovered that the port
number does not need to be fixed. It turns out that the issue I
originally ran into while experimenting with this is that Julia was
only listening on the external interface while `kubectl port-forward`
forwards from the pod's localhost interface.
@omus omus marked this pull request as ready for review April 13, 2023 17:01
@omus omus requested a review from kleinschmidt April 13, 2023 17:01
Member

@kleinschmidt kleinschmidt left a comment

I think I generally understand what's going on! A few clarification questions, but otherwise I'm good.

test/cluster.jl (outdated review thread, resolved)
bind_addr, port = if !isk8s()
# When the manager is running outside of the K8s cluster we need to establish
# port-forward connections from the manager to the workers.
pf = open(`$(kubectl()) port-forward --address localhost pod/$pod_name :$intra_port`, "r")
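Since `:$intra_port` asks kubectl to pick a random local port, the ephemeral port has to be recovered from the process output. A hedged sketch of one way to do that, assuming kubectl's usual `Forwarding from ...` banner (helper name is hypothetical):

```julia
# Sketch only: `kubectl port-forward ... :9009` prints a banner such as
#   Forwarding from 127.0.0.1:54321 -> 9009
# The ephemeral local port (54321 here) can be parsed from that line.
function parse_forwarded_port(line::AbstractString)
    m = match(r"Forwarding from 127\.0\.0\.1:(\d+)", line)
    m === nothing && error("unexpected kubectl port-forward output: $line")
    return parse(Int, m.captures[1])
end

# Usage against a running port-forward process `pf`:
# local_port = parse_forwarded_port(readline(pf))
```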
Member

the kubectl docs say that this doesn't return. Do we need to somehow keep track of the status of this process and e.g. restart it if it fails?

Member Author

If this process fails then the connection between the manager and that worker is broken, and the worker will be dropped. As the port we assign on localhost is random, we couldn't just restart the same process and have things work.

I will say, though, that since this is the connect function, if Distributed.jl did try to reconnect to the worker then the port-forward process would automatically be recreated. Unfortunately, that's not the world we live in: once this function runs once, the config.connect_at is set and all further connections are deemed worker-to-worker connections (another thing I don't like about the Distributed.jl interface).

Comment on lines +243 to +244
# Retain a reference to the port forward
config.userdata.port_forward[] = pf
Member

AH I see now, that's what this is doing...

Comment on lines +212 to +214
# Stripped down and modified version of:
# https://github.com/JuliaLang/julia/blob/844c20dd63870aa5b369b85038f0523d7d79308a/stdlib/Distributed/src/managers.jl#L567-L632
function Distributed.connect(manager::K8sClusterManager, pid::Int, config::WorkerConfig)
Member

just to check my own understanding, we need to provide a specialized method for this because of the need to set up port forwarding in the case of a local-to-cluster connection?

Member Author

That is correct. Specifically, we could use config.connect_at to specify that the manager connect to the workers using the local port forwarding. However, doing that results in the workers also trying to use those ephemeral addresses, which breaks any worker-to-worker connections.

Member Author

Correct. This is actually supported but it's not well documented.

function Distributed.connect(manager::K8sClusterManager, pid::Int, config::WorkerConfig)
if config.connect_at !== nothing
# this is a worker-to-worker setup call.
return Distributed.connect_w2w(pid, config)
Member

how annoying would it be to add a test for this? need to launch two pods from local manager?

Member Author

need to launch two pods from local manager?

I'm going to delete this line as currently we're never using this. That may seem strange, but by default a worker ends up calling Distributed.connect(::DefaultClusterManager, ...) unless we override init_worker.
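For reference, a comment-only sketch of the worker-side dispatch being described here; the identifiers come from Distributed.jl's internals, and the exact call shown is an illustration, not this package's code:

```julia
using Distributed

# Sketch: Distributed.jl's stock worker entry point initializes the worker
# with DefaultClusterManager, so a custom `Distributed.connect` method for
# K8sClusterManager is never invoked on the worker side unless the worker
# startup code itself registers the custom manager, roughly:
#
#   Distributed.init_worker(cookie, K8sClusterManager(...))
#
# Without such an override, worker-side connections dispatch to
# Distributed.connect(::DefaultClusterManager, ...), which is why the
# `connect_w2w` branch was never being reached in practice.
```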

@omus omus merged commit 1139430 into main Apr 13, 2023
@omus omus deleted the cv/external-manager branch April 13, 2023 20:36