
Support managers external to K8s clusters #108

Merged
merged 12 commits
Apr 13, 2023
Conversation

omus
Member

@omus omus commented Apr 13, 2023

Replaces #94. Adds support for running the Julia cluster manager outside the cluster that hosts the workers, while retaining support for in-cluster managers.

Julia's Distributed package isn't well suited for the heterogeneous network connections we're using here, so I was forced to extend the Distributed.connect function and modify its behaviour for this particular cluster manager. In particular, manager-to-worker connections need to use the special port-forwarded addresses, while workers can continue to use intra-cluster network connections.
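As a rough pseudocode sketch (simplified, not the PR's exact diff), the override can branch on whether `config.connect_at` is already populated:

```julia
using Distributed

# Simplified sketch of the idea: the first `connect` call for a worker is
# the manager-to-worker connection, which must go through the
# port-forwarded localhost address; once `connect_at` is populated,
# subsequent calls are worker-to-worker and can use in-cluster addresses.
function Distributed.connect(manager::K8sClusterManager, pid::Int, config::WorkerConfig)
    if config.connect_at !== nothing
        # Worker-to-worker setup: plain intra-cluster connection.
        return Distributed.connect_w2w(pid, config)
    end

    # Manager-to-worker: outside the cluster, dial a `kubectl port-forward`
    # localhost endpoint; inside the cluster, use the pod IP directly.
    # (Details elided; see the PR diff for the real implementation.)
end
```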


@omus
Member Author

omus commented Apr 13, 2023

After doing a self-review I think there's something I can do to greatly simplify the code.

Update: there was some room for improvement, but not quite as much as I originally thought. We now no longer use a fixed port for workers, and determining the pod IP address is a much simpler operation.
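For context only (not necessarily what this PR does): a common Kubernetes pattern for determining a pod's IP is the downward API, which can inject `status.podIP` into the container environment. The env var name below is an assumption for illustration.

```julia
# Hypothetical illustration: if the pod spec exposes the pod IP via the
# Kubernetes downward API, e.g.
#
#   env:
#     - name: MY_POD_IP
#       valueFrom:
#         fieldRef:
#           fieldPath: status.podIP
#
# then a worker can read its own pod IP directly from the environment:
pod_ip = get(ENV, "MY_POD_IP", nothing)
```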

@codecov

codecov bot commented Apr 13, 2023

Codecov Report

Merging #108 (aeec2e3) into main (3fc33a8) will increase coverage by 22.58%.
The diff coverage is 79.06%.

@@             Coverage Diff             @@
##             main     #108       +/-   ##
===========================================
+ Coverage   65.85%   88.44%   +22.58%     
===========================================
  Files           4        4               
  Lines         164      199       +35     
===========================================
+ Hits          108      176       +68     
+ Misses         56       23       -33     
Impacted Files Coverage Δ
src/K8sClusterManagers.jl 66.66% <ø> (ø)
src/native_driver.jl 78.94% <79.06%> (+67.28%) ⬆️


Attempted to remove `--bind-to` entirely and discovered that the port
number does not need to be fixed. It turns out that the issue I
originally ran into while experimenting with this is that Julia was
only listening on the external interface while `kubectl port-forward`
forwards from the pod's localhost interface.
@omus omus marked this pull request as ready for review April 13, 2023 17:01
@omus omus requested a review from kleinschmidt April 13, 2023 17:01
Member

@kleinschmidt kleinschmidt left a comment

I think I generally understand what's going on! A few clarification questions, but otherwise I'm good.

test/cluster.jl (outdated review thread, resolved)
bind_addr, port = if !isk8s()
# When the manager is running outside of the K8s cluster we need to establish
# port-forward connections from the manager to the workers.
pf = open(`$(kubectl()) port-forward --address localhost pod/$pod_name :$intra_port`, "r")
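Since `:$intra_port` asks kubectl to pick a random local port, the ephemeral port has to be recovered from the process output. A hedged sketch of one way to do that, assuming kubectl's usual `Forwarding from ...` banner (helper name is hypothetical):

```julia
# Sketch only: `kubectl port-forward ... :9009` prints a banner such as
#   Forwarding from 127.0.0.1:54321 -> 9009
# The ephemeral local port (54321 here) can be parsed from that line.
function parse_forwarded_port(line::AbstractString)
    m = match(r"Forwarding from 127\.0\.0\.1:(\d+)", line)
    m === nothing && error("unexpected kubectl port-forward output: $line")
    return parse(Int, m.captures[1])
end

# Usage against a running port-forward process `pf`:
# local_port = parse_forwarded_port(readline(pf))
```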
Member

the kubectl docs say that this doesn't return. Do we need to somehow keep track of the status of this process and e.g. restart it if it fails?

Member Author

If this process fails then the connection between the manager and that worker is broken, and the worker will be dropped. As the port we assign on localhost is random, we couldn't just restart the same process and have things work.

I will say, though, that since this is the connect function, if Distributed.jl did try to reconnect to the worker then the port-forward process would automatically be recreated. Unfortunately, that's not the world we live in: once this function runs once, the config.connect_at is set and all further connections are deemed worker-to-worker connections (another thing I don't like about the Distributed.jl interface).

Comment on lines +243 to +244
# Retain a reference to the port forward
config.userdata.port_forward[] = pf
Member

AH I see now, that's what this is doing...

Comment on lines +212 to +214
# Stripped down and modified version of:
# https://github.com/JuliaLang/julia/blob/844c20dd63870aa5b369b85038f0523d7d79308a/stdlib/Distributed/src/managers.jl#L567-L632
function Distributed.connect(manager::K8sClusterManager, pid::Int, config::WorkerConfig)
Member

just to check my own understanding, we need to provide a specialized method for this because of the need to set up port forwarding in the case of a local-to-cluster connection?

Member Author

That is correct. Specifically, we could use config.connect_at to specify that the manager connect to the workers using the local port forwarding. However, doing that results in the workers also trying to use those ephemeral addresses, which breaks any worker-to-worker connections.

Member Author

Correct. This is actually supported but it's not well documented.

function Distributed.connect(manager::K8sClusterManager, pid::Int, config::WorkerConfig)
if config.connect_at !== nothing
# this is a worker-to-worker setup call.
return Distributed.connect_w2w(pid, config)
Member

how annoying would it be to add a test for this? need to launch two pods from local manager?

Member Author

need to launch two pods from local manager?

I'm going to delete this line as currently we're never using this. That may seem strange, but by default a worker ends up calling Distributed.connect(::DefaultClusterManager, ...) unless we override init_worker.
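For reference, a comment-only sketch of the worker-side dispatch being described here; the identifiers come from Distributed.jl's internals, and the exact call shown is an illustration, not this package's code:

```julia
using Distributed

# Sketch: Distributed.jl's stock worker entry point initializes the worker
# with DefaultClusterManager, so a custom `Distributed.connect` method for
# K8sClusterManager is never invoked on the worker side unless the worker
# startup code itself registers the custom manager, roughly:
#
#   Distributed.init_worker(cookie, K8sClusterManager(...))
#
# Without such an override, worker-side connections dispatch to
# Distributed.connect(::DefaultClusterManager, ...), which is why the
# `connect_w2w` branch was never being reached in practice.
```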

@omus omus merged commit 1139430 into main Apr 13, 2023
@omus omus deleted the cv/external-manager branch April 13, 2023 20:36