MTU change doesn't cascade to containers #848

Closed

FalconerTC opened this issue Sep 27, 2021 · 19 comments

@FalconerTC

FalconerTC commented Sep 27, 2021

Describe the bug
I run actions-runner-controller on GKE, which uses a default MTU of 1460. I've been investigating outbound connections (npm install, apk add, etc.) that have been freezing periodically. I saw the following (#385) and set

dockerMTU: 1460

for my runners, and confirmed the change:

runner@ubuntu-20:/$ ifconfig | grep mtu
br-d788e7849cda: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1460
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1460
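
(As a quick sanity check — just a sketch, assuming the busybox image is pullable — a container on the default bridge should pick up the daemon MTU:)

docker run --rm busybox:1.28 ip link show eth0
# expected to report mtu 1460 once daemon.json / dockerMTU is set to 1460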

But I continue to see inconsistent outbound activity in workflows that use container (https://docs.github.com/en/actions/learn-github-actions/workflow-syntax-for-github-actions#jobsjob_idcontainer). It seems the MTU change isn't being applied to these containers:

runner@ubuntu-20:/$ docker exec -it 88fef7adee1a4e198f47864a7acfe454_redactedname_84e321 bash

root@72de5b32b2d5:/__w/workdir/workdir# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
6: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:12:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.2/16 brd 172.18.255.255 scope global eth0
       valid_lft forever preferred_lft forever

Is there a way to apply a custom MTU to workflows using container?

Checks

  • My actions-runner-controller version (v0.x.y) does support the feature
  • I'm using an unreleased version of the controller I built from HEAD of the default branch
@mumoshu
Collaborator

mumoshu commented Sep 27, 2021

@FalconerTC DockerMTU is propagated to startup.sh, which basically sets an envvar in a supervisord config so that dockerd hopefully reads it on launch and respects the MTU.

https://github.com/actions-runner-controller/actions-runner-controller/blob/30ab0c0b7118d9ae59c186707abc69cb71608574/runner/startup.sh#L20-L34
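
(In effect the goal there is to give dockerd a default MTU, which — as far as I know — can be done in one of two ways; a sketch:)

# Option 1: the daemon config file, /etc/docker/daemon.json with { "mtu": 1460 }
# Option 2: the equivalent dockerd flag
dockerd --mtu=1460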

Probably it isn't working as expected, or there's a different setting for another use-case? 🤔
Do you by any chance know what would be an alternative approach to give dockerd a default MTU?

@FalconerTC
Author

Do you by any chance know what would be an alternative approach to give dockerd a default MTU?

I don't. I'm mostly learning about docker MTU as part of investigating this. I'm currently running with

      dockerdWithinRunnerContainer: true
      privileged: true
      dockerMTU: 1460

and can confirm those values are being set:

runner@ubuntu-20:/$ cat /etc/docker/daemon.json
{
  "mtu": 1460
}
runner@ubuntu-20:/$ cat /etc/supervisor/conf.d/dockerd.conf
[program:dockerd]
command=/usr/local/bin/dockerd
autostart=true
autorestart=true
stderr_logfile=/var/log/dockerd.err.log
stdout_logfile=/var/log/dockerd.out.log
environment=DOCKERD_ROOTLESS_ROOTLESSKIT_MTU=1460

But eth0 inside the job container still doesn't pick it up:

runner@ubuntu-20:/$ docker exec -it d8e9f138c8e04e918aa94ceadd6e3478_artifactoryrtrclouddockernodechromelatest_325f90 bash
root@39bb70a70f26:/__w/yams/yams# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
6: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:12:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.2/16 brd 172.18.255.255 scope global eth0
       valid_lft forever preferred_lft forever
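
(One thing worth checking at this point — a sketch — is which Docker network that container is attached to, since only the default bridge inherits the daemon-level MTU:)

# replace <container> with the job container's name or ID
docker inspect <container> --format '{{json .NetworkSettings.Networks}}'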

Not sure how cap_add NET_ADMIN factors in when running as privileged?

@shadiramadan

@FalconerTC We have a ticket open with Google and they seem to be working to roll back a change that might have caused issues.

Our engineering team identified a fix and a rollback is currently in progress.

They previously mentioned

Status:
Confirmed

Description:
We have identified a Networking connecting issue impacting the GKE Docker workload

How to Diagnose:
Some customers may be experiencing a connection failure in Docker workflow to Fastly destinations and may receive a timeout error.

Workaround:

Support has advised impacted customers to try adding an init container manifest into their docker in docker deployment, this ensures packets are sent with a proper MTU that will work with Fastly destinations.

@FalconerTC
Author

Oh wow. Thanks for the info here @shadiramadan. This has been driving me crazy all week.

@shadiramadan

@FalconerTC The fix might be out, or maybe it's the workaround... but I finally got a build to work after adding:

      initContainers:
      - name: dummy
        image: busybox:1.28
        command: ['sh', '-c', 'echo Google blackbox fix?! && sleep 1']

to my runner deployment.

@shadiramadan

Only for it to fail after another run.... nvm lol

@mumoshu
Collaborator

mumoshu commented Sep 27, 2021

If that GKE-specific workaround didn't work, try this workaround:

actions/runner#775 (comment)

And also read this for more context:

moby/moby#34981 (comment)

Perhaps actions/runner is using a custom Docker network, and a custom Docker network doesn't respect the dockerd-wide MTU setting.
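
(In other words, a user-defined network only gets a non-default MTU if it's passed explicitly when the network is created — a sketch, with an arbitrary network name:)

# the daemon-level "mtu" does not carry over to user-defined networks;
# it has to be passed as a driver option at creation time
docker network create --opt com.docker.network.driver.mtu=1460 example_net
docker run --rm --network example_net busybox:1.28 ip link show eth0   # reports mtu 1460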

BTW, could you share your workflow definition? I think another contributor added DockerMTU and the way we currently configure dockerd. I believe it works in certain cases and not in others. I'm currently wondering if this might fail only when you're using service containers in your workflow.

@FalconerTC
Author

FalconerTC commented Sep 27, 2021

As to the MTU issue: it seems that GHA creating its own Docker network is causing the config to be ignored. GitHub Actions runs the following for container jobs:

/usr/local/bin/docker network prune --force --filter "label=60e226"
/usr/local/bin/docker network create --label 60e226 github_network_4be02ffe394242dd99bcd2211b2b55e6
...
/usr/local/bin/docker create --name 990f38cac84f46d38ac4c31d05619374_artifactoryrtrclouddockernodechromelatest_79e14d --label 60e226 --workdir /__w/yams/yams --network github_network_4be02ffe394242dd99bcd2211b2b55e6 ...

I ran the same command myself without specifying the network and confirmed that the MTU does get set then:

runner@ubuntu-20:/$ /usr/local/bin/docker create --name test6 --label 60e226 --workdir /__w/yams/yams -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/runner/_work":"/__w" -v "/runner/externals":"/__e":ro -v "/runner/_work/_temp":"/__w/_temp" -v "/runner/_work/_actions":"/__w/_actions" -v "/opt/hostedtoolcache":"/__t" -v "/runner/_work/_temp/_github_home":"/github/home" -v "/runner/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" <IMAGE> "-f" "/dev/null" bash
d307204ecc5622ec1e26723ac8347388a0fdd9c7ec148431ecaeb4550dd447a4
runner@ubuntu-20:/$ docker start test6
test6
runner@ubuntu-20:/$ docker ps
CONTAINER ID        IMAGE                                             COMMAND                  CREATED             STATUS              PORTS               NAMES
d307204ecc56        artifactory.rtr.cloud/docker/node-chrome:latest   "tail -f /dev/null b…"   6 seconds ago       Up 1 second                             test6
99f57b6d8d98        artifactory.rtr.cloud/docker/node-chrome:latest   "tail -f /dev/null"      3 minutes ago       Up 2 minutes                            test
8fd0260582b5        artifactory.rtr.cloud/docker/node-chrome:latest   "tail -f /dev/null"      7 minutes ago       Up 7 minutes                            990f38cac84f46d38ac4c31d05619374_artifactoryrtrclouddockernodechromelatest_79e14d
runner@ubuntu-20:/$ docker exec -it test6 bash
root@d307204ecc56:/__w/yams/yams# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
20: eth0@if21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

edit: The workflow contains the following:

jobs:

  eslint:
    name: Lint
    runs-on: [self-hosted, default]
    container: 
      image: artifactory.rtr.cloud/docker/node-chrome:latest
      credentials:
        username: ${{ secrets.USER }}
        password: ${{ secrets.PASS }}
    
    steps:
      - name: Checkout Repo
        uses: actions/checkout@v2
       
      - name: Get Node Version
        uses: skjnldsv/read-package-engines-version-actions@v1
        id: node

      - name: Setup node 
        uses: actions/setup-node@v2
        with:
          node-version: ${{ steps.node.outputs.nodeVersion }}
          cache: 'npm'

The current behavior is that the Setup node step usually hangs indefinitely or fails the connection immediately.
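
(One way to confirm the runner-created network is missing the MTU option — a sketch; the network name is just the one from the log excerpt above, substitute the current one:)

docker network inspect github_network_4be02ffe394242dd99bcd2211b2b55e6 --format '{{json .Options}}'
# a network created without --opt com.docker.network.driver.mtu shows no MTU entry here,
# and containers attached to it fall back to mtu 1500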

@kszymans

kszymans commented Sep 28, 2021

I experienced the same issue yesterday and ended up recreating the whole VPC and GKE cluster in order to align the MTU to 1500.

@FalconerTC
Author

FalconerTC commented Sep 28, 2021

I'm successfully using a workaround based on the runner issue posted above. My modified docker wrapper script is:

#!/usr/bin/env bash

# Inspired by https://github.com/actions/runner/issues/775#issuecomment-927826684

if [[ "$1" = "network" ]] && [[ "$2" = "create" ]]; then
    shift 2  # drop the "network create" arguments

    # Read the daemon-wide MTU (falling back to 1500) and pass it as a driver
    # option, since user-defined networks ignore the daemon-level setting.
    MTU=$(jq -r '.mtu // 1500' /etc/docker/daemon.json 2>/dev/null)
    if [[ -n "$MTU" ]]; then
        /usr/local/bin/docker.bin network create --opt com.docker.network.driver.mtu="$MTU" "${@}"
    else
        /usr/local/bin/docker.bin network create "${@}"
    fi
else
    # just call docker as normal if not "network create"
    /usr/local/bin/docker.bin "${@}"
fi

Waiting to hear back on a Google support ticket for more info on what really went wrong here. Given the limitations of docker network create, I don't think there's anything else actions-runner-controller can do here.
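
(For anyone copying this: the wrapper assumes the real CLI has been moved aside, e.g. when building a custom runner image — a sketch; the paths and the docker-wrapper.sh filename are assumptions:)

# in the runner image build, shadow the real CLI with the wrapper
mv /usr/local/bin/docker /usr/local/bin/docker.bin    # real CLI the wrapper calls
cp docker-wrapper.sh /usr/local/bin/docker            # the script above
chmod +x /usr/local/bin/docker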

@shadiramadan

kubernetes/test-infra#23741 (comment)

Current ETA from Google on a fix is Thursday

@ciriarte

ciriarte commented Sep 28, 2021

@FalconerTC after the workaround: are you seeing annotations like these in your workflows? (Check your post actions)

Error response from daemon: error while removing network: network github_network_6eaf9ee1c81649408247d6af77329798 id 8a6aa8897caa5fcb6dc65e2b2127512c88d7299994779632fbc4663532467a9c has active endpoints
Warning: Docker network rm failed with exit code 1

@FalconerTC
Author

I'm not, no. Haven't seen any issues with it today. Here's output from a Stop Containers task on the workflow I linked earlier

Stop and remove container: 2e835be9889146d38677c36a2f7c09e1_artifactoryrtrclouddockernodechromelatest_26410e
/usr/local/bin/docker rm --force bae0166a551ea03d6a2ab130e934c2665292332cfc1f8a0133db487106716f33
bae0166a551ea03d6a2ab130e934c2665292332cfc1f8a0133db487106716f33
Remove container network: github_network_4dc6c7707c5e4a6ab795ac422f3c2395
/usr/local/bin/docker network rm github_network_4dc6c7707c5e4a6ab795ac422f3c2395
github_network_4dc6c7707c5e4a6ab795ac422f3c2395

@ciriarte

I want to confirm the workaround also works for me.

@FalconerTC
Author

GKE has resolved the ticket on their side and there's a usable workaround here for any other MTU issues. Closing this.

@mumoshu
Collaborator

mumoshu commented Oct 4, 2021

@callum-tait-pbx I think this is worth covering in our documentation, as the MTU issue itself is not GKE-specific.

@NicklasWallgren

I'm not too keen on building my own custom Docker image. Can't we include the workaround from #848 (comment) in the official image?

There are at least 6 different issues reported on this: https://github.com/actions-runner-controller/actions-runner-controller/issues?q=label%3A%22docker+mtu+issue%22+is%3Aclosed

@mumoshu
Collaborator

mumoshu commented Feb 2, 2022

I'm not very keen to do that either, as I consider this something that needs to be fixed on the GitHub / actions/runner side, don't you think? 🤔

@NicklasWallgren

@mumoshu You're absolutely right, my mistake.
