MTU change doesn't cascade to containers #848

Closed

FalconerTC opened this issue Sep 27, 2021 · 19 comments

@FalconerTC

FalconerTC commented Sep 27, 2021

Describe the bug
I run actions-runner-controller on GKE, which uses a default MTU of 1460. I've been investigating outbound connections (npm install, apk add, etc.) that have been freezing periodically. I saw the following (#385) and set

dockerMTU: 1460

for my runners, and confirmed the change:

runner@ubuntu-20:/$ ifconfig | grep mtu
br-d788e7849cda: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
docker0: flags=4099<UP,BROADCAST,MULTICAST>  mtu 1460
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1460
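
(As a quick sanity check — just a sketch, assuming the busybox image is pullable — a container on the default bridge should pick up the daemon MTU:)

docker run --rm busybox:1.28 ip link show eth0
# expected to report mtu 1460 once daemon.json / dockerMTU is set to 1460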

But I continue to see inconsistent outbound activity in workflows that use container (https://docs.github.com/en/actions/learn-github-actions/workflow-syntax-for-github-actions#jobsjob_idcontainer). It seems the MTU change isn't being applied to these containers:

runner@ubuntu-20:/$ docker exec -it 88fef7adee1a4e198f47864a7acfe454_redactedname_84e321 bash

root@72de5b32b2d5:/__w/workdir/workdir# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
6: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:12:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.2/16 brd 172.18.255.255 scope global eth0
       valid_lft forever preferred_lft forever

Is there a way to apply a custom MTU to workflows using container?

Checks

  • My actions-runner-controller version (v0.x.y) does support the feature
  • I'm using an unreleased version of the controller I built from HEAD of the default branch
@mumoshu
Collaborator

mumoshu commented Sep 27, 2021

@FalconerTC DockerMTU is propagated to startup.sh, which basically sets an envvar in a supervisord config so that dockerd hopefully reads it on launch and respects the MTU.

https://github.com/actions-runner-controller/actions-runner-controller/blob/30ab0c0b7118d9ae59c186707abc69cb71608574/runner/startup.sh#L20-L34
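
(In effect the goal there is to give dockerd a default MTU, which — as far as I know — can be done in one of two ways; a sketch:)

# Option 1: the daemon config file, /etc/docker/daemon.json with { "mtu": 1460 }
# Option 2: the equivalent dockerd flag
dockerd --mtu=1460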

Probably it isn't working as expected, or there's a different setting for another use-case? 🤔
Do you by any chance know what would be an alternative approach to give dockerd a default MTU?

@FalconerTC
Author

Do you by any chance know what would be an alternative approach to give dockerd a default MTU?

I don't. I'm mostly learning about docker MTU as part of investigating this. I'm currently running with

      dockerdWithinRunnerContainer: true
      privileged: true
      dockerMTU: 1460

and can confirm those values are being set:

runner@ubuntu-20:/$ cat /etc/docker/daemon.json
{
  "mtu": 1460
}
runner@ubuntu-20:/$ cat /etc/supervisor/conf.d/dockerd.conf
[program:dockerd]
command=/usr/local/bin/dockerd
autostart=true
autorestart=true
stderr_logfile=/var/log/dockerd.err.log
stdout_logfile=/var/log/dockerd.out.log
environment=DOCKERD_ROOTLESS_ROOTLESSKIT_MTU=1460

But eth0 inside the job container still doesn't pick it up:

runner@ubuntu-20:/$ docker exec -it d8e9f138c8e04e918aa94ceadd6e3478_artifactoryrtrclouddockernodechromelatest_325f90 bash
root@39bb70a70f26:/__w/yams/yams# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
6: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether 02:42:ac:12:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.18.0.2/16 brd 172.18.255.255 scope global eth0
       valid_lft forever preferred_lft forever
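
(One thing worth checking at this point — a sketch — is which Docker network that container is attached to, since only the default bridge inherits the daemon-level MTU:)

# replace <container> with the job container's name or ID
docker inspect <container> --format '{{json .NetworkSettings.Networks}}'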

Not sure how cap_add NET_ADMIN factors in when running as privileged?

@shadiramadan

@FalconerTC We have a ticket open with Google and they seem to be working to roll back a change that might have caused issues.

Our engineering team identified a fix and a rollback is currently in progress.

They previously mentioned

Status:
Confirmed

Description:
We have identified a Networking connecting issue impacting the GKE Docker workload

How to Diagnose:
Some customers may be experiencing a connection failure in Docker workflow to Fastly destinations and may receive a timeout error.

Workaround:

Support has advised impacted customers to try adding an init container manifest into their docker in docker deployment, this ensures packets are sent with a proper MTU that will work with Fastly destinations.

@FalconerTC
Author

Oh wow. Thanks for the info here @shadiramadan. This has been driving me crazy all week.

@shadiramadan

@FalconerTC The fix might be out, or maybe it's the workaround... but I finally got a build to work after adding:

      initContainers:
      - name: dummy
        image: busybox:1.28
        command: ['sh', '-c', 'echo Google blackbox fix?! && sleep 1']

to my runner deployment.

@shadiramadan

Only for it to fail after another run.... nvm lol

@mumoshu
Collaborator

mumoshu commented Sep 27, 2021

If that GKE-specific workaround didn't work, try this workaround:

actions/runner#775 (comment)

And also read this for more context:

moby/moby#34981 (comment)

Perhaps actions/runner is using a custom Docker network, and a custom Docker network doesn't respect the dockerd-wide MTU setting.
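
(In other words, a user-defined network only gets a non-default MTU if it's passed explicitly when the network is created — a sketch, with an arbitrary network name:)

# the daemon-level "mtu" does not carry over to user-defined networks;
# it has to be passed as a driver option at creation time
docker network create --opt com.docker.network.driver.mtu=1460 example_net
docker run --rm --network example_net busybox:1.28 ip link show eth0   # reports mtu 1460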

BTW, could you share your workflow definition? I think another contributor added DockerMTU and the way we currently configure dockerd. I believe it works in certain cases and not in others. I'm currently wondering if this might fail only when you're using service containers in your workflow.

@FalconerTC
Author

FalconerTC commented Sep 27, 2021

As to the MTU issue: it seems that GHA creating its own Docker network is causing the config to be ignored. GitHub Actions runs the following for container jobs:

/usr/local/bin/docker network prune --force --filter "label=60e226"
/usr/local/bin/docker network create --label 60e226 github_network_4be02ffe394242dd99bcd2211b2b55e6
...
/usr/local/bin/docker create --name 990f38cac84f46d38ac4c31d05619374_artifactoryrtrclouddockernodechromelatest_79e14d --label 60e226 --workdir /__w/yams/yams --network github_network_4be02ffe394242dd99bcd2211b2b55e6 ...

I ran the same command myself without specifying the network and confirmed that the MTU does get set then:

runner@ubuntu-20:/$ /usr/local/bin/docker create --name test6 --label 60e226 --workdir /__w/yams/yams -e "HOME=/github/home" -e GITHUB_ACTIONS=true -e CI=true -v "/var/run/docker.sock":"/var/run/docker.sock" -v "/runner/_work":"/__w" -v "/runner/externals":"/__e":ro -v "/runner/_work/_temp":"/__w/_temp" -v "/runner/_work/_actions":"/__w/_actions" -v "/opt/hostedtoolcache":"/__t" -v "/runner/_work/_temp/_github_home":"/github/home" -v "/runner/_work/_temp/_github_workflow":"/github/workflow" --entrypoint "tail" <IMAGE> "-f" "/dev/null" bash
d307204ecc5622ec1e26723ac8347388a0fdd9c7ec148431ecaeb4550dd447a4
runner@ubuntu-20:/$ docker start test6
test6
runner@ubuntu-20:/$ docker ps
CONTAINER ID        IMAGE                                             COMMAND                  CREATED             STATUS              PORTS               NAMES
d307204ecc56        artifactory.rtr.cloud/docker/node-chrome:latest   "tail -f /dev/null b…"   6 seconds ago       Up 1 second                             test6
99f57b6d8d98        artifactory.rtr.cloud/docker/node-chrome:latest   "tail -f /dev/null"      3 minutes ago       Up 2 minutes                            test
8fd0260582b5        artifactory.rtr.cloud/docker/node-chrome:latest   "tail -f /dev/null"      7 minutes ago       Up 7 minutes                            990f38cac84f46d38ac4c31d05619374_artifactoryrtrclouddockernodechromelatest_79e14d
runner@ubuntu-20:/$ docker exec -it test6 bash
root@d307204ecc56:/__w/yams/yams# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
20: eth0@if21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1460 qdisc noqueue state UP group default
    link/ether 02:42:ac:11:00:02 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.2/16 brd 172.17.255.255 scope global eth0
       valid_lft forever preferred_lft forever

edit: The workflow contains the following:

jobs:

  eslint:
    name: Lint
    runs-on: [self-hosted, default]
    container: 
      image: artifactory.rtr.cloud/docker/node-chrome:latest
      credentials:
        username: ${{ secrets.USER }}
        password: ${{ secrets.PASS }}
    
    steps:
      - name: Checkout Repo
        uses: actions/checkout@v2
       
      - name: Get Node Version
        uses: skjnldsv/read-package-engines-version-actions@v1
        id: node

      - name: Setup node 
        uses: actions/setup-node@v2
        with:
          node-version: ${{ steps.node.outputs.nodeVersion }}
          cache: 'npm'

The current behavior is that the Setup node step usually hangs indefinitely or fails the connection immediately.
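
(One way to confirm the runner-created network is missing the MTU option — a sketch; the network name is just the one from the log excerpt above, substitute the current one:)

docker network inspect github_network_4be02ffe394242dd99bcd2211b2b55e6 --format '{{json .Options}}'
# a network created without --opt com.docker.network.driver.mtu shows no MTU entry here,
# and containers attached to it fall back to mtu 1500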

@kszymans

kszymans commented Sep 28, 2021

I experienced the same issue yesterday and ended up recreating the whole VPC and GKE cluster in order to align the MTU to 1500.

@FalconerTC
Author

FalconerTC commented Sep 28, 2021

I'm successfully using a workaround based on the runner issue posted above. My modified docker wrapper script is:

#!/usr/bin/env bash

# Inspired by https://github.com/actions/runner/issues/775#issuecomment-927826684

if [[ "$1" = "network" ]] && [[ "$2" = "create" ]]; then
    shift 2  # drop the "network create" arguments

    # Read the daemon-wide MTU (falling back to 1500) and pass it as a driver
    # option, since user-defined networks ignore the daemon-level setting.
    MTU=$(jq -r '.mtu // 1500' /etc/docker/daemon.json 2>/dev/null)
    if [[ -n "$MTU" ]]; then
        /usr/local/bin/docker.bin network create --opt com.docker.network.driver.mtu="$MTU" "${@}"
    else
        /usr/local/bin/docker.bin network create "${@}"
    fi
else
    # just call docker as normal if not "network create"
    /usr/local/bin/docker.bin "${@}"
fi

Waiting to hear back on a Google support ticket for more info on what really went wrong here. Given the limitations of docker network create, I don't think there's anything else actions-runner-controller can do here.
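
(For anyone copying this: the wrapper assumes the real CLI has been moved aside, e.g. when building a custom runner image — a sketch; the paths and the docker-wrapper.sh filename are assumptions:)

# in the runner image build, shadow the real CLI with the wrapper
mv /usr/local/bin/docker /usr/local/bin/docker.bin    # real CLI the wrapper calls
cp docker-wrapper.sh /usr/local/bin/docker            # the script above
chmod +x /usr/local/bin/docker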

@shadiramadan

kubernetes/test-infra#23741 (comment)

Current ETA from Google on a fix is Thursday

@ciriarte

ciriarte commented Sep 28, 2021

@FalconerTC after the workaround: are you seeing annotations like these in your workflows? (Check your post actions)

Error response from daemon: error while removing network: network github_network_6eaf9ee1c81649408247d6af77329798 id 8a6aa8897caa5fcb6dc65e2b2127512c88d7299994779632fbc4663532467a9c has active endpoints
Warning: Docker network rm failed with exit code 1

@FalconerTC
Author

I'm not, no. Haven't seen any issues with it today. Here's output from a Stop Containers task on the workflow I linked earlier

Stop and remove container: 2e835be9889146d38677c36a2f7c09e1_artifactoryrtrclouddockernodechromelatest_26410e
/usr/local/bin/docker rm --force bae0166a551ea03d6a2ab130e934c2665292332cfc1f8a0133db487106716f33
bae0166a551ea03d6a2ab130e934c2665292332cfc1f8a0133db487106716f33
Remove container network: github_network_4dc6c7707c5e4a6ab795ac422f3c2395
/usr/local/bin/docker network rm github_network_4dc6c7707c5e4a6ab795ac422f3c2395
github_network_4dc6c7707c5e4a6ab795ac422f3c2395

@ciriarte

I want to confirm the workaround also works for me.

@FalconerTC
Author

GKE has resolved the ticket on their side and there's a usable workaround here for any other MTU issues. Closing this.

@mumoshu
Collaborator

mumoshu commented Oct 4, 2021

@callum-tait-pbx I think this is worth covering in our documentation, as the MTU issue itself is not GKE-specific.

@NicklasWallgren

I'm not too keen on building my own custom Docker image. Can't we include the workaround from #848 (comment) in the official image?

There are at least 6 different issues reported on this: https://github.com/actions-runner-controller/actions-runner-controller/issues?q=label%3A%22docker+mtu+issue%22+is%3Aclosed

@mumoshu
Collaborator

mumoshu commented Feb 2, 2022

I'm not very keen to do that either, as I consider this something that needs to be fixed on the GitHub / actions/runner side, don't you think? 🤔

@NicklasWallgren

@mumoshu You're absolutely right, my mistake.
