[URGENT] Reducing our usage of GitHub Runners #14376

Closed
lupyuen opened this issue Oct 17, 2024 · 70 comments · Fixed by #14377, apache/nuttx-apps#2750, #14386, apache/nuttx-apps#2753 or #14400
@lupyuen
Member

lupyuen commented Oct 17, 2024

Hi All: We have an ultimatum to reduce (drastically) our usage of GitHub Actions. Or our Continuous Integration will halt totally in Two Weeks. Here's what I'll implement within 24 hours for nuttx and nuttx-apps repos:

  1. When we submit or update a Complex PR that affects All Architectures (Arm, RISC-V, Xtensa, etc): CI Workflow shall run only half the jobs. Previously the CI Workflow ran arm-01 to arm-14, now it will run only arm-01 to arm-07. (This will reduce GitHub Cost by 32%)

  2. When the Complex PR is Merged: CI Workflow will still run all jobs arm-01 to arm-14

    (Simple PRs with One Single Arch / Board will build the same way as before: arm-01 to arm-14)

  3. For NuttX Admins: Our Merge Jobs are now at github.com/NuttX/nuttx. We shall have only Two Scheduled Merge Jobs per day

    I shall quickly Cancel any Merge Jobs that appear in nuttx and nuttx-apps repos. Then at 00:00 UTC and 12:00 UTC: I shall start the Latest Merge Job at nuttxpr. (This will reduce GitHub Cost by 17%)

  4. macOS and Windows Jobs (msys2 / msvc): They shall be totally disabled until we find a way to manage their costs. (GitHub charges 10x premium for macOS runners, 2x premium for Windows runners!)

    Let's monitor the GitHub Cost after disabling macOS and Windows Jobs. It's possible that macOS and Windows Jobs are contributing a huge part of the cost. We could re-enable and simplify them after monitoring.

    (This must be done for BOTH nuttx and nuttx-apps repos. Sadly the ASF Report for GitHub Runners doesn't break down the usage by repo, so we'll never know how much macOS and Windows Jobs are contributing to the cost. That's why we need CI: Disable all jobs for macOS and Windows #14377)

    (Wish I could run NuttX CI Jobs on my M2 Mac Mini. But the CI Script only supports Intel Macs sigh. Buy a Refurbished Intel Mac Mini?)
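
The job-selection rule in items 1 and 2 can be sketched roughly as below. This is only an illustration in Python, not the actual CI Workflow logic; the function name `select_jobs` and its two flags are hypothetical.

```python
def select_jobs(jobs, is_complex_pr, is_merge):
    """Return the CI jobs to run under the rule described above.

    jobs: ordered list of job names, e.g. arm-01 .. arm-14
    is_complex_pr: True if the PR affects all architectures
    is_merge: True if this run is for a PR that has been merged
    (All names here are hypothetical, for illustration only.)
    """
    if is_complex_pr and not is_merge:
        # Complex PR, not yet merged: run only the first half of the jobs
        return jobs[: len(jobs) // 2]
    # Simple PRs and merged PRs: run every job, same as before
    return list(jobs)


arm_jobs = [f"arm-{n:02d}" for n in range(1, 15)]  # arm-01 .. arm-14
print(select_jobs(arm_jobs, is_complex_pr=True, is_merge=False))
```

With the 14 Arm jobs, a Complex PR that is not yet merged would get `arm-01` to `arm-07`, and everything else gets the full list.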

We have done an Analysis of CI Jobs over the past 24 hours:

https://docs.google.com/spreadsheets/d/1ujGKmUyy-cGY-l1pDBfle_Y6LKMsNp7o3rbfT1UkiZE/edit?gid=0#gid=0

Many CI Jobs are Incomplete: We waste GitHub Runners on jobs that eventually get superseded and cancelled

Screenshot 2024-10-17 at 1 18 14 PM

When we Halve the CI Jobs: We reduce the wastage of GitHub Runners

Screenshot 2024-10-17 at 1 15 30 PM

Scheduled Merge Jobs will also reduce wastage of GitHub Runners, since most Merge Jobs don't complete (only 1 completed yesterday)

Screenshot 2024-10-17 at 1 16 16 PM

See the ASF Policy for GitHub Actions

lupyuen added a commit to lupyuen2/wip-nuttx that referenced this issue Oct 17, 2024
This PR disables all CI Jobs for macOS and Windows, to reduce GitHub Cost. Details here: apache#14376
lupyuen added a commit to lupyuen2/wip-nuttx-apps that referenced this issue Oct 17, 2024
This PR disables all CI Jobs for macOS and Windows, to reduce GitHub Cost. Details here: apache/nuttx#14376
@lupyuen
Member Author

lupyuen commented Oct 17, 2024

As commented by @xiaoxiang781216:

can we reduce the boards built on the Linux host so that we can keep macOS/Windows? it's very easy to break these hosts without this basic coverage.

I suggest that we monitor the GitHub Cost after disabling macOS and Windows Jobs. It's possible that macOS and Windows Jobs are contributing a huge part of the cost. We could re-enable and simplify them after monitoring.
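
A rough way to see why macOS and Windows Jobs might dominate the bill, assuming the 10x and 2x premiums mentioned above really apply to the ASF account. The minutes below are made-up numbers, not taken from the ASF Report.

```python
# Billing multipliers per runner OS, as quoted in this issue.
# Whether the ASF plan is billed this way is an assumption.
MULTIPLIER = {"linux": 1, "windows": 2, "macos": 10}


def billed_minutes(usage):
    """usage maps runner OS -> raw minutes; returns OS -> billed minutes."""
    return {os_name: minutes * MULTIPLIER[os_name]
            for os_name, minutes in usage.items()}


# Hypothetical raw minutes for one day of CI (illustrative only)
usage = {"linux": 10_000, "windows": 1_000, "macos": 1_000}
billed = billed_minutes(usage)
total = sum(billed.values())
for os_name, minutes in billed.items():
    share = 100 * minutes / total
    print(f"{os_name}: {minutes} billed minutes ({share:.0f}% of total)")
```

In this made-up example macOS is only a small part of the raw minutes, yet it is billed as much as all the Linux jobs together. That is why monitoring the cost after disabling these jobs should tell us a lot.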

@raiden00pl
Member

One of the methods proposed by (if I remember correctly) @btashton is to replace many simple configurations for some boards (mostly for peripherals testing) with one large jumbo config activating everything possible.
This won't work for chips with low memory, but it will save some CI resources anyway.

@lupyuen
Member Author

lupyuen commented Oct 17, 2024

@raiden00pl Yep I agree. Or we could test a complex target like board:lvgl?

@lupyuen
Member Author

lupyuen commented Oct 17, 2024

Here's another comment about macOS and Windows by @yamt: #14377 (comment)

@yamt
Contributor

yamt commented Oct 17, 2024

sorry, let me ask a dumb question.
what plan are we using? https://github.com/pricing
is apache paying for it?

@lupyuen
Member Author

lupyuen commented Oct 17, 2024

what plan are we using? https://github.com/pricing

@yamt It's probably a special plan negotiated by ASF and GitHub? It's not mentioned in the ASF Policy for GitHub Actions: https://infra.apache.org/github-actions-policy.html

I find this "contract" a little strange. Why are all ASF Projects subjected to the same quotas? And why can't we increase the quota if we happen to have additional funding?

Update: More info here: https://cwiki.apache.org/confluence/display/INFRA/GitHub+self-hosted+runners

If your project uses GitHub Actions, you share a queue with all other Apache projects using Github Actions, which can quickly lead to frustration for everyone involved. Builds can be stuck in "queued" for 6+ hours.

One option (if you want to stick with GitHub and don't want to use the Infra-managed Jenkins) is for your project to create its own self-hosted runners, which means your jobs will run on a virtual machine (VM) under your project's control. However this is not something to tackle lightly, as Infra will not manage or secure your VM - that is up to you.

Update 2: This sounds really complicated. I'd rather use my own Mac Mini to execute the NuttX CI Tests, once a day?

@yamt
Contributor

yamt commented Oct 17, 2024

what plan are we using? https://github.com/pricing

@yamt It's probably a special plan negotiated by ASF and GitHub? It's not mentioned in the ASF Policy for GitHub Actions: https://infra.apache.org/github-actions-policy.html

do you know if the macos/windows premium applies as usual?
the policy page seems to have no mention about it.

I find this "contract" a little strange. Why are all ASF Projects subjected to the same quotas? And why can't we increase the quota if we happen to have additional funding?

yea, i guess projects have very different sizes/demands.
(i feel nuttx is using too much anyway though :-)

@TimJTi
Contributor

TimJTi commented Oct 17, 2024

...I'd rather use my own Mac Mini to execute the NuttX CI Tests, once a day?

Is there any merit in "farming out" CI tests to those with boards? I think there was a discussion about NuttX owning a suite of boards but not sure where that got to - and would depend on just 1 or 2 people managing it.

As an aside, is there a guide to self-running CI? As I work on a custom board it would be good for me to do this occasionally but I have no idea where to start!

@lupyuen
Member Author

lupyuen commented Oct 17, 2024

@TimJTi Here's how I do daily testing on Milk-V Duo S SBC: https://lupyuen.github.io/articles/sg2000a

@TimJTi
Contributor

TimJTi commented Oct 17, 2024

@TimJTi Here's how I do daily testing on Milk-V Duo S SBC: https://lupyuen.github.io/articles/sg2000a

And I just RTFM...the "official" guide is here so I'll review both and hopefully get it working - and submit any tweaks/corrections/enhancements I find are needed to the NuttX "How To" documentation

@jerpelea
Contributor

jerpelea commented Oct 17, 2024 via email

@michallenc
Contributor

michallenc commented Oct 17, 2024

@TimJTi Here's how I do daily testing on Milk-V Duo S SBC: https://lupyuen.github.io/articles/sg2000a

And I just RTFM...the "official" guide is here so I'll review both and hopefully get it working - and submit any tweaks/corrections/enhancements I find are needed to the NuttX "How To" documentation

These work, but they do not describe the entire CI, just how to run the pytest checks for the sim:citest configuration.

@cederom
Contributor

cederom commented Oct 17, 2024

Yes, let's cut what we can (but keep at least minimal functional configure, build, and syntax testing) and see what the cost reduction is. We need to show Apache we are working on the problem. So far the optimizations did not cut the usage and we are in danger of losing all CI :-(

On the other hand, it does not seem fair to share the same CI quota as small projects. NuttX is a fully featured RTOS working on ~1000 different devices. In order to keep project code quality we need the CI.

Maybe it's time to rethink / redesign from scratch the CI test architecture and implementation?

@cederom
Contributor

cederom commented Oct 17, 2024

Another problem is that people very often send unfinished, undescribed PRs and then update them without a comment or request, which triggers the whole big CI process several times :-(

Some changes are sometimes required and we cannot avoid that; it is part of the process. But maybe we can make something more "adaptive", so that only a minimal CI is launched by default, preferably only in the area that was changed. Then, with all approvals, we can manually trigger one final big check before merge?

Long story short: We can switch CI test runs to manual trigger for now to see how it reduces costs. I would like to see two buttons to start Basic and Advanced (maybe also Full = current setup) CI.
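
One possible shape for the "only in the area that was changed" idea, as a minimal Python sketch. It assumes the NuttX source layout `arch/<arch>/...` and `boards/<arch>/...`, and it treats any other path as affecting every architecture. The function name `affected_archs` is hypothetical and this is not how the CI Workflow decides today.

```python
def affected_archs(changed_files):
    """Guess which architectures a PR touches, from its changed paths.

    Assumes the layout arch/<arch>/... and boards/<arch>/...
    Returns a set of arch names, or None if everything must be built.
    """
    archs = set()
    for path in changed_files:
        parts = path.split("/")
        if len(parts) >= 3 and parts[0] in ("arch", "boards"):
            archs.add(parts[1])
        else:
            # Common code (sched/, drivers/, ...): needs the full build
            return None
    return archs or None


# Example paths, for illustration only
print(affected_archs([
    "arch/risc-v/src/common/riscv_exit.c",
    "boards/risc-v/qemu-rv/rv-virt/configs/nsh/defconfig",
]))
```

A result of `None` would map to the big check, while a single arch would map to the Basic run for that arch only.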

@lupyuen
Member Author

lupyuen commented Oct 17, 2024

@cederom Maybe our PRs should have a Mandatory Field: Which NuttX Config to build, e.g. rv-virt:nsh. Then the CI Workflow should do tools/configure.sh rv-virt:nsh && make. Before starting the whole CI Build?
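
A minimal sketch of how such a field could be read from the PR description, assuming a line of the form `Build Config: rv-virt:nsh`. The field name `Build Config` and the function `build_command` are hypothetical, nothing like this exists in the PR template yet.

```python
import re
import shlex


def build_command(pr_body):
    """Find a 'Build Config: <board>:<config>' line in a PR description.

    Returns the shell command to build that config, or None if the
    field is missing. The field name 'Build Config' is hypothetical.
    """
    match = re.search(r"^Build Config:\s*([\w.-]+:[\w.-]+)\s*$",
                      pr_body, re.MULTILINE)
    if match is None:
        return None
    target = match.group(1)
    # Quote the target so that it is safe to paste into a shell
    return f"tools/configure.sh {shlex.quote(target)} && make"


print(build_command("Summary: Fix the UART driver\nBuild Config: rv-virt:nsh\n"))
print(build_command("Summary only, no config given"))
```

The CI Workflow could run the returned command first and start the whole CI Build only if it succeeds. When the field is missing we get `None`, and the workflow would need some fallback.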

@cederom
Contributor

cederom commented Oct 17, 2024

@cederom Maybe our PRs should have a Mandatory Field: Which NuttX Config to build, e.g. rv-virt:nsh. Then the CI Workflow should do tools/configure.sh rv-virt:nsh && make. Before starting the whole CI Build?

People often can't fill in even one single sentence to describe Summary, Impact, Testing :D This may be detected automatically... or we can just see which architecture is the cheapest one and use it for all basic tests?

@raiden00pl
Member

Another problem is that people very often send unfinished, undescribed PRs and then update them without a comment or request, which triggers the whole big CI process several times :-(

Often contributors use CI to test all configurations instead of testing changes locally. On one hand I understand this, because compiling all configurations on a local machine takes a lot of time. On the other hand, I'm not sure that is what CI is for (especially when we have limits on its use).

@cederom Maybe our PRs should have a Mandatory Field: Which NuttX Config to build, e.g. rv-virt:nsh. Then the CI Workflow should do tools/configure.sh rv-virt:nsh && make. Before starting the whole CI Build?

It won't work. Users are lazy, and in order to choose what needs to be compiled correctly, you need a comprehensive knowledge of the entire NuttX, which is not that easy.
The only reasonable option is to automate this process.

@cederom
Contributor

cederom commented Oct 17, 2024

So it looks like for now, while dramatic steps need to be taken, we need to mark all PRs as drafts and start CI by hand when we are sure all is ready for merge? o_O

@jerpelea
Contributor

jerpelea commented Oct 17, 2024 via email

xiaoxiang781216 pushed a commit to apache/nuttx-apps that referenced this issue Oct 17, 2024
This PR disables all CI Jobs for macOS and Windows, to reduce GitHub Cost. Details here: apache/nuttx#14376
lupyuen added a commit to lupyuen2/wip-nuttx that referenced this issue Oct 17, 2024
When we submit or update a Complex PR that affects All Architectures (Arm, RISC-V, Xtensa, etc): CI Workflow shall run only half the jobs. Previously CI Workflow will run `arm-01` to `arm-14`, now we will run only `arm-01` to `arm-07`.

When the Complex PR is Merged: CI Workflow will still run all jobs `arm-01` to `arm-14`

Simple PRs with One Single Arch / Board will build the same way as before: `arm-01` to `arm-14`

This is explained here: apache#14376

Note that this version of `arch.yml` has diverged from `nuttx-apps`, since we are unable to merge apache#14377
@stbenn
Contributor

stbenn commented Oct 25, 2024

@lupyuen It looks like I made a mistake with some commit messages, which caused our branch to be referenced in a few issues in the apache repo. My apologies. I believe I have removed the commit message references, but if there is anything else I need to do to fix this, please let me know and I will get on it ASAP.

@lupyuen
Member Author

lupyuen commented Oct 25, 2024

@stbenn No worries thanks :-)

@lupyuen
Member Author

lupyuen commented Oct 25, 2024

4 Days to Festivity: Yesterday we consumed 13 Full-Time GitHub Runners (half of the ASF Quota for GitHub Runners)...

Screenshot 2024-10-26 at 7 32 10 AM

Past 7 Days: We used an average of 9 Full-Time GitHub Runners...

Screenshot 2024-10-26 at 7 37 14 AM

So we're on track to make ASF very happy on 30 Oct! Let's monitor today...

(Live Image) (Live Log)

@cederom
Contributor

cederom commented Oct 26, 2024

Thank you @lupyuen for your amazing work!! Have a good calm weekend :-) :-)

@lupyuen
Member Author

lupyuen commented Oct 26, 2024

3 Days to Tranquility: Yesterday was a quiet Saturday (no more Release Builds yay!). We consumed only 4 Full-Time GitHub Runners...

Screenshot 2024-10-27 at 6 08 34 AM

Let's hope today will be a peaceful Sunday...

(Live Image) (Live Log)

@lupyuen
Member Author

lupyuen commented Oct 27, 2024

Something strange about Network Timeouts in our Docker Workflows: First Run fails while downloading something from GitHub:

Configuration/Tool: imxrt1050-evk/libcxxtest,CONFIG_ARM_TOOLCHAIN_GNU_EABI
curl: (28) Failed to connect to github.com port 443 after 134188 ms: Connection timed out
make[1]: *** [libcxx.defs:28: libcxx-17.0.6.src.tar.xz] Error 28

Second Run fails again, while downloading NimBLE from GitHub:

Configuration/Tool: nucleo-wb55rg/nimble,CONFIG_ARM_TOOLCHAIN_GNU_EABI
curl: (28) Failed to connect to github.com port 443 after 134619 ms: Connection timed out
make[2]: *** [Makefile:55: /github/workspace/sources/apps/wireless/bluetooth/nimble_context] Error 2

Third Run succeeds. Why do we keep seeing these errors: GitHub Actions with Docker, can't connect to GitHub itself?

Is something misconfigured in our Docker Image? But the exact same Docker Image runs fine on my own Build Farm. It doesn't show any errors.

Is GitHub Actions starting our Docker Container with the wrong MTU (Network Packet Size)? 🤔

Meanwhile I'm running a script to Restart Failed Jobs on our NuttX Mirror Repos: restart-failed-job.sh
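
For readers who want the general idea without opening the script: the sketch below is not restart-failed-job.sh itself, just one way to do something similar in Python with the GitHub CLI (`gh run list` and `gh run rerun --failed`). The function names are hypothetical, and it assumes `gh` is installed and authenticated.

```python
import json
import subprocess


def failed_run_ids(run_list_json):
    """Parse the JSON printed by `gh run list --json databaseId,conclusion`."""
    return [run["databaseId"] for run in json.loads(run_list_json)
            if run.get("conclusion") == "failure"]


def rerun_command(repo, run_id):
    """Build the gh command that re-runs only the failed jobs of one run."""
    return ["gh", "run", "rerun", str(run_id), "--repo", repo, "--failed"]


def restart_failed(repo, limit=20):
    """List recent runs in `repo` and restart the ones that failed."""
    listing = subprocess.run(
        ["gh", "run", "list", "--repo", repo, "--limit", str(limit),
         "--json", "databaseId,conclusion"],
        check=True, capture_output=True, text=True).stdout
    for run_id in failed_run_ids(listing):
        subprocess.run(rerun_command(repo, run_id), check=True)


# Example (not executed here): restart_failed("nuttxpr/nuttx")
```

Re-running only the failed jobs, instead of the whole run, keeps the extra runner time small when the failure is just a network timeout.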

@lupyuen
Member Author

lupyuen commented Oct 27, 2024

2 Days to Transcendence: Yesterday we consumed 10 Full-Time GitHub Runners. We peaked briefly at 21 while compiling a few NuttX Apps.

Screenshot 2024-10-28 at 6 16 33 AM

Let's keep on monitoring thanks!

(Live Image) (Live Log)

@lupyuen
Member Author

lupyuen commented Oct 28, 2024

Monitoring our CI Servers 24 x 7

This runs on my 4K TV (Xiaomi 65-inch) all day, all night:

Screenshot 2024-10-28 at 1 53 26 PM

When I'm out on Overnight Hikes: I check my phone at every water break:
GridArt_20241028_150938083

I have GitHub Scripts that will run on Termux Android (remember to pkg install gh and set GITHUB_TOKEN):

@cederom
Contributor

cederom commented Oct 28, 2024

Lup's Operations Center =)

@lupyuen
Member Author

lupyuen commented Oct 28, 2024

1 Day to Utopia: Yesterday was a busy Monday, we consumed 14 Full-Time GitHub Runners. That's 56% of the ASF Quota for Full-Time Runners...

Screenshot 2024-10-29 at 6 01 52 AM

We peaked briefly at 26 Full-Time Runners. Let's hang in there thanks! :-)
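
A small sketch of the arithmetic behind these numbers, assuming that one Full-Time Runner means one runner kept busy for the whole 24 hours, and using the ASF Quota of 25 Full-Time Runners mentioned in this issue. The 336 runner-hours are back-computed from 14 runners for illustration, not read from the report.

```python
ASF_QUOTA = 25  # Full-Time Runners allowed by ASF, as stated in this issue


def full_time_runners(runner_hours, period_hours=24):
    """Average number of runners busy over the period.

    Assumes 1 Full-Time Runner = 1 runner busy for the whole period.
    """
    return runner_hours / period_hours


def quota_percent(runners, quota=ASF_QUOTA):
    """Share of the quota used, in percent."""
    return 100 * runners / quota


runners = full_time_runners(336)  # 336 runner-hours in one day (illustrative)
print(runners, quota_percent(runners))
```

So 14 Full-Time Runners out of 25 works out to the 56% quoted above.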

(Live Image) (Live Log)

@cederom
Contributor

cederom commented Oct 28, 2024

2 days but we should be fine thanks to our Super Hero @lupyuen !! AVE =)

@lupyuen
Member Author

lupyuen commented Oct 28, 2024

Thank you so much @cederom! :-)

@jerpelea
Contributor

jerpelea commented Oct 29, 2024 via email

@lupyuen
Member Author

lupyuen commented Oct 29, 2024

0 Days to Final Audit: ASF Infra Team will be checking on us one last time today! Yesterday was a super busy Tuesday, we consumed 15 Full-Time GitHub Runners (peaked briefly at 31)

Screenshot 2024-10-30 at 6 02 25 AM

Past 7 Days: We consumed 12 Full-Time Runners, which is half the ASF Quota of 25 Full-Time Runners yay!

Screenshot 2024-10-30 at 6 06 21 AM

FYI: Our "Monthly Bill" for GitHub Actions used to be $18K...

before-30days

Right now our Monthly Bill is $14K. And still dropping!

after-30days

Let's wait for the good news from ASF, thank you everyone! 🙏

(Live Image) (Live Log)

@cederom
Contributor

cederom commented Oct 29, 2024

🙏 🙏 🙏

@lupyuen
Member Author

lupyuen commented Oct 30, 2024

GitHub Actions had some laggy issues just now: https://www.githubstatus.com/incidents/9yk1fbk0qjjc

So please ignore the over-inflated data in our report (because everything got lagged). Thanks!

(Live Image) (Live Log)

@lupyuen
Member Author

lupyuen commented Oct 31, 2024

It's Oct 31 and our CI Servers are still running. We made it yay! 🎉

We got plenty to do:

  1. We made lots of fixes to the CI Workflow. I'll document everything in an article.

  2. Become more resilient and self-sufficient with Our Own Build Farm (away from GitHub)

  3. Analyse our Build Logs with Our Own Tools (instead of GitHub)

Thank you everyone for making this happen! 🙏

Live Update: Full-Time GitHub Runners

(Live Image) (Live Log)

@lupyuen lupyuen closed this as completed Oct 31, 2024
@cederom
Contributor

cederom commented Oct 31, 2024

BIG THANK YOU @lupyuen FOR YOUR HELP, TIME, AND PATIENCE!!
YOU SAVED NUTTX'S CI MAAAN :-)

lupyuen added a commit to lupyuen2/wip-nuttx-apps that referenced this issue Nov 3, 2024
Due to the [recent cost-cutting](apache/nuttx#14376), we are no longer running PR Merge Jobs in the `nuttx` and `nuttx-apps` repos. For this to happen, I am now running a script on my computer that will cancel any PR Merge Jobs that appear: [kill-push-master.sh](https://github.com/lupyuen/nuttx-release/blob/main/kill-push-master.sh)

This PR disables PR Merge Jobs permanently, so that we no longer need to run the script. This prevents our CI Charges from over-running, in case the script fails to operate properly.
lupyuen added a commit to lupyuen2/wip-nuttx that referenced this issue Nov 3, 2024
Due to the [recent cost-cutting](apache#14376), we are no longer running PR Merge Jobs in the `nuttx` and `nuttx-apps` repos. For this to happen, I am now running a script on my computer that will cancel any PR Merge Jobs that appear: [kill-push-master.sh](https://github.com/lupyuen/nuttx-release/blob/main/kill-push-master.sh)

This PR disables PR Merge Jobs permanently, so that we no longer need to run the script. This prevents our CI Charges from over-running, in case the script fails to operate properly.
xiaoxiang781216 pushed a commit that referenced this issue Nov 4, 2024
Due to the [recent cost-cutting](#14376), we are no longer running PR Merge Jobs in the `nuttx` and `nuttx-apps` repos. For this to happen, I am now running a script on my computer that will cancel any PR Merge Jobs that appear: [kill-push-master.sh](https://github.com/lupyuen/nuttx-release/blob/main/kill-push-master.sh)

This PR disables PR Merge Jobs permanently, so that we no longer need to run the script. This prevents our CI Charges from over-running, in case the script fails to operate properly.
xiaoxiang781216 pushed a commit to apache/nuttx-apps that referenced this issue Nov 4, 2024
Due to the [recent cost-cutting](apache/nuttx#14376), we are no longer running PR Merge Jobs in the `nuttx` and `nuttx-apps` repos. For this to happen, I am now running a script on my computer that will cancel any PR Merge Jobs that appear: [kill-push-master.sh](https://github.com/lupyuen/nuttx-release/blob/main/kill-push-master.sh)

This PR disables PR Merge Jobs permanently, so that we no longer need to run the script. This prevents our CI Charges from over-running, in case the script fails to operate properly.
@lupyuen
Member Author

lupyuen commented Nov 10, 2024

[Article] Optimising the Continuous Integration for NuttX

Within Two Weeks: We squashed our GitHub Actions spending from $4,900 (weekly) down to $890. Thank you everyone for helping out, we saved our CI Servers from shutdown! 🎉

This article explains everything we did in the (Semi-Chaotic) Two Weeks:

(1) Shut down the macOS and Windows Builds, revive them in a different form

(2) Merge Jobs are super costly, we moved them to the NuttX Mirror Repo

(3) We Halved the CI Checks for Complex PRs

(4) Simple PRs are already quite fast. (Sometimes 12 Mins!)

(5) Coding the Build Rules for our CI Workflow, monitoring our CI Servers 24 x 7

(6) We can’t run All CI Checks, but NuttX Devs can help ourselves!

Check out the article: https://lupyuen.codeberg.page/articles/ci3.html

ci3-title

JaeheeKwon pushed a commit to JaeheeKwon/nuttx that referenced this issue Nov 28, 2024
Due to the [recent cost-cutting](apache#14376), we are no longer running PR Merge Jobs in the `nuttx` and `nuttx-apps` repos. For this to happen, I am now running a script on my computer that will cancel any PR Merge Jobs that appear: [kill-push-master.sh](https://github.com/lupyuen/nuttx-release/blob/main/kill-push-master.sh)

This PR disables PR Merge Jobs permanently, so that we no longer need to run the script. This prevents our CI Charges from over-running, in case the script fails to operate properly.