
Commit

stateless-ci: rename post
bobheadxi committed Aug 29, 2024
1 parent 7976c5b commit 30388f6
Showing 1 changed file with 2 additions and 2 deletions.
4 changes: 2 additions & 2 deletions content/_posts/2022-4-18-stateless-ci.md
@@ -1,5 +1,5 @@
---
title: "Scaling a CI service with dynamic and stateless Kubernetes Jobs"
title: "Dynamic and stateless Kubernetes Jobs for stable CI"
layout: post
image: https://upload.wikimedia.org/wikipedia/commons/thumb/3/39/Kubernetes_logo_without_workmark.svg/1200px-Kubernetes_logo_without_workmark.svg.png
hero_image: /assets/images/posts/stateless-ci/dashboard.png
@@ -73,7 +73,7 @@ sequenceDiagram

As long as there are jobs in the Buildkite queue, deployed agent pods remain online until the autoscaler deems it appropriate to scale down. As such, multiple jobs can be dispatched onto the same agent before the fleet is scaled down.

-While Buildkite has mechanisms for mitigating state issues across jobs, and most Sourcegraph pipelines have cleanup and best practices for migitating them as well, we occasionally still run into "botched" agents. These are particularly prevalent in jobs where tools are installed globally, or Docker containers are started but not correctly cleaned up (for example, if directories are mounted), and so on. We've also had issues where certain pods encounter network issues, causing them to fail all the jobs they accept. We also have jobs that work "by accident", especially in some of our more obscure repositories, where jobs rely on tools being installed by other jobs, and suddenly stop working if they land on a "fresh" agent, or those tools get upgraded unexpectedly.
+While Buildkite has mechanisms for mitigating state issues across jobs, and most Sourcegraph pipelines have cleanup and best practices for mitigating them as well, we occasionally still run into "botched" agents. These are particularly prevalent in jobs where tools are installed globally, or Docker containers are started but not correctly cleaned up (for example, if directories are mounted), and so on. We've also had issues where certain pods encounter network issues, causing them to fail all the jobs they accept. We also have jobs that work "by accident", especially in some of our more obscure repositories, where jobs rely on tools being installed by other jobs, and suddenly stop working if they land on a "fresh" agent, or those tools get upgraded unexpectedly.

All of these issues eventually led us to decide to build a stateless approach to running our Buildkite agents.
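
To make the stateless direction concrete, here is a minimal sketch of what dispatching a fresh, single-use Buildkite agent as a Kubernetes Job could look like with client-go. This is an illustration of the general technique rather than Sourcegraph's actual dispatcher: the `ci` namespace, the `buildkite-agent` secret name, and the `dispatchAgentJob` helper are assumptions for this example, while `--disconnect-after-job` is the Buildkite agent flag that makes an agent exit after running exactly one job.

```go
// Hypothetical sketch only: create one short-lived Kubernetes Job per queued
// Buildkite job, so every CI job runs on a fresh agent and no state leaks
// between jobs. Namespace, secret name, and image tag are illustrative.
package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func dispatchAgentJob(ctx context.Context, client kubernetes.Interface) error {
	backoff := int32(0) // a failed agent is never retried in place
	ttl := int32(600)   // finished Jobs are garbage-collected after 10 minutes

	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "buildkite-agent-"},
		Spec: batchv1.JobSpec{
			BackoffLimit:            &backoff,
			TTLSecondsAfterFinished: &ttl,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					// Never restart the container: each pod serves exactly one job.
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{{
						Name:  "agent",
						Image: "buildkite/agent:3",
						// --disconnect-after-job makes the agent accept a single job
						// from the queue and then exit, completing the Job.
						Args: []string{"start", "--disconnect-after-job"},
						Env: []corev1.EnvVar{{
							Name: "BUILDKITE_AGENT_TOKEN",
							ValueFrom: &corev1.EnvVarSource{
								SecretKeyRef: &corev1.SecretKeySelector{
									LocalObjectReference: corev1.LocalObjectReference{Name: "buildkite-agent"},
									Key:                  "token",
								},
							},
						}},
					}},
				},
			},
		},
	}

	_, err := client.BatchV1().Jobs("ci").Create(ctx, job, metav1.CreateOptions{})
	return err
}

func main() {
	// In-cluster config assumes the dispatcher itself runs inside the cluster.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// A real dispatcher would watch the Buildkite queue and call this once
	// per pending job; here we dispatch a single agent for illustration.
	if err := dispatchAgentJob(context.Background(), client); err != nil {
		panic(err)
	}
}
```

Because each Job in this sketch has a `BackoffLimit` of zero and its pod is never reused, a botched environment only affects the single CI job that created it.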

