CI (Buildkite, GHA): Allow any user with triage or commit permissions to retry all failed Buildkite jobs #42138

Merged
merged 6 commits into from
Sep 11, 2021
Changes from 2 commits

1 change: 1 addition & 0 deletions .github/CODEOWNERS
@@ -2,4 +2,5 @@ CODEOWNERS @JuliaLang/github-actions
 /.github/ @JuliaLang/github-actions
 /.buildkite/ @JuliaLang/github-actions

+/.github/workflows/retry.yml @DilumAluthge
 /.github/workflows/statuses.yml @DilumAluthge
184 changes: 184 additions & 0 deletions .github/workflows/retry.yml
@@ -0,0 +1,184 @@
# Please ping @DilumAluthge when making any changes to this file.

# Here are some steps that we take in this workflow file for security reasons:
# 1. We do not checkout any code.
# 2. We do not run any external actions.
# 3. We do not give the `GITHUB_TOKEN` any permissions.
# 4. We only give the Buildkite API token (`BUILDKITE_API_TOKEN_RETRY`) the minimum necessary
# set of permissions.

# Important note to Buildkite maintainers:
# In order to make this work, you need to tell Buildkite that it should NOT create a brand-new
# build when someone closes and reopens a pull request. To do so:
# 1. Go to the relevant pipeline (e.g. https://buildkite.com/julialang/julia-master).
# 2. Click on the "Pipeline Settings" button.
# 3. In the left sidebar, under "Pipeline Settings", click on "GitHub".
# 4. In the "GitHub Settings", under "Build Pull Requests", make sure that the "Skip pull
# request builds for existing commits" checkbox is checked. This is the setting that tells
# Buildkite that it should NOT create a brand-new build when someone closes and reopens a
# pull request.
# 5. At the bottom of the page, click the "Save GitHub Settings" button.

name: Retry Failed Buildkite Jobs

on:
  # When using the `pull_request_target` event, all PRs will get access to secret environment
  # variables (such as the `BUILDKITE_API_TOKEN_RETRY` secret environment variable), even if
  # the PR is from a fork. Therefore, for security reasons, we do not checkout any code in
  # this workflow.
  # TODO: change `pull_request` to `pull_request_target`.
  pull_request:

    # TODO: delete the following line (once we have completely transitioned from Buildbot to Buildkite)
    types: [ reopened, labeled ]

    # TODO: uncomment the following line (once we have completely transitioned from Buildbot to Buildkite)
    # types: [ reopened ]
Comment on lines +31 to +35
Member

Are these TODOs still relevant?

Member Author

This deserves some explanation.

Okay, so if you look at the workflow file currently, you'll see it runs on two types of event:

  1. When the `Buildkite - retry failed jobs` label is added to a PR.
  2. When a PR is reopened after being closed.

Once we have fully migrated to Buildkite, I will remove option 1 and only have option 2.

In the current situation, in which we have both Buildbot and Buildkite, here is what happens in each event:

  1. Label is added: In this event, all of the failed Buildkite jobs will be rerun. None of the passed Buildkite jobs will be rerun. Nothing happens to the Buildbot jobs.
  2. PR is reopened after being closed: In this event, all of the failed Buildkite jobs will be rerun. None of the passed Buildkite jobs will be rerun. All of the Buildbot jobs (both passed and failed) will be rerun.

Once we have migrated everything to Buildkite, and we no longer have anything on Buildbot, here is what happens in each event:

  1. Label is added: In this event, all of the failed Buildkite jobs will be rerun.
  2. PR is reopened after being closed: In this event, all of the failed Buildkite jobs will be rerun.

So, currently (in the Buildbot + Buildkite world), the "Label is added" event is preferable, because it doesn't rerun the passed Buildbot jobs, and thus conserves CI resources.

But in the future Buildkite-only world, the "Label is added" event is identical in effect to the "PR is reopened after being closed" event. And I think, personally, that telling someone to "close and reopen a PR, and it will only rerun failed CI" is a much more user-friendly interface than this label business.

So, once we are in the Buildkite-only world, I want to completely remove the label trigger, and only have the "close and reopen the PR" trigger, because I think it's easier for people to use.

So these TODOs are here to remind me to get rid of the label stuff once we are Buildkite-only.

What do you think?
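
The behaviour described in this comment can be summarized in a short Julia sketch. The function name `rerun_plan` and its `buildbot_active` keyword are hypothetical, introduced only for illustration; the workflow itself expresses the trigger as the `if:` condition in `retry.yml`.

```julia
# Hypothetical summary of which jobs are rerun for each trigger, as described
# in the comment above. Nothing here is part of the actual workflow.
function rerun_plan(trigger::Symbol; buildbot_active::Bool)
    trigger in (:labeled, :reopened) ||
        throw(ArgumentError("unknown trigger: $(trigger)"))
    return (
        # Both triggers retry only the failed Buildkite jobs.
        buildkite_failed = true,
        buildkite_passed = false,
        # Closing and reopening a PR also restarts every Buildbot job,
        # but only while Buildbot is still in use.
        buildbot_all = buildbot_active && trigger == :reopened,
    )
end

# While both CI systems exist, the label trigger leaves Buildbot alone.
rerun_plan(:labeled; buildbot_active = true).buildbot_all   # false
rerun_plan(:reopened; buildbot_active = true).buildbot_all  # true

# Once Buildbot is gone, the two triggers have the same effect.
rerun_plan(:labeled; buildbot_active = false) ==
    rerun_plan(:reopened; buildbot_active = false)           # true
```

Under these assumptions the two triggers only differ while Buildbot is active, which is the argument for dropping the label trigger later.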


# We do not give the `GITHUB_TOKEN` any permissions.
permissions:
  statuses: none

jobs:
  retry:
    name: retry
    runs-on: ubuntu-latest

    # TODO: delete the following line (once we have completely transitioned from Buildbot to Buildkite)
Member

Is this still relevant? I thought this PR was buildkite-specific and has nothing to do with buildbot.

    if: github.repository == 'JuliaLang/julia' && (github.event.label.name == 'Buildkite - retry failed jobs' || github.event.action == 'reopened')

    # TODO: uncomment the following line (once we have completely transitioned from Buildbot to Buildkite)
    # if: github.repository == 'JuliaLang/julia'

    steps:
      # For security reasons, we do not checkout any code in this workflow.
Member

Same note as above; can't someone malicious just edit the yml?

      - run: echo The pull request number is ${{github.event.number}}
      - run: |
          function get_build_number(pr_number::AbstractString;
                                    organization_slug::AbstractString,
                                    pipeline_slug::AbstractString)
              _pr_number = strip(pr_number)
              isempty(_pr_number) && throw(ArgumentError("You must provide a valid pull request number"))
              builds_per_page = 100
              max_pages = 100
              for page_number in 1:max_pages
                  url = string(
                      "https://api.buildkite.com/v2/",
                      "organizations/$(organization_slug)/",
                      "pipelines/$(pipeline_slug)/",
                      "builds?per_page=$(builds_per_page)&page=$(page_number)",
                  )
                  cmd_string = string(
                      "curl -X GET -H \"Authorization: Bearer $(ENV["BUILDKITE_API_TOKEN_RETRY"])\" \"$(url)\"",
                      "| ",
                      "jq '[.[] | {pr: .pull_request.id, id: .id, number: .number}]' ",
                      "| ",
                      "jq '.[] | select(.pr == \"$(_pr_number)\") | .number'",
                  )
Member

Whoever said Julia didn't make a good shell scripting language? :P

                  cmd = `bash -c $(cmd_string)`
                  sleep(0.1) # helps us stay under the Buildkite API rate limits
                  str = read(cmd, String)
                  lines = strip.(strip.(strip.(split(strip(str), '\n')), Ref('"')))
                  filter!(x -> !isempty(x), lines)
                  if !isempty(lines)
                      build_number = convert(String, strip(lines[1]))::String
                      return build_number
                  end # if
              end # for
              msg = string(
                  "I tried $(max_pages) pages ",
                  "(with $(builds_per_page) builds per page), ",
                  "but I could not find any Buildkite builds for ",
                  "pull request $(pr_number).",
              )
              throw(ErrorException(msg))
          end

          function get_failed_job_ids(build_number::AbstractString;
                                      organization_slug::AbstractString,
                                      pipeline_slug::AbstractString)
              url = string(
                  "https://api.buildkite.com/v2/",
                  "organizations/$(organization_slug)/",
                  "pipelines/$(pipeline_slug)/",
                  "builds/$(build_number)",
              )
              cmd_string_1 = string(
                  "curl -X GET -H \"Authorization: Bearer $(ENV["BUILDKITE_API_TOKEN_RETRY"])\" \"$(url)\"",
                  "| ",
                  "jq '[.jobs | . | .[] | {id: .id, state: .state}]' ",
                  "| ",
                  "jq '[.[] | select((.state == \"failed\") or (.state == \"errored\"))]'",
              )
              cmd_string_2 = string(
                  cmd_string_1,
                  "| ",
                  "jq '.[] .id'",
              )
              cmd_1 = `bash -c $(cmd_string_1)`
              cmd_2 = `bash -c $(cmd_string_2)`
              sleep(0.1) # helps us stay under the Buildkite API rate limits
              run(cmd_1) # print the output to the log, for debugging purposes
              sleep(0.1) # helps us stay under the Buildkite API rate limits
              str = read(cmd_2, String)
              lines = strip.(strip.(strip.(split(strip(str), '\n')), Ref('"')))
              filter!(x -> !isempty(x), lines)
              failed_job_ids = convert(Vector{String}, lines)
              return failed_job_ids
          end

          function retry_job(build_number::AbstractString,
                             job_id::AbstractString;
                             organization_slug::AbstractString,
                             pipeline_slug::AbstractString)
              url = string(
                  "https://api.buildkite.com/v2/",
                  "organizations/$(organization_slug)/",
                  "pipelines/$(pipeline_slug)/",
                  "builds/$(build_number)/",
                  "jobs/$(job_id)/retry",
              )
              cmd_string = string(
                  "curl -X PUT -H \"Authorization: Bearer $(ENV["BUILDKITE_API_TOKEN_RETRY"])\" \"$(url)\"",
              )
              cmd = `bash -c $(cmd_string)`
              sleep(0.1) # helps us stay under the Buildkite API rate limits
              run(cmd)
              return nothing
          end

          function retry_jobs(build_number::AbstractString,
                              job_ids::AbstractVector{<:AbstractString};
                              organization_slug::AbstractString,
                              pipeline_slug::AbstractString)
              if isempty(job_ids)
                  @info "There are no jobs to retry."
              end
              num_jobs = length(job_ids)
              for (i, job_id) in enumerate(job_ids)
                  @info "$(i) of $(num_jobs). Attempting to retry job: $(job_id)"
                  retry_job(build_number, job_id; organization_slug, pipeline_slug)
              end
              return nothing
          end

          function main(pr_number::AbstractString;
                        organization_slug::AbstractString,
                        pipeline_slug::AbstractString)
              @info "The pull request number is $(pr_number)"
              build_number = get_build_number(pr_number; organization_slug, pipeline_slug)
              @info "The build number is $(build_number)"
              failed_job_ids = get_failed_job_ids(build_number; organization_slug, pipeline_slug)
              @info "There are $(length(failed_job_ids)) failed jobs." failed_job_ids
              retry_jobs(build_number, failed_job_ids; organization_slug, pipeline_slug)
              return nothing
          end

          const pr_number = "${{github.event.number}}"
          # No trailing slashes here: the URL builders above already append "/" after each slug.
          const organization_slug = "julialang"
          const pipeline_slug = "julia-master"

          main(pr_number; organization_slug, pipeline_slug)
        shell: julia --color=yes {0}
        env:
          BUILDKITE_API_TOKEN_RETRY: ${{ secrets.BUILDKITE_API_TOKEN_RETRY }}
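
The page-by-page search in `get_build_number` can be exercised without network access by passing the page fetcher in as a function. The sketch below is a simplified restatement, not the workflow's code: `find_build_number`, `fetch_page`, `fetch_sample`, and the sample data are hypothetical, and each build is modeled as a named tuple with the `pr` and `number` fields that the `jq` filter extracts.

```julia
# Hypothetical, network-free restatement of the search loop in `get_build_number`.
# `fetch_page(page_number)` stands in for the `curl | jq` pipeline and returns
# a vector of `(pr = ..., number = ...)` named tuples for that page.
function find_build_number(fetch_page, pr_number::AbstractString; max_pages::Integer = 100)
    _pr_number = strip(pr_number)
    isempty(_pr_number) && throw(ArgumentError("You must provide a valid pull request number"))
    for page_number in 1:max_pages
        for build in fetch_page(page_number)
            # Return the first match, as the workflow does with `lines[1]`.
            build.pr == _pr_number && return string(build.number)
        end
    end
    throw(ErrorException("No build found for pull request $(pr_number) in $(max_pages) pages"))
end

# Made-up sample data: two pages of builds.
sample_pages = [
    [(pr = "41000", number = 10), (pr = "41001", number = 11)],
    [(pr = "42138", number = 12)],
]

# Pages beyond the sample data are returned as empty vectors.
fetch_sample(page_number) =
    page_number <= length(sample_pages) ? sample_pages[page_number] : empty(sample_pages[1])

find_build_number(fetch_sample, " 42138 "; max_pages = 3)  # "12"
```

Injecting the fetcher keeps the search logic separate from the API token and the shell pipeline, so the loop and its error path can be checked locally.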
12 changes: 3 additions & 9 deletions .github/workflows/statuses.yml
@@ -3,14 +3,11 @@
 # This is just a short-term solution until we have migrated all of CI to Buildkite.
 #
 # 1. TODO: delete this file once we have migrated all of CI to Buildkite.
-#
-# 2. TODO: disable GitHub Actions on the `JuliaLang/julia` repository once we have migrated all
-# of CI to Buildkite.

 # Here are some steps that we take in this workflow file for security reasons:
 # 1. We do not checkout any code.
 # 2. We do not run any external actions.
-# 3. We only give `GITHUB_TOKEN` the minimum necessary set of permissions.
+# 3. We only give the `GITHUB_TOKEN` the minimum necessary set of permissions.

 name: Statuses

@@ -27,7 +24,7 @@ on:
       - 'master'
       - 'release-*'

-# These are the permissions for the `GITHUB_TOKEN` token.
+# These are the permissions for the `GITHUB_TOKEN`.
 # We should only give the token the minimum necessary set of permissions.
 permissions:
   statuses: write
@@ -37,15 +34,12 @@ jobs:
     name: statuses
     runs-on: ubuntu-latest
     if: github.repository == 'JuliaLang/julia'
-    strategy:
-      fail-fast: false
     steps:
       # For security reasons, we do not checkout any code in this workflow.
       - run: echo "SHA=${{ github.event.pull_request.head.sha }}" >> $GITHUB_ENV
         if: github.event_name == 'pull_request_target'

       - run: echo "SHA=${{ github.sha }}" >> $GITHUB_ENV
         if: github.event_name != 'pull_request_target'

       - run: echo "The SHA is ${{ env.SHA }}"

       # As we incrementally migrate individual jobs from Buildbot to Buildkite, we should
2 changes: 1 addition & 1 deletion base/Base.jl
@@ -1,6 +1,6 @@
 # This file is a part of Julia. License is MIT: https://julialang.org/license

-baremodule Base
+baremodule Base

 using Core.Intrinsics, Core.IR
