x/build: add LUCI openbsd-riscv64 builder #67105

Open
4a6f656c opened this issue Apr 29, 2024 · 23 comments
Labels: arch-riscv (Issues solely affecting the riscv64 architecture), Builders (x/build issues: builders, bots, dashboards), NeedsFix (The path to resolution is known, but the work has not been done), new-builder, OS-OpenBSD

Comments

@4a6f656c
Contributor

4a6f656c commented Apr 29, 2024

Please add a LUCI builder for openbsd/riscv64 with hostname openbsd-riscv64-jsing - note that this will be running on the same hardware as the current host-openbsd-riscv64-joelsing builder. This machine is unlikely to be capable of running both forms and will need to be hard migrated at some point in time. CSR for this builder is attached.

openbsd-riscv64-jsing.csr.txt

@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Apr 29, 2024
@gopherbot gopherbot added this to the Unreleased milestone Apr 29, 2024
@dmitshur dmitshur moved this to In Progress in Go Release May 10, 2024
@dmitshur dmitshur added OS-OpenBSD arch-riscv Issues solely affecting the riscv64 architecture. labels May 10, 2024
@cherrymui cherrymui added the NeedsFix The path to resolution is known, but the work has not been done. label May 10, 2024
@gopherbot
Contributor

Change https://go.dev/cl/584777 mentions this issue: main.star: update openbsd-{arm,arm64,riscv64} builder configurations

gopherbot pushed a commit to golang/build that referenced this issue May 10, 2024
The BUILDER_TYPE for these os-arch combinations already exists.
I guess I don't need to add anything?

Update the configurations to include them in the low capacity builders
list and copy the timeout scale from the old (non-LUCI) builder
settings.

For golang/go#67103, golang/go#67104, golang/go#67105.

Change-Id: I971be047630edaf9db04346ea2087b663fef2847
Reviewed-on: https://go-review.googlesource.com/c/build/+/584777
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
@dmitshur
Contributor

Here's the certificate: openbsd-riscv64-jsing-1715630475.cert.txt.

Please feel free to stop the old coordinator builder on your end whenever it's getting in the way of bringing up the new LUCI builder.

@4a6f656c
Contributor Author

@dmitshur - thanks, this one should be up and running, although I have no idea how to confirm this (it logs nothing except for the occasional RPC error).

@dmitshur
Contributor

dmitshur commented May 27, 2024

Thanks for setting it up! I see it at https://ci.chromium.org/ui/p/golang/builders/luci.golang.ci/gotip-openbsd-riscv64?limit=200 and https://chromium-swarm.appspot.com/bot?id=openbsd-riscv64-jsing (the latter link requires signing in to view, any account will work). In a recent build like https://ci.chromium.org/b/8747215393872298177, I see it failed during cipd ensure on fetching the bootstrap toolchain. Sent CL 588356 for that.

@gopherbot
Contributor

Change https://go.dev/cl/588356 mentions this issue: main.star: use gotip-ish as bootstrap for openbsd/riscv64

gopherbot pushed a commit to golang/build that referenced this issue May 28, 2024
The openbsd/riscv64 port is newly added in Go 1.23, so it's not viable
to use Go 1.20.6 or Go 1.21.0 as its bootstrap version. Use the latest
tip development version available at this moment.

For golang/go#67105.

Change-Id: Ia72042065fac0babeab24afa44ee8937f9bc7f46
Reviewed-on: https://go-review.googlesource.com/c/build/+/588356
Reviewed-by: Michael Knyszek <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Auto-Submit: Dmitri Shuralyov <[email protected]>
Reviewed-by: Joel Sing <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
@dmitshur
Contributor

dmitshur commented May 30, 2024

Looking at the most recent builds, make.bash is now successfully completing but taking 1.1-1.2 hrs to do so. That seems quite long for make.bash alone—is it expected? It's then consistently timing out during the "upload prebuilt go" step, taking over 7 minutes. (This step should complete in under a minute with fast upload speeds, or at least within a few minutes.)

@4a6f656c You mentioned the hardware might not be capable of running both builder workloads at once, and the results above agree. Can you try stopping the old builder to see how the LUCI one works when it's the only one running?

@4a6f656c
Contributor Author

Looking at the most recent builds, make.bash is now successfully completing but taking 1.1-1.2 hrs to do so. That seems quite long for make.bash alone—is it expected?

@dmitshur - Yes, a make.bash run on this machine is around 45 minutes when nothing else is running.

It's then consistently timing out during the "upload prebuilt go" step, taking over 7 minutes. (This step should complete in under a minute with fast upload speeds, or at least within a few minutes.)

The builders are in a fairly network-constrained environment: download speeds are acceptable, but upload is severely limited (and the RTT to the US, which is where I presume it's trying to upload to, is fairly high). The existing builders do not upload snapshots for these reasons:

https://github.com/golang/build/blob/master/dashboard/builders.go#L2075

@4a6f656c You mentioned the hardware might not be capable of running both builder workloads at once, and the results above agree. Can you try stopping the old builder to see how the LUCI one works when it's the only one running?

It will make the builds slightly faster, however it will not help the upload issue...

@4a6f656c
Contributor Author

@dmitshur what are the next steps here? How do we disable snapshot uploads for these builders?

@dmitshur
Contributor

That mechanism isn't currently available for LUCI builders.

Without snapshots, and with make.bash taking 45 minutes, it makes testing golang.org/x repos slower and use more resources, but I understand you're willing to accept that trade-off if the upload speed is a hard constraint that you don't have control over. I see that the old coordinator builder was configured to only test x/net and x/sys, and only at tip without any of the release branches (see here). Given the above, is that configuration something you'd need to be incorporated as well?

I added a note so we'll discuss this with the team next week.

@4a6f656c
Contributor Author

@dmitshur I've installed a secondary Internet connection, which should provide sufficient upload bandwidth - a successful run appears to have occurred for x/sys, however everything since then has failed with a similar "CAS failed" error:

https://ci.chromium.org/ui/p/golang/builders/ci/gotip-openbsd-riscv64/b8744932690088854977/overview

https://logs.chromium.org/logs/golang/buildbucket/cr-buildbucket/8744932690088854977/+/u/step/6/log/2

No idea what that means or what would be causing that (overall this seems to be one of the more convoluted and opaque systems I've worked with... no mention of URLs or how to reproduce the issue to troubleshoot it):

[E2024-06-17T12:44:06.237606+10:00 57136 0 annotate.go:273] original error: retry budget exhausted (6 attempts): context deadline exceeded

goroutine 1:
#0 go.chromium.org/luci/client/cmd/cas/casimpl/download.go:515 - casimpl.(*downloadRun).doDownload()
  reason: failed to download files

#1 go.chromium.org/luci/client/cmd/cas/casimpl/download.go:574 - casimpl.(*downloadRun).Run()
#2 github.com/maruel/[email protected]/subcommands.go:395 - subcommands.Run()
#3 cas/main.go:95 - main.main()
#4 runtime/proc.go:271 - runtime.main()
#5 runtime/asm_riscv64.s:540 - runtime.goexit()
cas: failed to download files: retry budget exhausted (6 attempts): context deadline exceeded

Any suggestions?

@dmitshur
Contributor

dmitshur commented Jun 26, 2024

In case you haven't already seen it, there is open source documentation for the Swarming Bot at https://chromium.googlesource.com/infra/luci/luci-py.git/+/main/appengine/swarming/doc/Bot.md.

The "original error: retry budget exhausted (6 attempts): context deadline exceeded" error reads to me that the time it has taken exceeded the deadline that was given. This is consistent with the task taking 6 min 40 sec, which is longer than the same task takes on most other builders. You mentioned only the uplink speed was increased, whereas the failed task was "fetch prebuilt go" whose speed would depend on the downlink speed.

We discussed this in a meeting and we're okay with maintaining support for disabling prebuilt toolchain uploads and downloads. I'm not sure how soon we can implement it ourselves, but it can happen sooner if you're willing to work on it. It'd be best to file a separate issue for it for tracking purposes. Also note that it's expected the x/ repo builds will be slowed by the 45 min toolchain build each time, but I understand that's a trade-off you're okay with. In theory, it's possible for you to consider a mechanism to locally cache prebuilt toolchains as a means of speeding that up without needing to go over the internet.

@4a6f656c
Contributor Author

In case you haven't already seen it, there is open source documentation for the Swarming Bot at https://chromium.googlesource.com/infra/luci/luci-py.git/+/main/appengine/swarming/doc/Bot.md.

I wasn't aware of that, thanks.

The "original error: retry budget exhausted (6 attempts): context deadline exceeded" error reads to me that the time it has taken exceeded the deadline that was given. This is consistent with the task taking 6 min 40 sec, which is longer than the same task takes on most other builders. You mentioned only the uplink speed was increased, whereas the failed task was "fetch prebuilt go" whose speed would depend on the downlink speed.

Sure, but deadline exceeded to download what? If I knew what the URL was or a command to reproduce the download, it would have been much easier to troubleshoot further. The secondary Internet connection is faster on both downlink and uplink - given the above information, I've made some further network changes to increase the download capacity available to the builders:

...
Requesting https://dl.google.com/go/go1.22.4.linux-amd64.tar.gz
...
68964131 bytes received in 41.01 seconds (1.60 MB/s)

We discussed this in a meeting and we're okay with maintaining support for disabling prebuilt toolchain uploads and downloads. I'm not sure how soon we can implement it ourselves, but it can happen sooner if you're willing to work on it. It'd be best to file a separate issue for it for tracking purposes. Also note that it's expected the x/ repo builds will be slowed by the 45 min toolchain build each time, but I understand that's a trade-off you're okay with. In theory, it's possible for you to consider a mechanism to locally cache prebuilt toolchains as a means of speeding that up without needing to go over the internet.

Thanks, I'm hoping that with the secondary Internet connection this is no longer necessary. I'll start swarming again and see how much progress we make.

@dmitshur
Contributor

Sure, but deadline exceeded to download what? If I knew what the URL was or a command to reproduce the download, it would have been much easier to troubleshoot further.

Note that the exact "cas download ..." command with all command line flags and environment variables is shown here:

https://logs.chromium.org/logs/golang/buildbucket/cr-buildbucket/8744932690088854977/+/u/step/6/log/1

The exact "cas download" invocation is constructed in golangbuild. The source for that is linked here. Here's where the "fetch prebuilt go" step is defined and the command it runs:

https://source.chromium.org/chromium/infra/infra/+/main:go/src/infra/experimental/golangbuild/prebuilt_go.go;l=160-197;drc=816bca63478e90d97e316b342c14bb6971b56848

That particular prebuilt toolchain seems to be around 265 MB, so at 1.6 MB/s it should complete in under 3 minutes, which should hopefully be quick enough not to time out. (I'm also starting to think we could increase the cas download timeout if needed.)

@4a6f656c
Contributor Author

4a6f656c commented Jul 8, 2024

@dmitshur thanks for providing the additional pointers for troubleshooting.

After some further adjustments to the environment, it seems that we're now running reasonably stably for openbsd/riscv64 on LUCI:

https://ci.chromium.org/ui/p/golang/g/port-openbsd-riscv64/builders

Do we want to cut over the build dashboard?

@dmitshur
Contributor

dmitshur commented Jul 9, 2024

Congrats on reaching this point!

Do we want to cut over the build dashboard?

Are you referring to removing the known issue for the builder at https://cs.opensource.google/go/x/build/+/luci-config:main.star;l=486 (similar to CL 596817)? Yes, sounds good, please feel free to send a CL.

@4a6f656c
Contributor Author

Are you referring to removing the known issue for the builder at https://cs.opensource.google/go/x/build/+/luci-config:main.star;l=486 (similar to CL 596817)? Yes, sounds good, please feel free to send a CL.

I'm referring to the fact that the LUCI builder is not the one that shows up on https://build.golang.org/ (it's still the old builder, which is currently turned off). I can send a CL to remove the known issue for this builder - is that also going to result in it becoming visible on the dashboard?

@dmitshur
Contributor

dmitshur commented Jul 10, 2024

Yes, LUCI builders with a known issue aren’t shown on the build.golang.org dashboard, so removing its known issue will make it visible. Thanks. You can also add an entry to BuildersPortedToLUCI to hide the old builder from there.

@gopherbot
Contributor

Change https://go.dev/cl/598155 mentions this issue: main.star: remove known issue for openbsd-riscv64 luci builder

gopherbot pushed a commit to golang/build that referenced this issue Jul 16, 2024
This builder has been migrated and appears to be running stably.

Updates golang/go#67105

Change-Id: Iaafaf669a290b5fc1fda2deea1ed06b2faf7ec60
Reviewed-on: https://go-review.googlesource.com/c/build/+/598155
Reviewed-by: Cherry Mui <[email protected]>
LUCI-TryBot-Result: Go LUCI <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Reviewed-by: Dmitri Shuralyov <[email protected]>
Auto-Submit: Dmitri Shuralyov <[email protected]>
@4a6f656c
Contributor Author

4a6f656c commented Aug 2, 2024

@dmitshur - I think we're going to need to reduce the builds back to what they were previously (gotip, net and sys IIRC) - building and testing gotip takes ~2 hours, which means that by the time we add in Go 1.23, x_tools-gotip and x_tools-go1.23 (all around 2 hours each), we're already up to 8+ hours (that means fewer than three gotip builds make it to build.golang.org in a day).

How can we restore the previous build behaviour?

@dmitshur
Contributor

dmitshur commented Aug 2, 2024

We can consider that if it is needed. The relevant parts within LUCI configuration are the enabled function and the project classification dictionary.

A few things to note though. Looking at https://chromium-swarm.appspot.com/bot?id=openbsd-riscv64-jsing, many x/ repos (such as x/image, x/term, x/sync, x/debug, x/crypto, x/arch, x/mod, and so on) are actually taking only 10-15 minutes to test. So dropping them wouldn't buy that much extra time, but would lose quite a bit of coverage for the port.

Another part to consider is that @mengzhuo was working on an openbsd-riscv64-mengzhuo builder in issue #64176. It's not operational right now, but if it can be, the volume of work would be divided across multiple builders and it'd be easier for them to keep up with new Go commits.
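
As a rough sketch of the kind of filtering being discussed, the following Go program models a per-builder allowlist that mirrors the old coordinator coverage described in this thread (the main repo, x/net and x/sys, at tip only). It is not the real LUCI configuration, which is written in Starlark in main.star. The `restricted` map, the `enabled` signature, and the project and branch names used as keys are all hypothetical, and treating "tip only" as applying to the main repo as well is an assumption.

```go
package main

import "fmt"

// restricted maps a builder type to the only projects it should test.
// Builder types that are absent from the map are unrestricted.
// Hypothetical structure; values follow the coverage described in the thread.
var restricted = map[string]map[string]bool{
	"openbsd-riscv64": {"go": true, "net": true, "sys": true},
}

// enabled reports whether builderType should test project at goBranch.
func enabled(builderType, project, goBranch string) bool {
	projects, ok := restricted[builderType]
	if !ok {
		return true // no restriction configured for this builder type
	}
	// Restricted builders test only allowlisted projects, and only at tip.
	return goBranch == "gotip" && projects[project]
}

func main() {
	fmt.Println(enabled("openbsd-riscv64", "net", "gotip"))   // true
	fmt.Println(enabled("openbsd-riscv64", "tools", "gotip")) // false
	fmt.Println(enabled("openbsd-riscv64", "net", "go1.23"))  // false
	fmt.Println(enabled("linux-amd64", "tools", "go1.23"))    // true
}
```

An allowlist keeps the default (test everything) for other builders, so a slow port can opt into reduced coverage without affecting the rest.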

@dmitshur
Contributor

Checking in on the status here. Did the comment above resolve the need to make further changes, or would you still like to pursue that?

The builder appears to be missing now (https://ci.chromium.org/ui/p/golang/g/port-openbsd-riscv64/builders), and the last build was in September (https://ci.chromium.org/b/8735431351491753009). Are you able to take a look?

@4a6f656c
Contributor Author

4a6f656c commented Jan 1, 2025

Checking in on the status here. Did the comment above resolve the need to make further changes, or would you still like to pursue that?

Unfortunately I ran out of time to spend on builder conversions. While increasing coverage would be nice in the longer term, picking up on failures quickly is critical during the Go development/release cycle.

The builder appears to be missing now (https://ci.chromium.org/ui/p/golang/g/port-openbsd-riscv64/builders), and the last build was in September (https://ci.chromium.org/b/8735431351491753009). Are you able to take a look?

The LUCI builder was disabled and the buildlet restarted, in order to get back to the previous state. If you're able to help get the LUCI configuration to match the current behaviour, then I can try turning LUCI back on (otherwise I can look at it when I have more spare time, but that is unlikely to be for at least a few months).

@dmitshur
Contributor

dmitshur commented Jan 4, 2025

Okay, let's update the LUCI bot's coverage to match the previous one, then. I'll work on a CL for that.

@dmitshur dmitshur self-assigned this Jan 4, 2025