x/build: add LUCI openbsd-riscv64 builder #67105
Change https://go.dev/cl/584777 mentions this issue: |
The BUILDER_TYPE for these os-arch combinations already exists. I guess I don't need to add anything? Update the configurations to include them in the low-capacity builders list and copy the timeout scale from the old (non-LUCI) builder settings. For golang/go#67103, golang/go#67104, golang/go#67105. Change-Id: I971be047630edaf9db04346ea2087b663fef2847 Reviewed-on: https://go-review.googlesource.com/c/build/+/584777 LUCI-TryBot-Result: Go LUCI <[email protected]> Reviewed-by: Dmitri Shuralyov <[email protected]> Reviewed-by: Dmitri Shuralyov <[email protected]>
Here's the certificate: openbsd-riscv64-jsing-1715630475.cert.txt. Please feel free to stop the old coordinator builder on your end whenever it's getting in the way of bringing up the new LUCI builder. |
@dmitshur - thanks, this one should be up and running, although I have no idea how to confirm this (it logs nothing except for the occasional RPC error). |
Thanks for setting it up! I see it at https://ci.chromium.org/ui/p/golang/builders/luci.golang.ci/gotip-openbsd-riscv64?limit=200 and https://chromium-swarm.appspot.com/bot?id=openbsd-riscv64-jsing (the latter link requires signing in to view, any account will work). In a recent build like https://ci.chromium.org/b/8747215393872298177, I see it failed during |
Change https://go.dev/cl/588356 mentions this issue: |
The openbsd/riscv64 port is newly added in Go 1.23, so it's not viable to use Go 1.20.6 or Go 1.21.0 as its bootstrap version. Use the latest tip development version available at this moment. For golang/go#67105. Change-Id: Ia72042065fac0babeab24afa44ee8937f9bc7f46 Reviewed-on: https://go-review.googlesource.com/c/build/+/588356 Reviewed-by: Michael Knyszek <[email protected]> Reviewed-by: Dmitri Shuralyov <[email protected]> Auto-Submit: Dmitri Shuralyov <[email protected]> Reviewed-by: Joel Sing <[email protected]> LUCI-TryBot-Result: Go LUCI <[email protected]>
Looking at the most recent builds, make.bash is now successfully completing but taking 1.1-1.2 hrs to do so. That seems quite long for make.bash alone—is it expected? It's then consistently timing out during the "upload prebuilt go" step, taking over 7 minutes. (This step should complete in under a minute with fast upload speeds, or at least within a few minutes.) @4a6f656c You mentioned the hardware might not be capable of running both builder workloads at once, and the results above agree. Can you try stopping the old builder to see how the LUCI one works when it's the only one running? |
@dmitshur - Yes, a
They are in a fairly network constrained environment - download speeds are acceptable, upload is severely limited (plus the RTT to the US is fairly high, which is where I presume it's trying to upload to). The existing builders do not upload snapshots for these reasons: https://github.com/golang/build/blob/master/dashboard/builders.go#L2075
It will make the builds slightly faster, however it will not help the upload issue... |
@dmitshur what are the next steps here? How do we disable snapshot uploads for these builders? |
That mechanism isn't currently available for LUCI builders. Without snapshots, and with make.bash taking 45 minutes, testing golang.org/x repos becomes slower and uses more resources, but I understand you're willing to accept that trade-off if the upload speed is a hard constraint that you don't have control over. I see that the old coordinator builder was configured to only test x/net and x/sys, and only at tip without any of the release branches (see here). Given the above, is that configuration something you'd also need incorporated? I added a note so we'll discuss this with the team next week. |
@dmitshur I've installed a secondary Internet connection, which should provide sufficient upload bandwidth - a successful run appears to have occurred for x/sys, however everything since then has failed with a similar "CAS failed" error: https://ci.chromium.org/ui/p/golang/builders/ci/gotip-openbsd-riscv64/b8744932690088854977/overview No idea what that means or what would be causing that (overall this seems to be one of the more convoluted and opaque systems I've worked with... no mention of URLs or how to reproduce the issue to troubleshoot it):
Any suggestions? |
In case you haven't already seen it, there is open source documentation for the Swarming Bot at https://chromium.googlesource.com/infra/luci/luci-py.git/+/main/appengine/swarming/doc/Bot.md. The "original error: retry budget exhausted (6 attempts): context deadline exceeded" error reads to me as the task exceeding the deadline it was given. This is consistent with the task taking 6 min 40 sec, which is longer than the same task takes on most other builders. You mentioned only the uplink speed was increased, whereas the failed task was "fetch prebuilt go", whose speed would depend on the downlink speed. We discussed this in a meeting and we're okay with maintaining support for disabling prebuilt toolchain uploads and downloads. I'm not sure how soon we can implement it ourselves, but it can happen sooner if you're willing to work on it. It'd be best to file a separate issue for it for tracking purposes. Also note that it's expected the x/ repo builds will be slowed by the 45 min toolchain build each time, but I understand that's a trade-off you're okay with. In theory, it's possible for you to consider a mechanism to locally cache prebuilt toolchains as a means of speeding that up without needing to go over the internet. |
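For readers unfamiliar with this error shape, here is a minimal Go sketch of how a bounded retry loop running under a context deadline can end in "retry budget exhausted (6 attempts): context deadline exceeded". It is an illustration only, not the code the CAS client actually runs; `retry`, `simulate`, and the timings are hypothetical names and values chosen for the example.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// retry calls op up to attempts times and stops early on success.
// If every attempt fails, it wraps the last error. Hypothetical helper.
func retry(ctx context.Context, attempts int, op func(context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = op(ctx); err == nil {
			return nil
		}
	}
	return fmt.Errorf("retry budget exhausted (%d attempts): %w", attempts, err)
}

// simulate runs a transfer that needs far longer than its deadline allows.
func simulate() error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()

	slowDownload := func(ctx context.Context) error {
		select {
		case <-time.After(time.Second): // stand-in for a slow transfer
			return nil
		case <-ctx.Done(): // the deadline fires first
			return ctx.Err()
		}
	}
	return retry(ctx, 6, slowDownload)
}

func main() {
	fmt.Println(simulate())
}
```

Once the deadline has passed, every remaining attempt fails immediately, so retries alone cannot rescue a transfer that is simply too slow for its time budget.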
I wasn't aware of that, thanks.
Sure, but deadline exceeded to download what? If I knew what the URL was or a command to reproduce the download, it would have been much easier to troubleshoot further. The secondary Internet connection is faster on both downlink and uplink - given the above information, I've made some further network changes to increase the download capacity available to the builders:
Thanks, I'm hoping that with the secondary Internet connection this is no longer necessary. I'll start swarming again and see how much progress we make. |
Note that the exact "cas download ..." command with all command line flags and environment variables is shown here: From this link: The exact "cas download" invocation is constructed in That particular prebuilt toolchain seems to be around 265 MB, so at 1.6 MB/s it should complete in under 3 minutes, which should hopefully be quick enough not to time out. (I'm also starting to think we could increase the cas download timeout if needed.) |
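The estimate above can be checked with a few lines of Go. This is a back-of-the-envelope sketch that assumes a sustained transfer rate with no protocol overhead or retries; `transferSeconds` is a hypothetical helper, and the 265 MB and 1.6 MB/s figures are the ones quoted in the comment.

```go
package main

import "fmt"

// transferSeconds estimates how long it takes to move sizeMB megabytes
// at a sustained rate of rateMBps megabytes per second.
func transferSeconds(sizeMB, rateMBps float64) float64 {
	return sizeMB / rateMBps
}

func main() {
	// ~265 MB prebuilt toolchain at ~1.6 MB/s, as quoted above.
	secs := transferSeconds(265, 1.6)
	fmt.Printf("%.0f s (%.1f min)\n", secs, secs/60)
}
```

That works out to roughly 166 seconds, which is under the 3 minutes mentioned, provided the link actually sustains that rate.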
@dmitshur thanks for providing the additional pointers for troubleshooting. After some further adjustments to the environment, it seems that we're now running reasonably stably for openbsd/riscv64 on LUCI: https://ci.chromium.org/ui/p/golang/g/port-openbsd-riscv64/builders Do we want to cut over the build dashboard? |
Congrats on reaching this point!
Are you referring to removing the known issue for the builder at https://cs.opensource.google/go/x/build/+/luci-config:main.star;l=486 (similar to CL 596817)? Yes, sounds good, please feel free to send a CL. |
I'm referring to the fact that the LUCI builder is not the one that shows up on https://build.golang.org/ (it's still the old builder, which is currently turned off). I can send a CL to remove the known issue for this builder - is that also going to result in it becoming visible on the dashboard? |
Yes, LUCI builders with a known issue aren’t shown on the build.golang.org dashboard, so removing its known issue will make it visible. Thanks. You can also add an entry to BuildersPortedToLUCI to hide the old builder from there. |
Change https://go.dev/cl/598155 mentions this issue: |
This builder has been migrated and appears to be running stably. Updates golang/go#67105 Change-Id: Iaafaf669a290b5fc1fda2deea1ed06b2faf7ec60 Reviewed-on: https://go-review.googlesource.com/c/build/+/598155 Reviewed-by: Cherry Mui <[email protected]> LUCI-TryBot-Result: Go LUCI <[email protected]> Reviewed-by: Dmitri Shuralyov <[email protected]> Reviewed-by: Dmitri Shuralyov <[email protected]> Auto-Submit: Dmitri Shuralyov <[email protected]>
@dmitshur - I think we're going to need to reduce the builds back to what they were previously (gotip, net and sys IIRC). Building and testing gotip takes ~2 hours, which means that by the time we add in Go 1.23, x_tools-gotip and x_tools-go1.23 (all around 2 hours each), we're already up to 8+ hours (that means fewer than three gotip builds make it to build.golang.org in a day). How can we restore the previous build behaviour? |
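The arithmetic behind that concern, as a small Go sketch. It assumes the machine runs one build at a time and that each of the four configurations takes about 2 hours; `cyclesPerDay` is a hypothetical helper.

```go
package main

import "fmt"

// cyclesPerDay returns how many full passes over all configurations fit
// into 24 hours when builds run strictly one after another.
func cyclesPerDay(configs int, hoursPerBuild float64) float64 {
	return 24 / (float64(configs) * hoursPerBuild)
}

func main() {
	// gotip, go1.23, x_tools-gotip and x_tools-go1.23 at roughly 2 hours each.
	fmt.Printf("%.1f cycles/day\n", cyclesPerDay(4, 2))
}
```

At exactly 2 hours per build this gives 3 passes per day; since the total is "8+ hours", and further x/ repos add to it, the real number of gotip results per day falls below three.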
We can consider that if it is needed. The relevant parts within LUCI configuration are the enabled function and the project classification dictionary. A few things to note though. Looking at https://chromium-swarm.appspot.com/bot?id=openbsd-riscv64-jsing, many x/ repos (such as x/image, x/term, x/sync, x/debug, x/crypto, x/arch, x/mod, and so on) are actually taking only 10-15 minutes to test. So dropping them wouldn't buy that much extra time, but would lose quite a bit of coverage for the port. Another part to consider is that @mengzhuo was working on an |
Checking in on the status here. Did the comment above resolve the need to make further changes, or would you still like to pursue that? The builder appears to be missing now (https://ci.chromium.org/ui/p/golang/g/port-openbsd-riscv64/builders), and the last build was in September (https://ci.chromium.org/b/8735431351491753009). Are you able to take a look? |
Unfortunately I ran out of time to spend on builder conversions. While increasing coverage would be nice in the longer term, picking up on failures quickly is critical during the Go development/release cycle.
The LUCI builder was disabled and the buildlet restarted, in order to get back to the previous state. If you're able to help get the LUCI configuration to match the current behaviour, then I can try turning LUCI back on (otherwise I can look at it when I have more spare time, but that is unlikely to be for at least a few months). |
Okay, let's update the LUCI bot's coverage to match the previous one, then. I'll work on a CL for that. |
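As a rough illustration of what "matching the previous coverage" means, here is a Go sketch of the filtering logic described earlier in the thread (the main repo plus x/net and x/sys, at tip only). The real check lives in the Starlark LUCI configuration; this `enabled` function, its signature, and the project and branch strings are hypothetical stand-ins, not the actual configuration code.

```go
package main

import "fmt"

// enabled reports whether the builder should run for a given project and
// Go branch. Hypothetical helper mirroring the old coordinator coverage.
func enabled(project, goBranch string) bool {
	if goBranch != "gotip" {
		return false // tip only, no release branches
	}
	switch project {
	case "go", "net", "sys":
		return true
	}
	return false
}

func main() {
	cases := [][2]string{
		{"go", "gotip"}, {"sys", "gotip"}, {"tools", "gotip"}, {"net", "go1.23"},
	}
	for _, c := range cases {
		fmt.Printf("%s@%s: %v\n", c[0], c[1], enabled(c[0], c[1]))
	}
}
```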
Please add a LUCI builder for openbsd/riscv64 with hostname openbsd-riscv64-jsing - note that this will be running on the same hardware as the current host-openbsd-riscv64-joelsing builder. This machine is unlikely to be capable of running both forms and will need to be hard migrated at some point in time. CSR for this builder is attached: openbsd-riscv64-jsing.csr.txt