Using --local_extra_resources limits concurrency #18153
Comments
Looks like it still happens on both 7.0.0 and a build from that PR, unfortunately. I guess that 0725711 means it now happens for memory and CPU if it didn't already!
The commit / PR mentioned doesn't change anything, it just consolidates the flags.
The issue is not connected with resource To summarize, I want to say that this is intended behaviour because of concurrency on CPU.
If you run that example for a while, you'll see the number of concurrently running jobs decreases below the number of CPUs available, even though there are still actions available that do not depend on the resource foo. Essentially the actions scheduled that depend on the resource foo block actions from running that don't depend on foo. Please re-open this, it is a real issue.
I can confirm that this does happen (and have also discussed with @wilwell). When an action execution thread attempts to execute an action that requires local resources, it blocks the thread and waits until they are available, so it is blocking other actions that could be running that don't require any resources. The bottleneck here becomes the number of --jobs specified. The ideal solution would be to have the That being said, this could be mitigated by increasing the number of How much does this impact performance of your builds (I assume the repro you gave is a more extreme example)? And does increasing |
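The blocking described above can be illustrated with a small deterministic model. This is only a sketch under stated assumptions, not Bazel's actual scheduler: the function `simulate`, its parameters, and the tick-based timing are all hypothetical, and `foo` stands in for a single extra resource. A worker that picks up an action needing `foo` keeps its job slot while it waits, which is the behaviour reported in this thread; the `skip_blocked=True` mode models a hypothetical scheduler that only hands out actions that can start immediately.

```python
from collections import deque


def simulate(actions, jobs, foo_capacity, skip_blocked):
    """Return the ticks needed to finish all actions (toy model, not Bazel code).

    actions: list of (needs_foo, duration) tuples in scheduling order.
    jobs: number of job slots (analogous to --jobs); must be >= 1.
    foo_capacity: units of the hypothetical resource "foo"; must be >= 1.
    skip_blocked: False models workers that hold a slot while waiting for foo;
        True models a scheduler that only starts actions that can run now.
    """
    pending = deque(actions)
    waiting = deque()  # durations of actions holding a slot, blocked on foo
    running = []       # [remaining_ticks, needs_foo] per executing action
    foo_free = foo_capacity
    ticks = 0
    while pending or waiting or running:
        # 1. Blocked workers acquire foo as soon as it is free.
        while waiting and foo_free > 0:
            running.append([waiting.popleft(), True])
            foo_free -= 1
        # 2. Idle job slots pick up new actions.
        while pending and len(running) + len(waiting) < jobs:
            if skip_blocked:
                # Pick the first action whose resources are available.
                index = next((i for i, (needs_foo, _) in enumerate(pending)
                              if not needs_foo or foo_free > 0), None)
                if index is None:
                    break
                needs_foo, duration = pending[index]
                del pending[index]
            else:
                needs_foo, duration = pending.popleft()
            if needs_foo and foo_free == 0:
                waiting.append(duration)  # slot is occupied but idle
            else:
                if needs_foo:
                    foo_free -= 1
                running.append([duration, needs_foo])
        # 3. Advance one tick and release foo for finished actions.
        ticks += 1
        still_running = []
        for remaining, needs_foo in running:
            if remaining > 1:
                still_running.append([remaining - 1, needs_foo])
            elif needs_foo:
                foo_free += 1
        running = still_running
    return ticks


# Four actions that need foo, then four that do not; each takes 2 ticks.
actions = [(True, 2)] * 4 + [(False, 2)] * 4
print(simulate(actions, jobs=2, foo_capacity=1, skip_blocked=False))  # 12
print(simulate(actions, jobs=2, foo_capacity=1, skip_blocked=True))   # 8
```

In this toy model the blocked workers cost four ticks with two job slots. Raising `jobs` to 5 in the blocking mode also finishes in 8 ticks, which matches the observation that the number of job slots becomes the bottleneck.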
The repro is a somewhat extreme example, but our build is bottlenecked on a comparatively small (compared to CPU) number of resources. How much this affects our build, I'm not sure, since it's hard to measure the case where this behaviour doesn't exist. We have only one local resource in high contention, so I imagine it wouldn't have a huge impact, since you need to wait for the actions that require that resource to finish anyway. I can imagine this is a larger issue if you have multiple resources in high contention (e.g. foo and bar), since actions that are blocked waiting for foo would block those that could be using bar. However, we don't have that situation yet.
I think that increasing
While the issue is present, I don't think we have sufficient reason to justify implementing a new feature to combat this at the moment unless we see a significant impact on build performance in non-niche cases. This is especially true since async execution with virtual threads is on the horizon, which would probably mitigate this issue.
Right, yes. I was thinking that increasing |
Actually is that true? I seem to remember that the number of concurrent jobs (beyond the number of CPUs) can be increased solely by increasing |
Perhaps you were thinking about this? IIUC, If the other actions are remote, they aren't limited / blocked because only the local action execution code paths call
Makes sense, thanks!
Description of the bug:
If some tests require extra resources (via --local_extra_resources) but others don't, the concurrency of tests that do not require the extra resource is limited by tests that do require the extra resources being scheduled but not starting. These tests that are scheduled but not started count as a concurrent running job, but sit there doing nothing when a job that does not require the resource could be running.
What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.
A reproducer is available at the following repository: https://github.com/cameron-martin/bazel-extra-resources-scheduling-bug
Tests can be run like so:
Half of these tests do not require extra resources, so concurrency should not be limited until these complete. Instead, the number of concurrent jobs drops to way below the maximum since tests that require an unavailable resource are scheduled but cannot yet start.
Which operating system are you running Bazel on?
Ubuntu 22.04
What is the output of bazel info release?
release 6.1.2
If bazel info release returns development version or (@non-git), tell us how you built Bazel.
No response
What's the output of git remote get-url origin; git rev-parse master; git rev-parse HEAD?
No response
Have you found anything relevant by searching the web?
No response
Any other information, logs, or outputs that you want to share?
https://bazelbuild.slack.com/archives/CA31HN1T3/p1681922669112529