Using --local_extra_resources limits concurrency #18153

cameron-martin · 2023-04-20T12:12:10Z

Description of the bug:

If some tests require extra resources (via --local_extra_resources) but others don't, the concurrency of tests that do not require the extra resource is limited by tests that do require the extra resources being scheduled but not starting. These tests that are scheduled but not started count as a concurrent running job, but sit there doing nothing when a job that does not require the resource could be running.

What's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

A reproducer is available at the following repository: https://github.com/cameron-martin/bazel-extra-resources-scheduling-bug

Tests can be run like so:

bazel test //:all

Half of these tests do not require extra resources, so concurrency should not be limited until these complete. Instead, the number of concurrent jobs drops to way below the maximum since tests that require an unavailable resource are scheduled but cannot yet start.

Which operating system are you running Bazel on?

Ubuntu 22.04

What is the output of `bazel info release`?

release 6.1.2

If `bazel info release` returns `development version` or `(@non-git)`, tell us how you built Bazel.

No response

What's the output of `git remote get-url origin; git rev-parse master; git rev-parse HEAD` ?

No response

Have you found anything relevant by searching the web?

No response

Any other information, logs, or outputs that you want to share?

https://bazelbuild.slack.com/archives/CA31HN1T3/p1681922669112529

brentleyjones · 2024-01-04T18:57:08Z

Is this still an issue? And does 0725711 or #20398 change anything?

cameron-martin · 2024-01-04T21:02:18Z

Looks like it still happens on both 7.0.0 and a build from that PR, unfortunately. I guess that 0725711 means it now happens for memory and CPU if it didn't already!

zhengwei143 · 2024-01-09T13:58:44Z

> Is this still an issue? And does 0725711 or #20398 change anything?

> Looks like it still happens on both 7.0.0 and a build from that PR, unfortunately. I guess that 0725711 means it now happens for memory and CPU if it didn't already!

The commit / PR mentioned doesn't change anything; it just consolidates the flags --local_{extra,ram,cpu}_resources into a single flag, --local_resources. Under the hood, the previous flags are all managed by the ResourceManager, so this can happen for memory / CPU too, just not as noticeably, likely because no one really keeps count.

wilwell · 2024-01-09T14:13:19Z

The issue is connected not with the resource foo but with the CPU resource. By default we use 1 CPU for every action, so in your example all 200 jobs are competing for 16 CPUs (or whatever your build's limit is) and 100 jobs are competing for 1 foo resource.
I ran an experiment and saw that a lot of tests were trying to acquire CPU but couldn't.

To summarize, this is intended behaviour, caused by contention for CPU.

cameron-martin · 2024-01-09T14:24:42Z

If you run that example for a while, you'll see the number of concurrently-running jobs decreases below the number of cpus available, even though there are still actions available that do not depend on the resource foo.

Essentially the actions scheduled that depend on the resource foo block actions from running that don't depend on foo. Please re-open this, it is a real issue.

zhengwei143 · 2024-01-09T15:26:04Z

I can confirm that this does happen (and have also discussed with @wilwell). When an action execution thread attempts to execute an action that requires local resources, it blocks the thread and waits until they are available, so it is blocking other actions that could be running that don't require any resources. The bottleneck here becomes the number of --jobs specified.

The ideal solution would be to have the ResourceManager be smart enough to figure out when to block and when not to, and in the latter case return the thread to skyframe to execute another action instead (which requires some additional work to pipe through). However, we likely don't want to always return the thread in the absence of resources, as that incurs the cost of a skyframe restart, so the sweet spot is somewhere in the middle. An implementation would likely use some heuristic to decide which path to take.
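As a rough illustration of the "don't block, run something else" idea, here is a minimal Python sketch. It is not how ResourceManager or skyframe work; `pick_runnable`, the `needs` dictionaries and the action names are all hypothetical, and it ignores the restart cost mentioned above.

```python
def pick_runnable(queue, free):
    """Remove and return the first queued action whose needs fit in `free`.

    `queue` is a list of hypothetical action dicts with a "needs" mapping of
    resource name -> amount; `free` maps resource name -> amount available.
    Returns None when no queued action fits.
    """
    for index, action in enumerate(queue):
        fits = all(free.get(name, 0) >= amount
                   for name, amount in action["needs"].items())
        if fits:
            return queue.pop(index)
    return None


queue = [
    {"name": "needs_foo_test", "needs": {"cpu": 1, "foo": 1}},
    {"name": "plain_test", "needs": {"cpu": 1}},
]
free = {"cpu": 1, "foo": 0}  # foo is currently held by another action

chosen = pick_runnable(queue, free)
print(chosen["name"])  # plain_test
```

The thread skips the action that would block on foo and takes the one that can start right away. A real heuristic would also have to weigh how expensive it is to hand the skipped action back.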

That being said, this could be mitigated by increasing the number of --jobs used, and eventually through the use of virtual threads when that becomes available in Bazel (but that's a story for later).
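To make the blocking behaviour and the --jobs mitigation concrete, here is a toy discrete-time model in Python. It is only a sketch under simplifying assumptions, not Bazel's scheduler: every action needs 1 CPU, actions are picked up in FIFO order, a thread that cannot acquire resources stays occupied, and `simulate` and its parameters are made-up names.

```python
from collections import deque


def simulate(actions, jobs, cpus, foo):
    """Return the number of actions actually running at each tick.

    `actions` is a list of (needs_foo, duration_in_ticks) tuples.
    `jobs` is the number of execution threads, `cpus` and `foo` are the
    amounts of the two resources available.
    """
    queue = deque(actions)
    threads = []  # actions currently occupying a thread (running or blocked)
    free_cpu, free_foo = cpus, foo
    running_per_tick = []
    while queue or threads:
        # Idle threads pick up the next queued actions, whatever they need.
        while queue and len(threads) < jobs:
            needs_foo, duration = queue.popleft()
            threads.append({"needs_foo": needs_foo, "left": duration,
                            "running": False})
        # Blocked threads try to acquire resources; on failure they keep waiting.
        for t in threads:
            if t["running"]:
                continue
            if free_cpu >= 1 and (free_foo >= 1 or not t["needs_foo"]):
                free_cpu -= 1
                if t["needs_foo"]:
                    free_foo -= 1
                t["running"] = True
        running_per_tick.append(sum(t["running"] for t in threads))
        # Advance one tick and release the resources of finished actions.
        still_busy = []
        for t in threads:
            if t["running"]:
                t["left"] -= 1
                if t["left"] <= 0:
                    free_cpu += 1
                    if t["needs_foo"]:
                        free_foo += 1
                    continue
            still_busy.append(t)
        threads = still_busy
    return running_per_tick


# Four tests that need foo, then four that do not; each takes one tick.
actions = [(True, 1)] * 4 + [(False, 1)] * 4

print(simulate(actions, jobs=2, cpus=2, foo=1))  # [1, 1, 1, 2, 2, 1]
print(simulate(actions, jobs=5, cpus=2, foo=1))  # [2, 2, 2, 2]
```

With 2 threads, one of them spends the first ticks blocked on foo, so only one CPU is used even though plain tests are waiting in the queue. With 5 threads there are enough spare threads to reach the plain tests, and both CPUs stay busy.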

How much does this impact performance of your builds (I assume the repro you gave is a more extreme example)? And does increasing --jobs help?

cameron-martin · 2024-01-09T15:45:10Z

The repro is a somewhat extreme example, but our build is bottlenecked on a resource whose count is small compared to the number of CPUs. How much this affects our build, I'm not sure, since it's hard to measure the case where this behaviour doesn't exist.

We have only one local resource in high contention, so I imagine it wouldn't have a huge impact, since you need to wait for the actions that require that resource to finish anyway. I can imagine this is a larger issue if you have multiple resources in high contention (e.g. foo and bar), since actions that are blocked waiting for foo would block those that could be using bar. However, we don't have that situation yet.

zhengwei143 · 2024-01-09T16:09:42Z

I think that increasing --jobs could potentially help if the limited concurrency is affecting the critical path of your build - this would reduce the ratio of blocked actions.

You could also use https://github.com/bazelbuild/bazel-bench to benchmark your build against a higher --jobs and see if it actually makes a difference (I'd be interested to see if this actually causes a regression).
Alternatively, collecting a JSON trace profile might be a simpler way to look at the critical path of the build and get hints on whether it affects build performance.
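If you go the profile route, a small script can estimate how many actions were really running at once. The sketch below assumes the profile uses the Chrome trace event format (complete events with `"ph": "X"`, `ts` and `dur` in microseconds); the function name `peak_concurrency` is made up, and the category string to filter on is an assumption you would need to check against your own profile.

```python
def peak_concurrency(trace_events, category):
    """Return the maximum number of overlapping complete events in `category`."""
    points = []
    for event in trace_events:
        if event.get("ph") != "X" or event.get("cat") != category:
            continue
        duration = event.get("dur", 0)
        if duration <= 0:
            continue  # ignore zero-length events
        points.append((event["ts"], 1))              # event starts
        points.append((event["ts"] + duration, -1))  # event ends
    # Tuples sort by timestamp first; at equal timestamps -1 sorts before +1,
    # so an event ending at t does not overlap one starting at t.
    points.sort()
    peak = current = 0
    for _, delta in points:
        current += delta
        peak = max(peak, current)
    return peak


# Synthetic events; "demo" is a placeholder category, not a real Bazel one.
events = [
    {"ph": "X", "cat": "demo", "ts": 0, "dur": 10},
    {"ph": "X", "cat": "demo", "ts": 5, "dur": 10},
    {"ph": "X", "cat": "demo", "ts": 10, "dur": 5},
    {"ph": "X", "cat": "other", "ts": 0, "dur": 100},
]
print(peak_concurrency(events, "demo"))  # 2
```

For a real profile you would load the JSON file first and pass in its list of trace events. Comparing the peak against the number of CPUs, with and without a higher --jobs, should show whether threads are being held by blocked actions.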

While the issue is present, I don't think we have sufficient reason to justify implementing a new feature to combat this ATM unless we see a significant impact on build performance in non-niche cases. This is especially so since async execution with virtual threads is on the horizon, which would probably mitigate this issue.

cameron-martin · 2024-01-09T17:42:39Z

Right, yes. I was thinking that increasing --jobs would cause more jobs to run than the number of CPUs, but I guess --local_cpu_resources will limit that still. Sounds like a reasonable workaround for now.

cameron-martin · 2024-01-09T17:45:02Z

Actually is that true? I seem to remember that the number of concurrent jobs (beyond the number of CPUs) can be increased solely by increasing --jobs. Do jobs by default not set cpus:1? I'll have to test this out tomorrow.

zhengwei143 · 2024-01-09T18:13:52Z

> Do jobs by default not set cpus:1?

Perhaps you were thinking about this?

IIUC, --local_cpu_resources only limits local actions that acquire CPU resources, and restricts concurrency of actions based on your HOST_CPUS (unless you've explicitly specified a different --local_cpu_resources). If you have a lot of local actions, increasing --jobs will probably increase concurrency up until the CPU resource itself becomes the next bottleneck, which is what you mentioned.

If the other actions are remote, they aren't limited / blocked because only the local action execution code paths call ResourceManager#acquireResources.

--jobs just specifies the number of threads used by Blaze to execute concurrent actions; whether or not each thread acquires resources from the ResourceManager depends on how the action is run (local / remote).
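Put as arithmetic, the relationship described above gives an upper bound on how many actions can run at once. This is only a back-of-the-envelope sketch: `max_running` is a made-up helper, it assumes each local action takes 1 CPU by default, and it ignores RAM, extra resources and threads held by blocked actions.

```python
def max_running(jobs, local_cpu_resources, local_actions, remote_actions,
                cpu_per_action=1):
    """Upper bound on concurrently running actions.

    Local actions are capped by the CPU resource; remote actions are capped
    only by the number of execution threads (--jobs).
    """
    local_slots = int(local_cpu_resources // cpu_per_action)
    return min(jobs, min(local_actions, local_slots) + remote_actions)


print(max_running(jobs=200, local_cpu_resources=16, local_actions=200, remote_actions=0))  # 16
print(max_running(jobs=200, local_cpu_resources=16, local_actions=0, remote_actions=200))  # 200
print(max_running(jobs=8, local_cpu_resources=16, local_actions=200, remote_actions=0))    # 8
```

So raising --jobs alone raises concurrency for remote actions, while local actions stay bounded by the CPU resource.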

cameron-martin · 2024-01-09T18:17:14Z

Makes sense, thanks!

cameron-martin added type: bug untriaged labels Apr 20, 2023

cameron-martin assigned Pavank1992 and sgowroji Apr 20, 2023

Pavank1992 added the team-Local-Exec label Apr 20, 2023

Pavank1992 unassigned sgowroji and Pavank1992 Apr 20, 2023

coeuvre assigned wilwell Apr 25, 2023

coeuvre added the P2 label and removed the untriaged label Apr 25, 2023

jmmv mentioned this issue Sep 20, 2023

Adding a resources:ram:1234 tag to a test doesn't work and causes a crash #19572

Closed

wilwell closed this as completed Jan 9, 2024

zhengwei143 reopened this Jan 9, 2024

Ryang20718 mentioned this issue May 31, 2024

Local_test_resources doesn't actually restrict concurrency when --local_test_jobs is specified #22598

Open

wilwell assigned zhengwei143 and unassigned wilwell Aug 28, 2024
