feat: Implement failure circuit breaker #18359
Conversation
I ran multiple experiments with an unhealthy remote cache, a semi-healthy remote cache, and a healthy remote cache on a full monorepo build, with the following goals.
Here is a summary of the experiment results (failure_threshold 80 in 10s).
I experimented with different failure thresholds and timeouts. While a lower failure threshold is most suitable for handling an unhealthy remote cache, … Furthermore, I verified that there were no circuit trips for "not_found" errors.
@werkt George, can you please review this? I am sorry, I had to create a new diff due to a rebase issue.
Seems pretty good short of some nits.
Incorporated feedback, please merge.
@meteorcloudy @coeuvre Do you mind taking a look? Thanks!
@coeuvre Incorporated feedback, please take a look again.
Thanks!
Please merge it. I don't have permission.
@@ -252,4 +252,8 @@ public void close() {
    }
    channel.release();
  }

  RemoteRetrier getRetrier() {
    return this.retrier;
Is this needed at all? I don't see it being used anywhere, or in the interface.
It is used by test methods inside the RemoteModuleTest class.
if (!ignoredErrors.contains(e.getClass())) {
  int failureCount = failures.incrementAndGet();
  if (slidingWindowSize > 0) {
    scheduledExecutor.schedule(failures::decrementAndGet, slidingWindowSize, TimeUnit.MILLISECONDS);
Add a comment maybe? Why incrementAndGet and then decrement after a few milliseconds?
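For readers following the thread, here is a minimal sketch of the sliding-window pattern under discussion (class and member names are illustrative, not the PR's exact code): each failure bumps a counter, and a delayed task undoes that bump once the failure ages out of the window, so the counter only ever reflects failures seen within the last window.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: at any instant `failures` holds only the failures
// observed within the last `windowMillis` milliseconds.
class SlidingWindowFailureCounter {
  private final AtomicInteger failures = new AtomicInteger(0);
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final long windowMillis;
  private final int threshold;

  SlidingWindowFailureCounter(long windowMillis, int threshold) {
    this.windowMillis = windowMillis;
    this.threshold = threshold;
  }

  void recordFailure() {
    failures.incrementAndGet();
    // Undo the increment once this failure falls outside the window.
    scheduler.schedule(failures::decrementAndGet, windowMillis, TimeUnit.MILLISECONDS);
  }

  boolean shouldTrip() {
    return failures.get() >= threshold;
  }
}
```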
Continuation of #18359.

I ran multiple experiments and tried to find an optimal failure threshold and failure window interval with different remote_timeout values, for a healthy remote cache, a semi-healthy (overloaded) remote cache, and an unhealthy remote cache. As I described [here](#18359 (comment)), even with a healthy remote cache there were circuit trips 5-10% of the time and we were not getting the best result.

Issues related to the failure count:
1. When the remote cache is healthy, builds are fast and Bazel makes a high number of calls to the buildfarm. As a result, even with a moderate failure rate, the failure count may exceed the threshold.
2. Additionally, write calls, which have a higher probability of failure compared to other calls, are batched immediately after the completion of an action's build. This further increases the chances of breaching the failure threshold within the defined window interval.
3. On the other hand, when the remote cache is unhealthy or semi-healthy, builds are significantly slowed down and Bazel makes fewer calls to the remote cache.

Finding a configuration that works well for both healthy and unhealthy remote caches was not feasible. Therefore, I changed the approach to use the failure rate, and easily found a configuration that worked effectively in both scenarios.

Closes #18539.

PiperOrigin-RevId: 538588379
Change-Id: I64a49eeeb32846d41d54ca3b637ded3085588528
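A minimal sketch of the failure-rate approach described above, assuming successes and failures age out of the window the same way as in the earlier counter sketch (the class name, the minimum-call guard, and the thresholds are illustrative, not Bazel's actual code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: trip on the failure percentage rather than the raw
// failure count, so fast builds with many calls are not penalized for a
// handful of failures.
class FailureRateTracker {
  private final AtomicInteger successes = new AtomicInteger(0);
  private final AtomicInteger failures = new AtomicInteger(0);
  private final double failureRateThreshold; // e.g. 0.10 for 10%
  private final int minCallCount;            // assumed guard against tiny samples

  FailureRateTracker(double failureRateThreshold, int minCallCount) {
    this.failureRateThreshold = failureRateThreshold;
    this.minCallCount = minCallCount;
  }

  void recordSuccess() { successes.incrementAndGet(); }

  void recordFailure() { failures.incrementAndGet(); }

  boolean shouldTrip() {
    int f = failures.get();
    int total = f + successes.get();
    // Only evaluate the rate once enough calls have been observed in the window.
    return total >= minCallCount && (double) f / total >= failureRateThreshold;
  }
}
```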
When the digest size exceeds the maximum digest size configured by the remote cache, an "out_of_range" error is returned. These errors should not be considered API failures for the circuit breaker logic, as they do not indicate any issue with the remote-cache service. Similarly, there are other non-retriable errors that should not be treated as server failures, such as ALREADY_EXISTS.

This change treats non-retriable errors as user/client errors and logs them as successes, while retriable errors such as `DEADLINE_EXCEEDED`, `UNKNOWN`, etc. are logged as failures.

Related PRs: #18359, #18539

Closes #18613.

PiperOrigin-RevId: 539948823
Change-Id: I5b51f6a3aecab7c17d73f78b8234d9a6da49fe6c
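A hedged sketch of the classification described above, using gRPC status codes; the exact set of codes treated as retriable here is an assumption for illustration and may differ from Bazel's real list:

```java
import java.util.EnumSet;
import java.util.Set;
import io.grpc.Status;

// Illustrative sketch: retriable server-side errors count as circuit-breaker
// failures, while non-retriable client-side errors (e.g. OUT_OF_RANGE for an
// oversized digest, ALREADY_EXISTS) are recorded as successes so they never
// trip the circuit.
class ErrorClassifier {
  // Assumed set of codes indicating a server-side problem; Bazel's actual
  // list may differ.
  private static final Set<Status.Code> RETRIABLE_CODES =
      EnumSet.of(
          Status.Code.DEADLINE_EXCEEDED,
          Status.Code.UNAVAILABLE,
          Status.Code.UNKNOWN,
          Status.Code.INTERNAL);

  static boolean countsAsFailure(Throwable t) {
    Status.Code code = Status.fromThrowable(t).getCode();
    return RETRIABLE_CODES.contains(code);
  }
}
```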
…#18559)

* feat: Implement failure circuit breaker

Copy of #18120: I accidentally closed #18120 during a rebase and don't have permission to reopen it.

We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented. To address this issue, I implemented a failure circuit breaker, which adds three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_remote_failure_window.

In this implementation, I have implemented a failure strategy for the circuit breaker and used the failure count to trip the circuit.

Reasoning behind using failure count instead of failure rate: to measure the failure rate I would also need the success count, and both the failure and success counts would need to be AtomicIntegers, as both are modified concurrently by multiple threads. Even though getAndIncrement is a very lightweight operation, at very high request rates it might contribute to latency.

Reasoning behind using a failure circuit breaker: a new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache will be disabled for that build. However, it will be enabled again for the next build, as a new instance of Retrier.CircuitBreaker will be created. If needed, we may also add a cool-down strategy in the future, e.g. failure_and_cool_down_strategy.

closes #18136
Closes #18359.

PiperOrigin-RevId: 536349954
Change-Id: I5e1c57d4ad0ce07ddc4808bf1f327bc5df6ce704

* remove target included in cherry-pick by mistake

* Use failure_rate instead of failure count for circuit breaker

---------

Co-authored-by: Ian (Hee) Cha <[email protected]>
Copy of #18120: I accidentally closed #18120 during a rebase and don't have permission to reopen it.
Issue
We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented.
Solution
To address this issue, I implemented a failure circuit breaker, which adds three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_remote_failure_window.
In this implementation, I have implemented a failure strategy for the circuit breaker and used the failure count to trip the circuit.
Reasoning behind using failure count instead of failure rate: to measure the failure rate I would also need the success count, and both the failure and success counts would need to be AtomicIntegers, as both are modified concurrently by multiple threads. Even though getAndIncrement is a very lightweight operation, at very high request rates it might contribute to latency.
Reasoning behind using a failure circuit breaker: a new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache will be disabled for that build. However, it will be enabled again for the next build, as a new instance of Retrier.CircuitBreaker will be created (see the sketch below). If needed, we may also add a cool-down strategy in the future, e.g. failure_and_cool_down_strategy.
closes #18136
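To illustrate the per-build lifecycle described above, here is a hypothetical sketch (the names and shape are not Bazel's actual Retrier.CircuitBreaker API): a fresh breaker is constructed at the start of each build, and once it trips it stays tripped only for the remainder of that build.

```java
import java.util.function.BooleanSupplier;

// Hypothetical per-build wiring: the breaker instance lives only as long as
// one build, so a trip disables the remote cache for that build, and the next
// build starts with a fresh, closed breaker.
class BuildScopedCircuitBreaker {
  private final BooleanSupplier thresholdBreached; // e.g. a failure counter's shouldTrip()
  private volatile boolean tripped = false;        // never reset within a build

  BuildScopedCircuitBreaker(BooleanSupplier thresholdBreached) {
    this.thresholdBreached = thresholdBreached;
  }

  /** Returns true while the remote cache may still be used in this build. */
  boolean acceptCalls() {
    if (!tripped && thresholdBreached.getAsBoolean()) {
      tripped = true; // trip and stay tripped for the rest of the build
    }
    return !tripped;
  }
}
```

Because the instance is discarded at the end of the build, no explicit reset is needed, which matches the reasoning above for deferring a failure_and_cool_down_strategy.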