feat: Implement failure circuit breaker #18359
Conversation
I ran multiple experiments with an unhealthy remote cache, a semi-healthy remote cache, and a healthy remote cache on a full monorepo build, with the following goals.
Here is a summary of the experiment results (failure_threshold 80 in 10s).
I experimented with different failure thresholds and timeouts. While a lower failure threshold is most suitable for handling an unhealthy remote cache, … Furthermore, I verified that there were no circuit trips for "not_found" errors.
@werkt George, can you please review this? I am sorry, I had to create a new diff due to a rebase issue.
Seems pretty good short of some nits.
Incorporated feedback, please merge.
@meteorcloudy @coeuvre Do you mind taking a look? Thanks!
@coeuvre Incorporated feedback, please take a look again.
Thanks!
Please merge it. I don't have permission.
@@ -252,4 +252,8 @@ public void close() {
    }
    channel.release();
  }

  RemoteRetrier getRetrier() {
    return this.retrier;
Is this needed at all? I don't see it being used anywhere, or in the interface.
It is used by test methods inside the RemoteModuleTest class.
if (!ignoredErrors.contains(e.getClass())) {
  int failureCount = failures.incrementAndGet();
  if (slidingWindowSize > 0) {
    scheduledExecutor.schedule(failures::decrementAndGet, slidingWindowSize, TimeUnit.MILLISECONDS);
Add a comment maybe? Why incrementAndGet and then decrement after a few milliseconds?
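For readers following the thread, here is a minimal sketch of the sliding-window pattern under discussion (class and member names are illustrative, not the PR's exact code): each failure bumps a counter, and a delayed task undoes that bump once the failure ages out of the window, so the counter only ever reflects failures seen within the last window.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: at any instant `failures` holds only the failures
// observed within the last `windowMillis` milliseconds.
class SlidingWindowFailureCounter {
  private final AtomicInteger failures = new AtomicInteger(0);
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final long windowMillis;
  private final int threshold;

  SlidingWindowFailureCounter(long windowMillis, int threshold) {
    this.windowMillis = windowMillis;
    this.threshold = threshold;
  }

  void recordFailure() {
    failures.incrementAndGet();
    // Undo the increment once this failure falls outside the window.
    scheduler.schedule(failures::decrementAndGet, windowMillis, TimeUnit.MILLISECONDS);
  }

  boolean shouldTrip() {
    return failures.get() >= threshold;
  }
}
```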
Continuation of #18359.

I ran multiple experiments and tried to find an optimal failure threshold and failure window interval with different remote_timeout values, for a healthy remote cache, a semi-healthy (overloaded) remote cache, and an unhealthy remote cache. As I described [here](#18359 (comment)), even with a healthy remote cache there were circuit trips 5-10% of the time and we were not getting the best result.

Issues related to the failure count:
1. When the remote cache is healthy, builds are fast and Bazel makes a high number of calls to the buildfarm. As a result, even with a moderate failure rate, the failure count may exceed the threshold.
2. Additionally, write calls, which have a higher probability of failure compared to other calls, are batched immediately after the completion of an action's build. This further increases the chances of breaching the failure threshold within the defined window interval.
3. On the other hand, when the remote cache is unhealthy or semi-healthy, builds are significantly slowed down and Bazel makes fewer calls to the remote cache.

Finding a configuration that works well for both healthy and unhealthy remote caches was not feasible. Therefore, I changed the approach to use the failure rate, and easily found a configuration that worked effectively in both scenarios.

Closes #18539.

PiperOrigin-RevId: 538588379
Change-Id: I64a49eeeb32846d41d54ca3b637ded3085588528
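A minimal sketch of the failure-rate approach described above, assuming successes and failures age out of the window the same way as in the earlier counter sketch (the class name, the minimum-call guard, and the thresholds are illustrative, not Bazel's actual code):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: trip on the failure percentage rather than the raw
// failure count, so fast builds with many calls are not penalized for a
// handful of failures.
class FailureRateTracker {
  private final AtomicInteger successes = new AtomicInteger(0);
  private final AtomicInteger failures = new AtomicInteger(0);
  private final double failureRateThreshold; // e.g. 0.10 for 10%
  private final int minCallCount;            // assumed guard against tiny samples

  FailureRateTracker(double failureRateThreshold, int minCallCount) {
    this.failureRateThreshold = failureRateThreshold;
    this.minCallCount = minCallCount;
  }

  void recordSuccess() { successes.incrementAndGet(); }

  void recordFailure() { failures.incrementAndGet(); }

  boolean shouldTrip() {
    int f = failures.get();
    int total = f + successes.get();
    // Only evaluate the rate once enough calls have been observed in the window.
    return total >= minCallCount && (double) f / total >= failureRateThreshold;
  }
}
```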
When the digest size exceeds the maximum digest size configured by the remote cache, an "out_of_range" error is returned. These errors should not be considered API failures for the circuit breaker logic, as they do not indicate any issue with the remote-cache service. Similarly, there are other non-retriable errors that should not be treated as server failures, such as ALREADY_EXISTS.

This change treats non-retriable errors as user/client errors and logs them as successes, while retriable errors such as `DEADLINE_EXCEEDED`, `UNKNOWN`, etc. are logged as failures.

Related PRs: #18359, #18539

Closes #18613.

PiperOrigin-RevId: 539948823
Change-Id: I5b51f6a3aecab7c17d73f78b8234d9a6da49fe6c
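A hedged sketch of the classification described above, using gRPC status codes; the exact set of codes treated as retriable here is an assumption for illustration and may differ from Bazel's real list:

```java
import java.util.EnumSet;
import java.util.Set;
import io.grpc.Status;

// Illustrative sketch: retriable server-side errors count as circuit-breaker
// failures, while non-retriable client-side errors (e.g. OUT_OF_RANGE for an
// oversized digest, ALREADY_EXISTS) are recorded as successes so they never
// trip the circuit.
class ErrorClassifier {
  // Assumed set of codes indicating a server-side problem; Bazel's actual
  // list may differ.
  private static final Set<Status.Code> RETRIABLE_CODES =
      EnumSet.of(
          Status.Code.DEADLINE_EXCEEDED,
          Status.Code.UNAVAILABLE,
          Status.Code.UNKNOWN,
          Status.Code.INTERNAL);

  static boolean countsAsFailure(Throwable t) {
    Status.Code code = Status.fromThrowable(t).getCode();
    return RETRIABLE_CODES.contains(code);
  }
}
```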
…#18559)

* feat: Implement failure circuit breaker

Copy of #18120: I accidentally closed #18120 during a rebase and don't have permission to reopen it.

We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented. To address this issue, I implemented a failure circuit breaker, which adds three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_remote_failure_window.

In this implementation, I have implemented a failure strategy for the circuit breaker and used the failure count to trip the circuit.

Reasoning behind using failure count instead of failure rate: to measure the failure rate I would also need the success count, and both the failure and success counts would need to be AtomicIntegers, as both are modified concurrently by multiple threads. Even though getAndIncrement is a very lightweight operation, at very high request rates it might contribute to latency.

Reasoning behind using a failure circuit breaker: a new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache will be disabled for that build. However, it will be enabled again for the next build, as a new instance of Retrier.CircuitBreaker will be created. If needed, we may also add a cool-down strategy in the future, e.g. failure_and_cool_down_strategy.

closes #18136
Closes #18359.

PiperOrigin-RevId: 536349954
Change-Id: I5e1c57d4ad0ce07ddc4808bf1f327bc5df6ce704

* remove target included in cherry-pick by mistake

* Use failure_rate instead of failure count for circuit breaker

---------

Co-authored-by: Ian (Hee) Cha <[email protected]>
Copy of #18120: I accidentally closed #18120 during a rebase and don't have permission to reopen it.
Issue
We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation we found that the interface for the circuit breaker was left unimplemented.
Solution
To address this issue, I implemented a failure circuit breaker, which adds three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_remote_failure_window.
In this implementation, I have implemented a failure strategy for the circuit breaker and used the failure count to trip the circuit.
Reasoning behind using failure count instead of failure rate: to measure the failure rate I would also need the success count, and both the failure and success counts would need to be AtomicIntegers, as both are modified concurrently by multiple threads. Even though getAndIncrement is a very lightweight operation, at very high request rates it might contribute to latency.
Reasoning behind using a failure circuit breaker: a new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache will be disabled for that build. However, it will be enabled again for the next build, as a new instance of Retrier.CircuitBreaker will be created (see the sketch below). If needed, we may also add a cool-down strategy in the future, e.g. failure_and_cool_down_strategy.
closes #18136
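To illustrate the per-build lifecycle described above, here is a hypothetical sketch (the names and shape are not Bazel's actual Retrier.CircuitBreaker API): a fresh breaker is constructed at the start of each build, and once it trips it stays tripped only for the remainder of that build.

```java
import java.util.function.BooleanSupplier;

// Hypothetical per-build wiring: the breaker instance lives only as long as
// one build, so a trip disables the remote cache for that build, and the next
// build starts with a fresh, closed breaker.
class BuildScopedCircuitBreaker {
  private final BooleanSupplier thresholdBreached; // e.g. a failure counter's shouldTrip()
  private volatile boolean tripped = false;        // never reset within a build

  BuildScopedCircuitBreaker(BooleanSupplier thresholdBreached) {
    this.thresholdBreached = thresholdBreached;
  }

  /** Returns true while the remote cache may still be used in this build. */
  boolean acceptCalls() {
    if (!tripped && thresholdBreached.getAsBoolean()) {
      tripped = true; // trip and stay tripped for the rest of the build
    }
    return !tripped;
  }
}
```

Because the instance is discarded at the end of the build, no explicit reset is needed, which matches the reasoning above for deferring a failure_and_cool_down_strategy.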