
feat: Implement failure circuit breaker #18359

Closed
wants to merge 7 commits

Conversation

amishra-u
Contributor

@amishra-u amishra-u commented May 10, 2023

Copy of #18120: I accidentally closed #18120 during a rebase and don't have permission to reopen it.

Issue

We have noticed that any problems with the remote cache have a detrimental effect on build times. On investigation, we found that the circuit breaker interface was left unimplemented.

Solution

To address this issue, I implemented a failure circuit breaker, which introduces three new Bazel flags: 1) experimental_circuitbreaker_strategy, 2) experimental_remote_failure_threshold, and 3) experimental_remote_failure_window.
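For illustration, a build might opt in along these lines (a hypothetical invocation: the target is a placeholder, the strategy value is assumed, and the threshold/window values are the ones used in the experiments below):

```
bazel build //some:target \
  --experimental_circuitbreaker_strategy=failure \
  --experimental_remote_failure_threshold=80 \
  --experimental_remote_failure_window=10s
```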

In this implementation, the circuit breaker uses the failure strategy and trips the circuit based on the failure count.

Reasoning behind using failure count instead of failure rate: To measure the failure rate, I would also need the success count. Both the failure and success counts would have to be AtomicIntegers, as both are modified concurrently by multiple threads. Even though getAndIncrement is a very lightweight operation, at a very high request rate it might contribute to latency.
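As a rough sketch of this count-based design (hypothetical class and method names; the real implementation lives behind Bazel's Retrier.CircuitBreaker interface):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Count-based failure breaker: each failure is "forgotten" after windowMillis,
 * so the counter approximates the number of failures in a sliding window.
 */
class FailureCircuitBreaker {
  private final AtomicInteger failures = new AtomicInteger(0);
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();
  private final int failureThreshold;
  private final long windowMillis;
  private volatile boolean tripped = false;

  FailureCircuitBreaker(int failureThreshold, long windowMillis) {
    this.failureThreshold = failureThreshold;
    this.windowMillis = windowMillis;
  }

  /** Successes are not tracked at all, avoiding a second atomic update on the hot path. */
  void recordFailure() {
    int count = failures.incrementAndGet();
    // Undo this failure's contribution once it falls out of the window.
    scheduler.schedule(failures::decrementAndGet, windowMillis, TimeUnit.MILLISECONDS);
    if (count > failureThreshold) {
      tripped = true; // stays open for the remainder of this build
    }
  }

  boolean isTripped() {
    return tripped;
  }
}
```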

Reasoning behind using a failure circuit breaker: A new instance of Retrier.CircuitBreaker is created for each build. Therefore, if the circuit breaker trips during a build, the remote cache is disabled for that build, but it is enabled again for the next build, when a new instance of Retrier.CircuitBreaker is created. If needed, we may also add a cool-down strategy in the future, e.g. failure_and_cool_down_strategy.

closes #18136

@amishra-u
Contributor Author

#18120 (comment)

I ran multiple experiments on a full monorepo build with an unhealthy, a semi-healthy, and a healthy remote cache, with the following goals:

  1. The build time with a circuit breaker and an unhealthy remote cache remains reasonably close to the build time with no remote cache at all.
  2. The circuit breaker introduces no latency of its own: with a healthy remote cache, the build time is the same as without a circuit breaker.

Here is a summary of the experiment results (failure threshold of 80 within a 10s window):

  1. Unhealthy remote cache: Earlier, we had reduced the buildfarm timeout to 1s to cope with an unhealthy remote cache. I increased the remote timeout to 10s, and the build time rose by only 2-3 minutes (within acceptable limits) compared to building with no remote cache. Within the first 2 minutes, the circuit tripped for 80% of the build shards.
  2. Semi-healthy remote cache: With a semi-healthy remote cache and a 10s remote timeout, build time increased by 1-2 minutes compared to a healthy remote cache without a circuit breaker. Approximately 40-60% of the build shards experienced circuit trips within the first 2 minutes of the build.
  3. Healthy remote cache: With a fully functional remote cache and a 10s remote timeout, the circuit tripped for only 5-10% of builds. Overall, this configuration yielded better results than the existing setup.

I experimented with different failure thresholds and timeouts. While a lower failure threshold is best suited to handling an unhealthy remote cache, it produced more false positives and longer build times with a healthy remote cache. Conversely, a higher failure threshold delayed the circuit trip and prolonged the build with an unhealthy remote cache. This suggests implementing a cooldown strategy for the circuit, which would allow a slightly lower threshold without impacting build times against an unhealthy remote cache.
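For reference, a cool-down variant could look something like this sketch (purely illustrative; nothing like this is implemented in this PR):

```java
/**
 * Hypothetical cool-down breaker: instead of staying open for the rest of the
 * build, the circuit closes again after cooldownMillis, so a lower trip
 * threshold would not permanently disable the remote cache for the build.
 */
class CoolDownCircuitBreaker {
  private final long cooldownMillis;
  private volatile long openedAtMillis = -1; // -1 means the circuit is closed

  CoolDownCircuitBreaker(long cooldownMillis) {
    this.cooldownMillis = cooldownMillis;
  }

  void trip() {
    openedAtMillis = System.currentTimeMillis();
  }

  boolean allowCall() {
    long openedAt = openedAtMillis;
    return openedAt < 0 || System.currentTimeMillis() - openedAt >= cooldownMillis;
  }
}
```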

Furthermore, I verified that there were no circuit trips for "not_found" errors.

@amishra-u amishra-u marked this pull request as ready for review May 10, 2023 18:32
@amishra-u amishra-u requested a review from a team as a code owner May 10, 2023 18:32
@github-actions github-actions bot added the awaiting-review and team-Remote-Exec labels May 10, 2023
@amishra-u
Contributor Author

@werkt George, can you please review this? Sorry, I had to create a new PR due to the rebase issue.

@amishra-u
Contributor Author

@shirchen

Contributor

@werkt werkt left a comment


Seems pretty good short of some nits.

@amishra-u
Contributor Author

> Seems pretty good short of some nits.

Incorporated feedback, please merge.

@shirchen

@meteorcloudy @coeuvre Do you mind taking a look? Thanks!

@meteorcloudy meteorcloudy requested a review from coeuvre May 19, 2023 13:52
@amishra-u
Contributor Author

@coeuvre Incorporated feedback, please take a look again.

Member

@coeuvre coeuvre left a comment


Thanks!

@coeuvre coeuvre added the awaiting-PR-merge label and removed the awaiting-review label May 23, 2023
@amishra-u
Contributor Author

> Thanks!

Please merge it. I don't have permission.

@@ -252,4 +252,8 @@ public void close() {
    }
    channel.release();
  }

  RemoteRetrier getRetrier() {
    return this.retrier;
  }

is this needed at all? I don't see this getting used anywhere, or in the interface.

Contributor Author


It is used by test methods inside the RemoteModuleTest class.

if (!ignoredErrors.contains(e.getClass())) {
  int failureCount = failures.incrementAndGet();
  if (slidingWindowSize > 0) {
    // Each recorded failure is undone after slidingWindowSize ms, so the
    // counter only reflects failures observed within the current sliding window.
    scheduledExecutor.schedule(
        failures::decrementAndGet, slidingWindowSize, TimeUnit.MILLISECONDS);


Maybe add a comment? Why incrementAndGet and then decrement after a few milliseconds?

@sgowroji sgowroji removed the awaiting-PR-merge label May 30, 2023
amishra-u added a commit to amishra-u/bazel that referenced this pull request May 30, 2023
feat: Implement failure circuit breaker

Closes bazelbuild#18359.

PiperOrigin-RevId: 536349954
Change-Id: I5e1c57d4ad0ce07ddc4808bf1f327bc5df6ce704
iancha1992 pushed a commit that referenced this pull request Jun 1, 2023
* feat: Implement failure circuit breaker


Closes #18359.

PiperOrigin-RevId: 536349954
Change-Id: I5e1c57d4ad0ce07ddc4808bf1f327bc5df6ce704

* remove target included in cherry-pick by mistake
copybara-service bot pushed a commit that referenced this pull request Jun 7, 2023
Continuation of #18359
I ran multiple experiments to find the optimal failure threshold and failure window interval under different remote_timeout values, for a healthy remote cache, a semi-healthy (overloaded) remote cache, and an unhealthy remote cache.
As I described [here](#18359 (comment)), even with a healthy remote cache the circuit tripped for 5-10% of builds, and we were not getting the best results.

Issues related to the failure count:
1. When the remote cache is healthy, builds are fast, and Bazel makes a high number of calls to the buildfarm. As a result, even with a moderate failure rate, the failure count may exceed the threshold.
2. Additionally, write calls, which have a higher probability of failure compared to other calls, are batched immediately after the completion of an action's build. This further increases the chances of breaching the failure threshold within the defined window interval.
3. On the other hand, when the remote cache is unhealthy or semi-healthy, builds are significantly slowed down, and Bazel makes fewer calls to the remote cache.

Finding a single configuration that worked well for both healthy and unhealthy remote caches was not feasible. Therefore, I changed the approach to use the failure rate, and easily found a configuration that worked effectively in both scenarios.
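A minimal sketch of the rate-based variant (hypothetical names; this is not the exact code of the change): both outcomes are counted, and the trip decision normalizes for call volume, which is what a raw count cannot do:

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Rate-based breaker sketch: trips on failure rate, not absolute failure count. */
class FailureRateCircuitBreaker {
  private final AtomicInteger failures = new AtomicInteger(0);
  private final AtomicInteger successes = new AtomicInteger(0);
  private final double failureRateThreshold; // e.g. 0.1 for 10%
  private final int minCallCount; // guard against tripping on a tiny sample

  FailureRateCircuitBreaker(double failureRateThreshold, int minCallCount) {
    this.failureRateThreshold = failureRateThreshold;
    this.minCallCount = minCallCount;
  }

  void recordSuccess() {
    successes.incrementAndGet();
  }

  void recordFailure() {
    failures.incrementAndGet();
  }

  boolean isTripped() {
    int failed = failures.get();
    int total = failed + successes.get();
    // A fast, healthy build making many calls no longer trips the breaker the
    // way a raw failure-count threshold did.
    return total >= minCallCount && (double) failed / total > failureRateThreshold;
  }
}
```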

Closes #18539.

PiperOrigin-RevId: 538588379
Change-Id: I64a49eeeb32846d41d54ca3b637ded3085588528
amishra-u added a commit to amishra-u/bazel that referenced this pull request Jun 7, 2023
Use failure_rate instead of failure count for circuit breaker

Closes bazelbuild#18539.

PiperOrigin-RevId: 538588379
Change-Id: I64a49eeeb32846d41d54ca3b637ded3085588528
copybara-service bot pushed a commit that referenced this pull request Jun 13, 2023
When a digest size exceeds the maximum digest size configured by the remote cache, an OUT_OF_RANGE error is returned. These errors should not be counted as API failures by the circuit breaker logic, as they do not indicate any problem with the remote cache service.
Similarly, there are other non-retriable errors, such as ALREADY_EXISTS, that should not be treated as server failures.

This change treats non-retriable errors as user/client errors and logs them as successes, while retriable errors such as `DEADLINE_EXCEEDED`, `UNKNOWN`, etc. are logged as failures.
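In sketch form, the classification might look like the following (hypothetical helper; the status codes are the gRPC codes named above):

```java
import io.grpc.StatusRuntimeException;

/** Sketch: decide whether an error should count as a circuit-breaker failure. */
final class CircuitBreakerErrors {
  static boolean countsAsFailure(Exception e) {
    if (!(e instanceof StatusRuntimeException)) {
      return true; // assumption for this sketch: non-gRPC exceptions count as failures
    }
    switch (((StatusRuntimeException) e).getStatus().getCode()) {
      case OUT_OF_RANGE:   // e.g. digest exceeds the cache's configured maximum size
      case ALREADY_EXISTS: // user/client-level outcomes: logged as success
        return false;
      case DEADLINE_EXCEEDED: // retriable: the service may genuinely be unhealthy
      case UNKNOWN:
        return true;
      default:
        // The real change keys off retriability; this sketch just defaults to failure.
        return true;
    }
  }
}
```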

Related PRs
#18359
#18539

Closes #18613.

PiperOrigin-RevId: 539948823
Change-Id: I5b51f6a3aecab7c17d73f78b8234d9a6da49fe6c
iancha1992 added a commit that referenced this pull request Jun 13, 2023
…#18559)

* feat: Implement failure circuit breaker


* remove target included in cherry-pick by mistake

* Use failure_rate instead of failure count for circuit breaker

---------

Co-authored-by: Ian (Hee) Cha <[email protected]>
traversaro pushed a commit to traversaro/bazel that referenced this pull request Jun 24, 2023
Use failure_rate instead of failure count for circuit breaker

Closes bazelbuild#18539.

PiperOrigin-RevId: 538588379
Change-Id: I64a49eeeb32846d41d54ca3b637ded3085588528
traversaro pushed a commit to traversaro/bazel that referenced this pull request Jun 24, 2023
Treat non-retriable errors (e.g. OUT_OF_RANGE, ALREADY_EXISTS) as user/client errors rather than circuit-breaker failures.

Closes bazelbuild#18613.

PiperOrigin-RevId: 539948823
Change-Id: I5b51f6a3aecab7c17d73f78b8234d9a6da49fe6c
Labels
team-Remote-Exec Issues and PRs for the Execution (Remote) team
Development

Successfully merging this pull request may close these issues.

Add Implementation for Circuit breaker
6 participants