Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry policies don't sleep after operations time out #18548

Merged
merged 3 commits into from
May 25, 2021

Conversation

chlowell
Copy link
Member

@chlowell chlowell commented May 6, 2021

After an operation times out, _configure_timeout raises a subclass of AzureError. Async/RetryPolicy handles AzureError by sleeping and continuing the retry_active loop. The policy won't actually send another request, because _configure_timeout will raise at the start of every loop iteration, but it will pointlessly sleep after each iteration until it exhausts its allowed retries.

This PR prevents this pointless sleeping by re-raising _configure_timeout exceptions. This is safe today because no other policy raises those exceptions. For example, transports raise ServiceRequest/ResponseError when connections time out. However, in principle something downstream from RetryPolicy could raise e.g. ServiceResponseTimeoutError given a request that should be retried; with this change, the request would not be retried. I don't expect that scenario, but I point out the possibility so you can disagree with me as you like 😸

@@ -156,6 +158,9 @@ async def send(self, request): # pylint:disable=invalid-overridden-method
# the authentication policy failed such that the client's request can't
# succeed--we'll never have a response to it, so propagate the exception
raise
except (ServiceRequestTimeoutError, ServiceResponseTimeoutError):
Copy link
Member

@xiangyan99 xiangyan99 May 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this scenario, timeout setting is 10 min and default timeout is 5 min.

We should fail on the second TimeoutError but not the first one, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what you mean by a first vs. second error. _configure_timeout raises one of these errors when absolute_timeout <= 0. I believe we never want to retry when that's true.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant (ServiceRequestTimeoutError, ServiceResponseTimeoutError) might be raised from transport. In this case, absolute_timeout may still >0 and we do want to retry.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing else in the library raises these errors today but it's a fair point that a transport or policy could do so. I've removed this except block in favor of checking absolute_timeout in the broader handler.

@@ -455,7 +455,7 @@ def send(self, request):
# succeed--we'll never have a response to it, so propagate the exception
raise
except AzureError as err:
if self._is_method_retryable(retry_settings, request.http_request):
if absolute_timeout > 0 and self._is_method_retryable(retry_settings, request.http_request):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, but the problem is that we handle that error here by sleeping and iterating again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might mis-understand, why absolute_timeout can be <= 0 here?

Do you want to move

end_time = time.time() 
if absolute_timeout:
    absolute_timeout -= (end_time - start_time)

here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is tricky to refactor without breaking something. For now we need to decrement absolute_timeout in the finally block so it's done after every iteration, including ones that end with the continue in the try block. I'm picking low hanging fruit in this PR. I think fixing everything (e.g. #18552) will require more thought and a larger change.

As for how absolute_timeout can be non-positive here, note that nothing breaks the loop when it is. _configure_timeout raises an exception when absolute_timeout <= 0 but this except block handles that exception because it's an AzureError. If the request is retryable, the except block sleeps and we go back to the try block, where _configure_timeout raises again. Repeat until retries are exhausted.

Copy link
Member

@xiangyan99 xiangyan99 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Please update changelog. :)

@chlowell chlowell merged commit f9aa2ea into Azure:master May 25, 2021
@chlowell chlowell deleted the no-sleep-after-timeout branch May 25, 2021 15:26
rakshith91 pushed a commit to rakshith91/azure-sdk-for-python that referenced this pull request Jun 1, 2021
azure-sdk pushed a commit to azure-sdk/azure-sdk-for-python that referenced this pull request Apr 14, 2022
Review request for Microsoft.ContainerService to add version 2022-03-02-preview (Azure#18633)

* Adds base for updating Microsoft.ContainerService from version preview/2022-02-02-preview to version 2022-03-02-preview

* Updates readme

* Updates API version in new specs and examples

* feat: add `/rotateServiceAccountSigningKeys` API (Azure#18359)

* feat: add `/rotateServiceAccountSigningKeys` API

* chore: remove duplicated description

* feat: add `workloadIdentity` settings to `SecurityProfile` (Azure#18360)

* fix: update example name (Azure#18382)

* add creationData to mc data (Azure#18414)

* add creationData to mc data

* fix test

* fix format

* Add API properties and example JSON for Web App Routing of IngressProfile (Azure#18564)

* Add API properties and example JSON for Web App Routing of IngressProfile.

* Add a ending period for description to match the style in all other "descriptions" in the same file.

* Update readmes for the 2022-03-02-preview dev branch of container service (Azure#18358)

* update readme

* update sdk readmes

* Agentpool alias minor version 2022-03-02-preview (Azure#18381)

* Add field currentOrchestratorVersion to support Agentpool alias minor version

* Add new exmaple for alias minor version

* Remove example for PrettierCheck

* Fix typo

* Latest patch version supported is 1.22.7 at the moment

* Address comments - refine descriptions for fields

* feat: add ManagedCluster StorageProfile in 0302preview (Azure#18590)

Signed-off-by: Ji An Liu <[email protected]>

* Swagger change for ignore-pod-disruption-budget (Azure#18548)

* Swagger change for ignore-pod-disruption-budget

* Change ignorePodDisruptionBudget to string in example file.

* Change ignorePodDisruptionBudget to boolean type.

* add effectiveNoProxy for AKS (Azure#18544)

* add effectiveNoProxy for AKS

* add to correct api

* fix lowercase -> uppercase O

* spelling

* Replace common type definitions with references since 2022-03-02-preview for Microsoft.ContainerService  (Azure#18567)

* replace Resource

* replace SystemData

* replace parameters

* change track2 to python

Co-authored-by: hbc <[email protected]>
Co-authored-by: Qingqing <[email protected]>
Co-authored-by: Yi Zhang <[email protected]>
Co-authored-by: Thalia Wang <[email protected]>
Co-authored-by: Ji'an Liu <[email protected]>
Co-authored-by: Tong Chen <[email protected]>
Co-authored-by: Ace Eldeib <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants