-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retry policies don't sleep after operations time out #18548
Conversation
@@ -156,6 +158,9 @@ async def send(self, request): # pylint:disable=invalid-overridden-method | |||
# the authentication policy failed such that the client's request can't | |||
# succeed--we'll never have a response to it, so propagate the exception | |||
raise | |||
except (ServiceRequestTimeoutError, ServiceResponseTimeoutError): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given this scenario, timeout setting is 10 min and default timeout is 5 min.
We should fail on the second TimeoutError but not the first one, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what you mean by a first vs. second error. _configure_timeout
raises one of these errors when absolute_timeout <= 0
. I believe we never want to retry when that's true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant (ServiceRequestTimeoutError, ServiceResponseTimeoutError) might be raised from transport. In this case, absolute_timeout may still >0 and we do want to retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing else in the library raises these errors today but it's a fair point that a transport or policy could do so. I've removed this except block in favor of checking absolute_timeout
in the broader handler.
@@ -455,7 +455,7 @@ def send(self, request): | |||
# succeed--we'll never have a response to it, so propagate the exception | |||
raise | |||
except AzureError as err: | |||
if self._is_method_retryable(retry_settings, request.http_request): | |||
if absolute_timeout > 0 and self._is_method_retryable(retry_settings, request.http_request): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If absolute_timeout <= 0, we already raised in https://github.com/Azure/azure-sdk-for-python/blob/master/sdk/core/azure-core/azure/core/pipeline/policies/_retry.py#L295?
Not true?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, but the problem is that we handle that error here by sleeping and iterating again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might mis-understand, why absolute_timeout can be <= 0 here?
Do you want to move
end_time = time.time()
if absolute_timeout:
absolute_timeout -= (end_time - start_time)
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is tricky to refactor without breaking something. For now we need to decrement absolute_timeout in the finally
block so it's done after every iteration, including ones that end with the continue
in the try
block. I'm picking low hanging fruit in this PR. I think fixing everything (e.g. #18552) will require more thought and a larger change.
As for how absolute_timeout can be non-positive here, note that nothing breaks the loop when it is. _configure_timeout raises an exception when absolute_timeout <= 0 but this except
block handles that exception because it's an AzureError. If the request is retryable, the except
block sleeps and we go back to the try
block, where _configure_timeout
raises again. Repeat until retries are exhausted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Please update changelog. :)
Review request for Microsoft.ContainerService to add version 2022-03-02-preview (Azure#18633) * Adds base for updating Microsoft.ContainerService from version preview/2022-02-02-preview to version 2022-03-02-preview * Updates readme * Updates API version in new specs and examples * feat: add `/rotateServiceAccountSigningKeys` API (Azure#18359) * feat: add `/rotateServiceAccountSigningKeys` API * chore: remove duplicated description * feat: add `workloadIdentity` settings to `SecurityProfile` (Azure#18360) * fix: update example name (Azure#18382) * add creationData to mc data (Azure#18414) * add creationData to mc data * fix test * fix format * Add API properties and example JSON for Web App Routing of IngressProfile (Azure#18564) * Add API properties and example JSON for Web App Routing of IngressProfile. * Add a ending period for description to match the style in all other "descriptions" in the same file. * Update readmes for the 2022-03-02-preview dev branch of container service (Azure#18358) * update readme * update sdk readmes * Agentpool alias minor version 2022-03-02-preview (Azure#18381) * Add field currentOrchestratorVersion to support Agentpool alias minor version * Add new exmaple for alias minor version * Remove example for PrettierCheck * Fix typo * Latest patch version supported is 1.22.7 at the moment * Address comments - refine descriptions for fields * feat: add ManagedCluster StorageProfile in 0302preview (Azure#18590) Signed-off-by: Ji An Liu <[email protected]> * Swagger change for ignore-pod-disruption-budget (Azure#18548) * Swagger change for ignore-pod-disruption-budget * Change ignorePodDisruptionBudget to string in example file. * Change ignorePodDisruptionBudget to boolean type. * add effectiveNoProxy for AKS (Azure#18544) * add effectiveNoProxy for AKS * add to correct api * fix lowercase -> uppercase O * spelling * Replace common type definitions with references since 2022-03-02-preview for Microsoft.ContainerService (Azure#18567) * replace Resource * replace SystemData * replace parameters * change track2 to python Co-authored-by: hbc <[email protected]> Co-authored-by: Qingqing <[email protected]> Co-authored-by: Yi Zhang <[email protected]> Co-authored-by: Thalia Wang <[email protected]> Co-authored-by: Ji'an Liu <[email protected]> Co-authored-by: Tong Chen <[email protected]> Co-authored-by: Ace Eldeib <[email protected]>
After an operation times out,
_configure_timeout
raises a subclass of AzureError. Async/RetryPolicy handles AzureError by sleeping and continuing the retry_active loop. The policy won't actually send another request, because_configure_timeout
will raise at the start of every loop iteration, but it will pointlessly sleep after each iteration until it exhausts its allowed retries.This PR prevents this pointless sleeping by re-raising
_configure_timeout
exceptions. This is safe today because no other policy raises those exceptions. For example, transports raise ServiceRequest/ResponseError when connections time out. However, in principle something downstream from RetryPolicy could raise e.g. ServiceResponseTimeoutError given a request that should be retried; with this change, the request would not be retried. I don't expect that scenario, but I point out the possibility so you can disagree with me as you like 😸