Whenever an operation is executed through the .NET SQL SDK, a public API is invoked. The API leverages the ClientContext to create the Diagnostics scope for the operation (including any retries involved) and to create the RequestMessage. The handler pipeline processes the RequestMessage and performs actions like handling retries, including any custom user handlers added through CosmosClientOptions.CustomHandlers. See the pipeline section for more details.
At the end of the pipeline, the request is sent to the transport layer, which processes the request depending on the CosmosClientOptions.ConnectionMode and uses Gateway or Direct connectivity mode to reach the Azure Cosmos DB service.
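As a minimal sketch (the account endpoint and key are placeholders), the connectivity mode is chosen through CosmosClientOptions when the client is created:

```csharp
using Microsoft.Azure.Cosmos;

// ConnectionMode.Direct (TCP) is the default; Gateway sends all requests over HTTPS.
CosmosClient client = new CosmosClient(
    accountEndpoint: "https://your-account.documents.azure.com:443/", // placeholder
    authKeyOrResourceToken: "<account-key>",                          // placeholder
    clientOptions: new CosmosClientOptions
    {
        ConnectionMode = ConnectionMode.Gateway
    });
```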
```mermaid
flowchart LR
    PublicApi[Public API]
    PublicApi <--> ClientContext[Client Context]
    ClientContext <--> Pipeline[Handler Pipeline]
    Pipeline <--> TransportClient[Configured Transport]
    TransportClient <--> Service[(Cosmos DB Service)]
```
The handler pipeline processes the RequestMessage: each handler can choose to augment it in different ways (as shown in our handler samples) and to handle certain error conditions and retry, like our own RetryHandler, which handles any failures from the Transport layer that should be treated as regional failovers.
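As an illustrative sketch following the pattern in those samples (the handler name and header are hypothetical), a custom handler derives from RequestHandler and is registered through CosmosClientOptions.CustomHandlers:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Hypothetical handler that stamps a correlation header on every request.
public class CorrelationHandler : RequestHandler
{
    public override async Task<ResponseMessage> SendAsync(
        RequestMessage request,
        CancellationToken cancellationToken)
    {
        request.Headers.Add("x-app-correlation-id", Guid.NewGuid().ToString());
        return await base.SendAsync(request, cancellationToken); // continue down the pipeline
    }
}
```

Registration adds the handler to the user-defined segment of the pipeline (after RequestInvokerHandler in the diagram below):

```csharp
CosmosClientOptions options = new CosmosClientOptions();
options.CustomHandlers.Add(new CorrelationHandler());
```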
The default pipeline structure is:
```mermaid
flowchart
    RequestInvokerHandler <--> UserHandlers([Any User defined handlers])
    UserHandlers <--> DiagnosticHandler
    DiagnosticHandler <--> TelemetryHandler
    TelemetryHandler <--> RetryHandler[Cross region retries]
    RetryHandler <--> ThrottlingRetries[Throttling retries]
    ThrottlingRetries <--> RouteHandler
    RouteHandler <--> IsPartitionedFeedOperation{{Partitioned Feed operation?}}
    IsPartitionedFeedOperation <-- No --> TransportHandler
    IsPartitionedFeedOperation <-- Yes --> InvalidPartitionExceptionRetryHandler
    InvalidPartitionExceptionRetryHandler <--> PartitionKeyRangeHandler
    PartitionKeyRangeHandler <--> TransportHandler
    TransportHandler <--> TransportClient[[Selected Transport]]
```
Any HTTP response with a status code 429 from the service means the current operation is being rate limited, and it is handled by the RetryHandler through the ResourceThrottleRetryPolicy. The policy retries the operation using the delay indicated in the x-ms-retryafter response header, up to the maximum configured in CosmosClientOptions.MaxRetryAttemptsOnRateLimitedRequests (default value of 9).
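As a minimal sketch, the throttling retry behavior can be tuned through CosmosClientOptions (the values below are arbitrary examples):

```csharp
using System;
using Microsoft.Azure.Cosmos;

CosmosClientOptions options = new CosmosClientOptions
{
    // Maximum number of retries on HTTP 429 responses (default is 9).
    MaxRetryAttemptsOnRateLimitedRequests = 5,
    // Maximum cumulative time to wait across those retries.
    MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(10)
};
```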
Any failure response from the Transport that matches the conditions for cross-regional communication is handled by the RetryHandler through the ClientRetryPolicy. A CancellationToken passed down as input to the Public API can stop these retries (see the sketch after this list). The handled conditions are:
- HTTP connection failures or DNS resolution problems (HttpRequestException) - The account information is refreshed, the current region is marked unavailable, and the request is retried on the next available region after the account refresh if the account has multiple regions. If no other regions are available, the SDK keeps retrying, refreshing the account information each time, up to a maximum number of times.
- HTTP 403 with Substatus 3 - The current region is no longer a Write region (write region failover), the account information is refreshed, and the request is retried on the new Write region.
- HTTP 403 with Substatus 1008 - The current region is not available (adding or removing a region). The region is marked as unavailable for 5 minutes and the request retried on the next available region.
- HTTP 404 with Substatus 1002 - A session consistency request where the region has not yet received the requested Session Token; the request is retried on the primary (write) region for accounts with a single write region, or on the next preferred region for accounts with multiple write regions.
- HTTP 503 - The request could not succeed on the region due to repeated TCP connectivity issues; the request is retried on the next preferred region.
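As a minimal sketch of stopping these retries from the caller side (the database, container, item id, and partition key are placeholders; assumes an existing CosmosClient named client):

```csharp
using System;
using System.Net;
using System.Threading;
using Microsoft.Azure.Cosmos;

Container container = client.GetContainer("database-id", "container-id"); // placeholders

// Bound the end-to-end operation time; when the token fires, retries stop.
using CancellationTokenSource cts = new CancellationTokenSource(TimeSpan.FromSeconds(10));
try
{
    ItemResponse<dynamic> response = await container.ReadItemAsync<dynamic>(
        id: "item-id",                              // placeholder
        partitionKey: new PartitionKey("pk-value"), // placeholder
        cancellationToken: cts.Token);
}
catch (CosmosException ex) when (ex.StatusCode == HttpStatusCode.ServiceUnavailable)
{
    // The applicable regions were exhausted without success.
}
```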
Once a RequestMessage reaches the TransportHandler, it is sent through either the GatewayStoreModel (for HTTP requests and Gateway mode clients) or the ServerStoreModel (for clients configured with Direct mode).
Even on clients configured with Direct mode, there can be HTTP requests that get routed to the Gateway. The ConnectionMode defined in the CosmosClientOptions affects data-plane operations (operations related to Items, like CRUD or queries over existing Items in a Container), but metadata/control-plane operations (which appear as MetadataRequests in Azure Monitor) are sent through HTTP to the Gateway.
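Because metadata requests always travel over HTTP, CosmosClientOptions.HttpClientFactory influences even Direct mode clients; a minimal sketch:

```csharp
using System.Net.Http;
using Microsoft.Azure.Cosmos;

CosmosClientOptions options = new CosmosClientOptions
{
    ConnectionMode = ConnectionMode.Direct,
    // Metadata/control-plane requests still flow over HTTP and use this HttpClient.
    HttpClientFactory = () => new HttpClient()
};
```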
The ServerStoreModel contains the Direct connectivity stack, which takes care of discovering, for each operation, which physical partition to route to and which replica(s) to contact. The Direct connectivity stack includes a retry layer, a consistency component, and the TCP protocol implementation.
The GatewayStoreModel connects to the Cosmos DB Gateway and sends HTTP requests through our CosmosHttpClient, which wraps the HttpClient with a retry layer to handle transient timeouts.
```mermaid
flowchart
    Start{{HTTP or Direct operation?}}
    Start <-- HTTP --> GatewayStoreModel
    Start <-- Direct --> ServerStoreModel
    TCPClient <-- TCP --> R1
    TCPClient <-- TCP --> R17
    TCPClient <-- TCP --> R20
    GatewayService <-- TCP --> R6
    GatewayService <-- TCP --> R3
    GatewayService <-- TCP --> R2
    ServerStoreModel <--> StoreClient
    StoreClient <--> ReplicatedResourceClient
    ReplicatedResourceClient <-- Retries using --> DirectRetry[[Direct mode retry layer]]
    DirectRetry <--> Consistency[[Consistency stack]]
    Consistency <--> TCPClient[TCP implementation]
    GatewayStoreModel <--> GatewayStoreClient
    GatewayStoreClient <--> CosmosHttpClient
    CosmosHttpClient <-- Retries using --> HttpTimeoutPolicy[[HTTP retry layer]]
    HttpTimeoutPolicy <-- HTTPS --> GatewayService
    subgraph Service
        subgraph Partition1
            R1[(Replica 1)]
            R2[(Replica 2)]
            R6[(Replica 6)]
            R8[(Replica 8)]
        end
        subgraph Partition2
            R3[(Replica 3)]
            R20[(Replica 20)]
            R10[(Replica 10)]
            R17[(Replica 17)]
        end
        GatewayService[Gateway Service]
    end
```
The HttpClient is wrapped in a CosmosHttpClient, which employs an HttpTimeoutPolicy to retry if the request hits a transient failure (timeout) or takes longer than expected. Requests are canceled when latency is higher than expected, to account for transient network delays (retrying can be faster than waiting for the request to fail) and for scenarios where the Cosmos DB Gateway is performing rollout upgrades on its endpoints.
The different HTTP retry policies are:
- HttpTimeoutPolicyControlPlaneRetriableHotPath for control plane (metadata) operations involved in a data plane hot path (like obtaining the Query Plan for a query operation).
- HttpTimeoutPolicyControlPlaneRead for control plane (metadata) reads outside the hot path (like initialization or reading the account information).
- HttpTimeoutPolicyNoRetry currently only used for Client Telemetry.
- HttpTimeoutPolicyDefault used for data-plane item interactions (like CRUD) when the SDK is configured in Gateway mode.
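These policy classes are internal to the SDK. As a purely illustrative sketch of the pattern they implement (the method name and the escalating per-attempt timeouts are hypothetical), each attempt is canceled once its timeout elapses and the next attempt gets a fresh request:

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public static class HttpTimeoutSketch
{
    // Illustrative only: each attempt gets its own timeout; slow attempts are
    // canceled and retried with a fresh HttpRequestMessage (requests can't be reused).
    public static async Task<HttpResponseMessage> SendWithTimeoutsAsync(
        HttpClient httpClient,
        Func<HttpRequestMessage> requestFactory,
        TimeSpan[] perAttemptTimeouts,          // hypothetical escalating timeouts
        CancellationToken cancellationToken)
    {
        foreach (TimeSpan timeout in perAttemptTimeouts)
        {
            using CancellationTokenSource attemptCts =
                CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
            attemptCts.CancelAfter(timeout);    // cancel this attempt if it runs long
            try
            {
                return await httpClient.SendAsync(requestFactory(), attemptCts.Token);
            }
            catch (OperationCanceledException) when (!cancellationToken.IsCancellationRequested)
            {
                // The attempt exceeded its timeout; retry with the next, larger timeout.
            }
        }

        throw new TimeoutException("All attempts exceeded their configured timeouts.");
    }
}
```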
```mermaid
flowchart
    GatewayStoreModel --> GatewayStoreClient
    GatewayStoreClient -- selects --> HttpTimeoutPolicy
    HttpTimeoutPolicy -- sends request through --> CosmosHttpClient
    CosmosHttpClient <-- HTTPS --> GatewayService[(Gateway Service)]
    GatewayService --> IsSuccess{{Request succeeded?}}
    IsSuccess -- No --> IsRetriable{{Transient retryable error / reached max latency?}}
    IsRetriable -- Yes --> HttpTimeoutPolicy
```
Direct connectivity is obtained from the Microsoft.Azure.Cosmos.Direct package reference. The code links below point to an example branch in this repository that contains the Microsoft.Azure.Cosmos.Direct source code, but that branch might not be up to date with the latest source code.
The ServerStoreModel uses the StoreClient to execute Direct operations. The StoreClient captures session token updates (in the case of Session consistency) and calls the ReplicatedResourceClient, which wraps the request in the GoneAndRetryWithRequestRetryPolicy. That policy retries for up to 30 seconds (or until the user's CancellationToken is canceled); a CancellationToken passed down as input to the Public API can stop these retries. If the retry period (30 seconds) is exhausted, it returns an HTTP 503 ServiceUnavailable error to the caller. The policy takes care of handling:
- HTTP 410 Substatus 0 (replica moved) responses from the service or TCP timeouts (connect or send timeouts) -> Refreshes partition addresses and retries.
- HTTP 410 Substatus 1008 (partition is migrating) responses from the service -> Refreshes partition map (rediscovers partitions) and retries.
- HTTP 410 Substatus 1007 (partition is splitting) responses from the service -> Refreshes partition map (rediscovers partitions) and retries.
- HTTP 410 Substatus 1000 responses from the service, up to 3 times -> Refreshes container map (for cases when the container was recreated with the same name) and retries.
ℹ️ There is no delay for the first retry. Subsequent retries start at 1 second and, with a backoff multiplier of 2, can go up to 15 seconds.
- HTTP 449 responses from the service -> Retries using a random salted period.
ℹ️ There is no delay for the first retry. Subsequent retries start at 10 milliseconds and, with a backoff multiplier of 2 and a 5 millisecond salt, can go up to 1 second.
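As a sketch of the backoff shapes just described (the values come from this document, not from the SDK source; the uniform salt distribution is an assumption):

```csharp
using System;
using System.Collections.Generic;

public static class DirectBackoffSketch
{
    // HTTP 410 handling: no delay on the first retry, then 1s doubling, capped at 15s.
    public static IEnumerable<TimeSpan> GoneBackoff()
    {
        yield return TimeSpan.Zero;
        for (double seconds = 1; ; seconds *= 2)
        {
            yield return TimeSpan.FromSeconds(Math.Min(seconds, 15));
        }
    }

    // HTTP 449 handling: no delay on the first retry, then 10ms doubling plus a
    // 5ms salt (assumed uniform random here), capped at 1 second.
    public static IEnumerable<TimeSpan> RetryAfterBackoff(Random random)
    {
        yield return TimeSpan.Zero;
        for (double milliseconds = 10; ; milliseconds *= 2)
        {
            yield return TimeSpan.FromMilliseconds(
                Math.Min(milliseconds + (random.NextDouble() * 5), 1000));
        }
    }
}
```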
```mermaid
flowchart
    ServerStoreModel --> StoreClient
    StoreClient --> ReplicatedResourceClient
    ReplicatedResourceClient -- uses --> GoneAndRetryWithRequestRetryPolicy
    GoneAndRetryWithRequestRetryPolicy --> Consistency[[Consistency stack]]
    Consistency --> TCPClient[TCP implementation]
    TCPClient --> Replica[(Replica X)]
    Replica --> IsSuccess{{Request succeeded?}}
    IsSuccess -- No --> IsRetryable{{Is retryable condition?}}
    IsRetryable -- Yes --> GoneAndRetryWithRequestRetryPolicy
```
Per our connectivity documentation, the SDK will store, in internal caches, critical information to allow for request routing.
For details on the caches, please see the cache design documentation.
When performing operations through Direct mode, the SDK is involved in checking consistency for Bounded Staleness and Strong accounts. Read requests are handled by the ConsistencyReader and write requests by the ConsistencyWriter. When the consistency is Bounded Staleness or Strong, the ConsistencyReader uses the QuorumReader to verify quorum after performing two requests and comparing their LSNs. If quorum cannot be achieved, the SDK starts what are defined as "barrier requests" to the container and waits for it to achieve quorum. The ConsistencyWriter performs a similar check after receiving the response from the write, comparing the GlobalCommittedLSN and the item LSN; if they don't match, barrier requests are also performed.
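As a purely illustrative sketch of the read path just described (the response type, delegates, and attempt bound are hypothetical; the real QuorumReader lives in the Microsoft.Azure.Cosmos.Direct source):

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for a replica response; only the LSN matters for the check.
public record ReplicaResponse(long Lsn, string Payload);

public static class QuorumReadSketch
{
    public static async Task<ReplicaResponse> ReadWithQuorumAsync(
        Func<Task<ReplicaResponse>> readFromReplicaAsync, // one replica read per call
        Func<Task> issueBarrierRequestAsync,              // barrier request to the container
        int maxBarrierAttempts = 4)                       // hypothetical bound
    {
        for (int attempt = 0; attempt < maxBarrierAttempts; attempt++)
        {
            // Two reads; matching LSNs mean the replicas agree and quorum is verified.
            ReplicaResponse first = await readFromReplicaAsync();
            ReplicaResponse second = await readFromReplicaAsync();
            if (first.Lsn == second.Lsn)
            {
                return first;
            }

            // LSNs diverge: issue a barrier request and wait for quorum to catch up.
            await issueBarrierRequestAsync();
        }

        throw new TimeoutException("Quorum could not be verified.");
    }
}
```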
```mermaid
flowchart LR
    ReplicatedResourceClient --> RequestType{{Is it a read request?}}
    RequestType -- Yes --> ConsistencyReader
    RequestType -- No --> ConsistencyWriter
    ConsistencyReader --> QuorumReader
    QuorumReader --> TCPClient[TCP implementation]
    ConsistencyWriter --> TCPClient
```