Documentation update on max_concurrency behaviour in download_blob due to urllib3 connection pool limit #38054

Closed
anuragverma65 opened this issue Oct 23, 2024 · 5 comments · Fixed by #38254
Assignees
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that Service Attention Workflow: This issue is responsible by Azure service team. Storage Storage Service (Queues, Blobs, Files)

Comments

@anuragverma65

Type of issue

Missing information

Description

The Azure Storage SDK’s download_blob method allows users to set the max_concurrency parameter to enable parallel downloads for blobs larger than 64MB. By increasing max_concurrency, developers can potentially speed up blob downloads by using multiple connections simultaneously.

However, the underlying implementation of download_blob relies on requests and urllib3, where the default connection pool size is 10 per host (the default pool_maxsize of the requests HTTPAdapter). When max_concurrency is set to a value higher than the default pool size, this triggers a warning:

Connection pool is full, discarding connection

This behaviour can lead to inefficiencies and confusion, as developers may assume max_concurrency controls the number of connections directly, without realising that the connection pool size needs to be adjusted accordingly.
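The mismatch described above can be illustrated with a toy model. The sketch below is not urllib3's code; `ToyPool` and `simulate` are hypothetical names, and the model assumes a non-blocking pool that opens a new connection whenever no idle one is available and drops any returned connection that does not fit back into the pool.

```python
import queue


class ToyPool:
    """Toy stand-in for a bounded keep-alive connection pool (not urllib3 itself)."""

    def __init__(self, maxsize):
        self._idle = queue.LifoQueue(maxsize)  # holds idle, reusable connections
        self.created = 0
        self.discarded = 0

    def acquire(self):
        # Reuse an idle connection if one exists, otherwise open a new one.
        try:
            return self._idle.get(block=False)
        except queue.Empty:
            self.created += 1
            return f"conn-{self.created}"

    def release(self, conn):
        # Keep the connection for reuse only if the pool still has room.
        try:
            self._idle.put(conn, block=False)
        except queue.Full:
            self.discarded += 1  # the real pool logs a warning and closes it


def simulate(max_concurrency, pool_maxsize):
    """Check out max_concurrency connections at once, then return them all."""
    pool = ToyPool(pool_maxsize)
    in_flight = [pool.acquire() for _ in range(max_concurrency)]
    for conn in in_flight:
        pool.release(conn)
    return pool.created, pool.discarded


print(simulate(10, 10))  # (10, 0): every connection fits back into the pool
print(simulate(20, 10))  # (20, 10): ten connections are thrown away
```

In this model the extra connections still do their work; the cost is that they cannot be kept alive for reuse, so later chunks have to open fresh connections.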

Suggested Improvements:

1. Documentation Update: It would be helpful if the official documentation for `download_blob` clearly stated that increasing `max_concurrency` beyond the default connection pool size requires explicitly configuring the connection pool (e.g., through requests.Session() or a similar method).
2. Proactive Guidance: Adding a note or example in the documentation on how to configure the `BlobServiceClient` to adjust the pool size based on the intended `max_concurrency` would prevent potential issues and improve user experience.

This small clarification can prevent warnings and ensure that users get the expected performance when downloading large blobs with high concurrency.

Page URL

https://learn.microsoft.com/en-us/python/api/azure-storage-blob/azure.storage.blob.blobclient?view=azure-python#azure-storage-blob-blobclient-download-blob

Content source URL

https://github.com/MicrosoftDocs/azure-docs-sdk-python/blob/main/docs-ref-autogen/azure-storage-blob/azure.storage.blob.BlobClient.yml

Document Version Independent Id

9ee6555a-aaca-243f-409e-1ac5881e3dbc

Article author

@lmazuel

Metadata

  • ID: 2a557056-1da5-6c2d-fcee-e4e246a7a221
  • Service: azure-storage
@github-actions github-actions bot added Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that Service Attention Workflow: This issue is responsible by Azure service team. Storage Storage Service (Queues, Blobs, Files) labels Oct 23, 2024

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jalauzon-msft @vincenttran-msft.

@weirongw23-msft weirongw23-msft self-assigned this Oct 29, 2024
@weirongw23-msft
Member

Hi @anuragverma65 Anurag, thanks for bringing this to our attention. Could you please send us your sample code which triggers this warning: Connection pool is full, discarding connection?

@anuragverma65
Author

anuragverma65 commented Oct 29, 2024

Hi @weirongw23-msft, thank you for looking into it. Sure, here is sample code that triggers this warning (the blob should be large enough for max_concurrency to take effect; in my case, I tried it with a 1GB blob):

from azure.storage.blob import BlobServiceClient


def download_blob(storage_account_url, credential, container_name, blob_name, max_concurrency=10):
    blob_service_client = BlobServiceClient(account_url=storage_account_url, credential=credential)
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)

    # max_concurrency sets how many parallel connections are used to fetch the blob
    stream_downloader = blob_client.download_blob(max_concurrency=max_concurrency)

    with open("downloaded_blob", "wb") as file:
        stream_downloader.readinto(file)


if __name__ == "__main__":
    STORAGE_ACCOUNT_URL = "STORAGE_ACCOUNT_URL"
    CREDENTIAL = "CREDENTIAL"
    CONTAINER_NAME = "CONTAINER_NAME"
    BLOB_NAME = "BLOB_NAME"

    download_blob(STORAGE_ACCOUNT_URL, CREDENTIAL, CONTAINER_NAME, BLOB_NAME, max_concurrency=20)

If we set the concurrency to 10, the warning is not triggered because the number of connections stays within the default connection pool limit.

To handle this issue, I am creating a session with pool_maxsize increased to match max_concurrency:

import requests

session = requests.Session()
adapter = requests.adapters.HTTPAdapter(pool_maxsize=20)  # match max_concurrency
session.mount("https://", adapter)

and passing it to the BlobServiceClient.

@weirongw23-msft
Member

Hi @anuragverma65 Anurag, thank you for being patient and for your feedback. We've updated the documentation to include the connection pool note for all upload/download APIs across the Storage SDK, and we will consider providing a sample for how to configure the underlying connection pool size in the future.

@anuragverma65
Author

@weirongw23-msft Peter, thank you so much for your quick resolution of this issue!

@github-actions github-actions bot locked and limited conversation to collaborators Feb 3, 2025