Destination data has significantly larger billable size than the source data [for sparse page blobs] #391
Comments
@hpaul-osi Thanks for your detailed description. I suspect your diagnosis of the issue is correct, but I'll need to check with some colleagues to be sure. It's a tricky one to deal with, because when copying between accounts (or containers) AzCopy v10 doesn't actually see the content of the blobs. We just tell the destination to pull it directly from the source. That's great for throughput, but it means that AzCopy itself can't see whether a particular block is actually all zeros and so should not be copied. We'll give it some thought! In the meantime, the only workaround that I can think of in AzCopy v10 is to download the blobs to a file system (i.e. disk) and then upload them. That's a bit of a nuisance, because it's a two-step process, and it becomes difficult for very large payloads. If you need to try it, remember to put --blob-type PageBlob on the upload command line, otherwise they'll end up as block blobs!
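For anyone following along, a rough sketch of that two-step workaround might look like the following. The account names, container names, local path, and SAS tokens are placeholders, not values from this issue:

```sh
# Step 1: download the page blobs from the source container to a local (or VM) disk.
azcopy copy "https://sourceaccount.blob.core.windows.net/sourcecontainer?<SAS>" "D:\staging" --recursive

# Step 2: upload from disk to the destination container, forcing the page blob type.
# Without --blob-type PageBlob, the blobs would be re-uploaded as block blobs.
azcopy copy "D:\staging\sourcecontainer" "https://destaccount.blob.core.windows.net/destcontainer?<SAS>" --recursive --blob-type PageBlob
```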
Thank you for the quick response and the caution regarding the blob type. Unfortunately, our main driver for leveraging AzCopy for this was the ability to replicate data from one storage account to another directly without ever having to handle it on-prem. Pulling and pushing will make the process too slow with our expected volumes. Since AzCopy is capable of doing this in two steps, hopefully a future version will be able to handle it in one. I can see where the current behavior is advantageous since AzCopy doesn't need to know anything about the payload. Alternatively, it would also be great for throughput if AzCopy only had to send a small fraction of the data. I understand that this may not make sense as a default behavior, but an option to handle sparse blobs could save a lot on both bandwidth for the transfer and storage on the destination.
Yes, I totally agree with your points. BTW, if you do want/need to try the two-stage copy, there's no need to move it through an on-prem disk. Instead, you can use an Azure VM for that purpose. You'd need to pick one with sufficient network bandwidth, and ensure you have sufficient disk space to hold the full size of the blobs (i.e. the size as you currently see it at your destination). I'm not sure whether the optimal approach would be to use the temp drive or managed disks, but I can find out for you if you want. I do know that one of the new big managed disks (8, 16 or 32 TB) should work fine. You should end up seeing the "big" size only on the VM disk, and the destination size should equal the source size. I don't know whether this is a good option for you, but I'm just mentioning it in case. Finally, I have just heard back from a senior colleague. He has suggested a way that, in a future release, AzCopy could detect which pages actually contain content and only move them. I'll put that work on our backlog now, but can't give you any ETA for a release at this stage.
Good point re the Azure-hosted VM option. It doesn't address having to push and pull, but it definitely helps keep things isolated without much additional effort. I'll investigate this as a workaround and keep notifications active on this thread in case there is any good news in the future.
As a follow-up, I was able to push data directly from one storage account to another without this issue using the az CLI via the az storage blob copy start-batch command. For completeness, I'll note that I did test the proposed AzCopy push-pull workaround with PageBlob specified as the --blob-type argument, and it improved the destination storage usage, but not nearly as much as I'd hoped. The storage account with a 0.21 MB source that originally ballooned to 88 MB was only 36 MB with this method. Specifying a minimal block size (--block-size-mb=1) on the push dropped this further to 9 MB.
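For reference, a hedged sketch of the two approaches described above (all account/container names, paths, and SAS tokens are placeholders):

```sh
# Server-side batch copy with the az CLI, which preserved the sparse billable size in this case.
az storage blob copy start-batch \
    --account-name destaccount \
    --destination-container destcontainer \
    --source-account-name sourceaccount \
    --source-container sourcecontainer \
    --source-sas "<source-SAS>"

# AzCopy push with a small block size, which reduced (but did not eliminate) the inflation.
azcopy copy "D:\staging\sourcecontainer" "https://destaccount.blob.core.windows.net/destcontainer?<SAS>" --recursive --blob-type PageBlob --block-size-mb 1
```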
Thanks Harry. Glad to know you have a solution, and thanks for the additional info re the push-pull workaround. That gives us useful information about what we'll need to look out for when we implement a proper solution in AzCopy.
We'll be working on this in (at least) two parts, with the first part to be released in version 10.3.
For completeness and future reference: in version 10.3 and later, it's possible to set block sizes smaller than 1 MB. E.g. I just did a test with --block-size-mb 0.125 (= 128 KiB). You just need to use a spreadsheet or similar to find the exact decimal representation of the size you want. This can result in AzCopy finding more blocks that are all zeros and therefore don't need to be uploaded. However, it also results in more I/O operations against Storage (since each operation is smaller), so the transfer may be throttled by IOPS limits. These comments about block sizes apply to uploads, downloads, and service-to-service copies.
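As an illustration of that smaller block size (path, account names, and SAS are placeholders):

```sh
# 0.125 MB = 128 KiB; fractional values need an exact decimal representation.
azcopy copy "D:\staging\sourcecontainer" "https://destaccount.blob.core.windows.net/destcontainer?<SAS>" --recursive --blob-type PageBlob --block-size-mb 0.125
```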
This has been fixed in 10.3.0! @hpaul-osi We are really sorry for the inconvenience and appreciate your patience. |
Which version of AzCopy was used?
10.1.1
Which platform are you using? (ex: Windows, Mac, Linux)
Windows
What command did you run?
What problem was encountered?
After performing a container-to-container copy on a container with page blobs, the destination container is orders of magnitude larger than the source according to billing metrics. More details in the next section, but the same issue applies to a service-to-service (account-level) copy when the source account contains page blobs.
How can we reproduce the problem in the simplest way?
Create a storage account and container with several 4 MB page blobs with a small amount of data in them. Copy the container to another container with the azcopy copy command as specified under the "What command did you run?" section.
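The original command isn't reproduced here; a representative container-to-container (service-to-service) copy would be something along these lines, with account names, container names, and SAS tokens as placeholders:

```sh
azcopy copy "https://sourceaccount.blob.core.windows.net/sourcecontainer?<SAS>" "https://destaccount.blob.core.windows.net/destcontainer?<SAS>" --recursive
```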
A naive size calculation like the one described in "Calculate the size of a Blob storage container" shows the same value for both source and destination containers (n × 4 MB, where n is the number of page blobs in the container).
However, the more precise calculation outlined in "Calculate the total billing size of a blob container" shows the destination as n × 4 MB, whereas the source reflects actual utilization. As a more concrete example, for one of the containers with 22 page blobs, the source container is 0.21 MB and the destination container is 88 MB. At the storage account level, where there are hundreds of these containers, a source storage account of roughly 100-200 MB balloons to roughly 100-200 GB at the destination. Note: the "Used capacity" metric under the storage account's "Metrics" blade in the portal agrees with this more precise calculation.
Given the discrepancy, I think that AzCopy is writing the copied blobs in such a way that the whole blob is consumed instead of only the portion with data.
Have you found a mitigation/solution?
Not yet. I will update this thread if I do.