
Destination data has significantly larger billable size than the source data [for sparse page blobs] #391

Closed
hpaul-osi opened this issue May 17, 2019 · 9 comments

@hpaul-osi

Which version of AzCopy was used?

10.1.1

Which platform are you using? (ex: Windows, Mac, Linux)

Windows

What command did you run?

$source = "https://$sourceStorageAccount.blob.core.windows.net/$sourceContainer$sourceSas"
$destination = "https://$backupStorageAccount.blob.core.windows.net/$backupContainer$backupSas"
.\azcopy.exe copy $source $destination --recursive

What problem was encountered?

After performing a container-to-container copy of a container with page blobs, the destination container is orders of magnitude larger than the source according to billing metrics. More details in the next section, but the same issue applies to account-level service-to-service copies when the source account contains page blobs.

How can we reproduce the problem in the simplest way?

Create a storage account and container with several 4 MB page blobs with a small amount of data in them. Copy the container to another container with the azcopy copy command as specified under the "What command did you run?" section.

A naive size calculation like the one described in "Calculate the size of a Blob storage container" shows the same value for both source and destination containers (n x 4 MB, where n is the number of page blobs in the container).

However, the more precise calculation outlined in "Calculate the total billing size of a blob container" shows the destination as n x 4 MB, whereas the source reflects actual utilization. As a more concrete example, for one of the containers with 22 page blobs, the source container is 0.21 MB and the destination container is 88 MB. At the storage account level, where there are hundreds of these containers, this can result in a source storage account of ~100-200 MB ballooning to ~100-200 GB at the destination. Note: the "Used capacity" metric under the storage account's "Metrics" blade in the portal agrees with this more precise calculation.
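
For reference, here is a rough sketch of the comparison I'm making, assuming the Az.Storage PowerShell module; $storageAccount, $accountKey and $container are placeholders, and the exact page-range API may vary slightly between module versions:

# Naive size sums each blob's provisioned Length; billable size only counts non-empty page ranges.
$ctx   = New-AzStorageContext -StorageAccountName $storageAccount -StorageAccountKey $accountKey
$blobs = Get-AzStorageBlob -Container $container -Context $ctx

$naiveBytes    = ($blobs | Measure-Object -Property Length -Sum).Sum
$billableBytes = 0
foreach ($blob in $blobs) {
    if ($blob.BlobType -eq 'PageBlob') {
        # Only the page ranges that actually contain data count towards billing
        foreach ($range in $blob.ICloudBlob.GetPageRanges()) {
            $billableBytes += $range.EndOffset - $range.StartOffset + 1
        }
    }
    else {
        $billableBytes += $blob.Length
    }
}
"Naive: {0:N2} MB, billable (approx): {1:N2} MB" -f ($naiveBytes / 1MB), ($billableBytes / 1MB)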

Given the discrepancy, I think AzCopy is writing the copied blobs in such a way that the whole provisioned blob size is consumed rather than just the pages that contain data.

Have you found a mitigation/solution?

Not yet. I will update this thread if I do.

@JohnRusk
Member

@hpaul-osi Thanks for your detailed description. I suspect your diagnosis of the issue is correct, but I'll need to check with some colleagues to be sure.

It's a tricky one to deal with, because when copying between accounts (or containers) AzCopy v10 doesn't actually see the content of the blobs. We just tell the destination to pull it directly from the source. That's great for throughput, but means that AzCopy itself can't see if a particular block is actually all zeros and so should not be copied. We'll give it some thought!

In the meantime, the only workaround that I can think of in AzCopy v10 is to download the blobs to a file system (i.e. disk) and then upload them. That's a bit of a nuisance, because it's a two-step process, and it becomes difficult for very large payloads. If you need to try it, remember to put --blob-type PageBlob on the upload command line, otherwise they'll end up as block blobs!
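
For example, reusing the $source and $destination variables from your original command (D:\staging here is just a placeholder staging path):

# Step 1: pull the container down to a local (or Azure VM) disk
.\azcopy.exe copy $source "D:\staging" --recursive

# Step 2: push it back up, forcing page blobs so they don't end up as block blobs
.\azcopy.exe copy "D:\staging\$sourceContainer" $destination --recursive --blob-type PageBlob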

@JohnRusk changed the title from "Destination data has significantly larger billable size than the source data." to "Destination data has significantly larger billable size than the source data [for sparse page blobs]" on May 17, 2019
@hpaul-osi
Author

Thank you for the quick response and the caution regarding the blob type. Unfortunately, our main driver for leveraging AzCopy for this was the ability to replicate data from one storage account to another directly without ever having to handle it on-prem. Pulling and pushing will make the process too slow with our expected volumes.

Since AzCopy is capable of doing this in two steps, hopefully a future version will be able to handle it in one. I can see where the current behavior is advantageous since AzCopy doesn't need to know anything about the payload. Alternatively, it would also be great for throughput if AzCopy only had to send a small fraction of the data. I understand that this may not make sense for a default behavior, but an option to handle sparse blobs could save a lot on both bandwidth for the transfer and storage on the destination.

@JohnRusk
Member

Yes, I totally agree with your points.

BTW, if you do want/need to try the two-stage copy, there's no need to move it through an on-prem disk. Instead, you can use an Azure VM for that purpose. You'd need to pick one with sufficient network bandwidth, and ensure you have enough disk space to hold the full size of the blobs (i.e. the size as you currently see it at your destination). I'm not sure whether the optimal approach would be to use the temp drive or managed disks, but I can find out for you if you want. I do know that one of the new big managed disks (8, 16 or 32 TB) should work fine. You should end up seeing the "big" size only on the VM disk, and the destination size should equal the source size. I don't know whether this is a good option for you, but I'm mentioning it just in case.

Finally, I have just heard back from a senior colleague. He has suggested a way that, in a future release, AzCopy could detect which pages actually contain content, and only move them. I'll put that work on our backlog now, but can't give you any ETA for a release at this stage.

@hpaul-osi
Author

Good point re the Azure-hosted VM option. It doesn't address having to push and pull, but it definitely helps keep things isolated without much additional effort. I'll investigate this as a workaround and keep notifications active on this thread in case there is any good news in the future.

@hpaul-osi
Author

As a follow-up, I was able to push data directly from one storage account to another without this issue by using the az CLI's storage blob copy start-batch command.

For completeness, I'll note that I did test the proposed AzCopy push-pull workaround with PageBlob specified as the --blob-type argument, and it improved the destination storage usage, but not nearly as much as I'd hoped. The container with a 0.21 MB source that originally ballooned to 88 MB came out at 36 MB with this method. Specifying a minimal block size (--block-size-mb=1) on the push dropped this down further to 9 MB.
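
For reference, the batch copy looked roughly like this (parameter names as I recall them from az storage blob copy start-batch --help; $sourceSasToken and $backupSasToken are placeholder SAS tokens). Note that start-batch only schedules server-side copies, so completion has to be verified separately:

# Server-side batch copy from the source container to the backup container
az storage blob copy start-batch `
    --source-account-name $sourceStorageAccount `
    --source-container $sourceContainer `
    --source-sas $sourceSasToken `
    --account-name $backupStorageAccount `
    --destination-container $backupContainer `
    --sas-token $backupSasToken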

@JohnRusk
Member

Thanks Harry. Glad to know you have a solution, and thanks for the additional info re the push-pull workaround. That's useful information about what we'll need to look out for when we implement a proper solution in AzCopy.

@JohnRusk
Member

We'll be working on this in (at least) two parts, with the first part to be released in version 10.3.

@JohnRusk
Member

JohnRusk commented Oct 6, 2019

Specifying a minimal block size (--block-size-mb=1) on the push dropped this down further to 9 MB

For completeness and future reference: in version 10.3 and later, it's possible to set block sizes smaller than 1 MB. E.g. I just did a test with --block-size-mb 0.125 (= 128 KiB). You just need to use a spreadsheet or something to find the exact decimal representation of the size you want, e.g. I computed (128*1024)/(1024*1024) = 0.125.

This can result in AzCopy finding more blocks that are all zeros and therefore don't need to be uploaded. However, it also results in more I/O operations against storage (since each operation is smaller) and so may get throttled by IOPS limits.

These comments about block sizes apply to uploads, downloads, and service-to-service copies.
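
For example, doing that calculation inline (using the $source and $destination variables from the original report; requires AzCopy 10.3 or later):

# Compute the decimal --block-size-mb value for 128 KiB blocks and pass it to azcopy
$blockSizeMB = (128 * 1024) / (1024 * 1024)   # = 0.125
.\azcopy.exe copy $source $destination --recursive --block-size-mb $blockSizeMB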

@zezha-msft
Contributor

This has been fixed in 10.3.0!

@hpaul-osi We are really sorry for the inconvenience and appreciate your patience.
