
Destination data has significantly larger billable size than the source data [for sparse page blobs] #391

Closed
hpaul-osi opened this issue May 17, 2019 · 9 comments

@hpaul-osi

Which version of AzCopy was used?

10.1.1

Which platform are you using? (ex: Windows, Mac, Linux)

Windows

What command did you run?

$source = "https://$sourceStorageAccount.blob.core.windows.net/$sourceContainer$sourceSas"
$destination = "https://$backupStorageAccount.blob.core.windows.net/$backupContainer$backupSas"
.\azcopy.exe copy $source $destination --recursive

What problem was encountered?

After performing a container-to-container copy of a container with page blobs, the destination container is orders of magnitude larger than the source according to billing metrics. More details in the next section, but the same issue applies to account-level service-to-service copies when the source account contains page blobs.

How can we reproduce the problem in the simplest way?

Create a storage account and container with several 4 MB page blobs with a small amount of data in them. Copy the container to another container with the azcopy copy command as specified under the "What command did you run?" section.

A naive size calculation like the one described in "Calculate the size of a Blob storage container" shows the same value for both source and destination containers (n x 4 MB, where n is the number of page blobs in the container).

However, the more precise calculation outlined in "Calculate the total billing size of a blob container" shows the destination as n x 4 MB, whereas the source reflects actual utilization. As a more concrete example, for one of the containers with 22 page blobs, the source container is 0.21 MB and the destination container is 88 MB. At the storage account level, where there are hundreds of these containers, this can result in a source storage account of ~100-200 MB ballooning to ~100-200 GB at the destination. Note: the "Used capacity" metric under the storage account's "Metrics" blade in the portal agrees with this more precise calculation.
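
For reference, here is a rough sketch of the comparison I'm making, assuming the Az.Storage PowerShell module; $storageAccount, $accountKey and $container are placeholders, and the exact page-range API may vary slightly between module versions:

# Naive size sums each blob's provisioned Length; billable size only counts non-empty page ranges.
$ctx   = New-AzStorageContext -StorageAccountName $storageAccount -StorageAccountKey $accountKey
$blobs = Get-AzStorageBlob -Container $container -Context $ctx

$naiveBytes    = ($blobs | Measure-Object -Property Length -Sum).Sum
$billableBytes = 0
foreach ($blob in $blobs) {
    if ($blob.BlobType -eq 'PageBlob') {
        # Only the page ranges that actually contain data count towards billing
        foreach ($range in $blob.ICloudBlob.GetPageRanges()) {
            $billableBytes += $range.EndOffset - $range.StartOffset + 1
        }
    }
    else {
        $billableBytes += $blob.Length
    }
}
"Naive: {0:N2} MB, billable (approx): {1:N2} MB" -f ($naiveBytes / 1MB), ($billableBytes / 1MB)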

Given the discrepancy, I think AzCopy is writing the copied blobs in such a way that the whole provisioned blob size is consumed rather than just the pages that contain data.

Have you found a mitigation/solution?

Not yet. I will update this thread if I do.

@JohnRusk
Member

@hpaul-osi Thanks for your detailed description. I suspect your diagnosis of the issue is correct, but I'll need to check with some colleagues to be sure.

It's a tricky one to deal with, because when copying between accounts (or containers) AzCopy v10 doesn't actually see the content of the blobs. We just tell the destination to pull it directly from the source. That's great for throughput, but means that AzCopy itself can't see if a particular block is actually all zeros and so should not be copied. We'll give it some thought!

In the meantime, the only workaround that I can think of in AzCopy v10 is to download the blobs to a file system (i.e. disk) and then upload them. That's a bit of a nuisance, because it's a two-step process, and it becomes difficult for very large payloads. If you need to try it, remember to put --blob-type PageBlob on the upload command line, otherwise they'll end up as block blobs!
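
For example, reusing the $source and $destination variables from your original command (D:\staging here is just a placeholder staging path):

# Step 1: pull the container down to a local (or Azure VM) disk
.\azcopy.exe copy $source "D:\staging" --recursive

# Step 2: push it back up, forcing page blobs so they don't end up as block blobs
.\azcopy.exe copy "D:\staging\$sourceContainer" $destination --recursive --blob-type PageBlob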

@JohnRusk changed the title from "Destination data has significantly larger billable size than the source data." to "Destination data has significantly larger billable size than the source data [for sparse page blobs]" on May 17, 2019
@hpaul-osi
Author

Thank you for the quick response and the caution regarding the blob type. Unfortunately, our main driver for leveraging AzCopy for this was the ability to replicate data from one storage account to another directly without ever having to handle it on-prem. Pulling and pushing will make the process too slow with our expected volumes.

Since AzCopy is capable of doing this in two steps, hopefully a future version will be able to handle it in one. I can see where the current behavior is advantageous since AzCopy doesn't need to know anything about the payload. Alternatively, it would also be great for throughput if AzCopy only had to send a small fraction of the data. I understand that this may not make sense for a default behavior, but an option to handle sparse blobs could save a lot on both bandwidth for the transfer and storage on the destination.

@JohnRusk
Member

Yes, I totally agree with your points.

BTW, if you do want/need to try the two-stage copy, there's no need to move it through an on-prem disk. Instead, you can use an Azure VM for that purpose. You'd need to pick one with sufficient network bandwidth, and ensure you have enough disk space to hold the full size of the blobs (i.e. the size as you currently see it at your destination). I'm not sure whether the optimal approach would be to use the temp drive or managed disks, but I can find out for you if you want. I do know that one of the new big managed disks (8, 16 or 32 TB) should work fine. You should end up seeing the "big" size only on the VM disk, and the destination size should equal the source size. I don't know whether this is a good option for you, but I'm mentioning it just in case.

Finally, I have just heard back from a senior colleague. He has suggested a way that, in a future release, AzCopy could detect which pages actually contain content, and only move them. I'll put that work on our backlog now, but can't give you any ETA for a release at this stage.

@hpaul-osi
Author

Good point re the Azure-hosted VM option. It doesn't address having to push and pull, but it definitely helps keep things isolated without much additional effort. I'll investigate this as a workaround and keep notifications active on this thread in case there is any good news in the future.

@hpaul-osi
Author

As a follow-up, I was able to push data directly from one storage account to another without this issue by using the az CLI's storage blob copy start-batch command.

For completeness, I'll note that I did test the proposed AzCopy push-pull workaround with PageBlob specified as the --blob-type argument, and it improved the destination storage usage, but not nearly as much as I'd hoped. The container with a 0.21 MB source that originally ballooned to 88 MB came out at 36 MB with this method. Specifying a minimal block size (--block-size-mb=1) on the push dropped this down further to 9 MB.
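
For reference, the batch copy looked roughly like this (parameter names as I recall them from az storage blob copy start-batch --help; $sourceSasToken and $backupSasToken are placeholder SAS tokens). Note that start-batch only schedules server-side copies, so completion has to be verified separately:

# Server-side batch copy from the source container to the backup container
az storage blob copy start-batch `
    --source-account-name $sourceStorageAccount `
    --source-container $sourceContainer `
    --source-sas $sourceSasToken `
    --account-name $backupStorageAccount `
    --destination-container $backupContainer `
    --sas-token $backupSasToken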

@JohnRusk
Member

Thanks Harry. Glad to know you have a solution, and thanks for the additional info re the push-pull workaround. That's useful information about what we'll need to look out for when we implement a proper solution in AzCopy.

@JohnRusk
Member

We'll be working on this in (at least) two parts, with the first part to be released in version 10.3.

@JohnRusk
Member

JohnRusk commented Oct 6, 2019

Specifying a minimal block size (--block-size-mb=1) on the push dropped this down further to 9 MB

For completeness and future reference: in version 10.3 and later, it's possible to set block sizes smaller than 1 MB. E.g. I just did a test with --block-size-mb 0.125 (= 128 KiB). You just need to use a spreadsheet or something to find the exact decimal representation of the size you want, e.g. I computed (128*1024)/(1024*1024) = 0.125.

This can result in AzCopy finding more blocks that are all zeros and therefore don't need to be uploaded. However, it also results in more I/O operations against storage (since each operation is smaller) and so may get throttled by IOPS limits.

These comments about block sizes apply to uploads, downloads, and service-to-service copies.
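
For example, doing that calculation inline (using the $source and $destination variables from the original report; requires AzCopy 10.3 or later):

# Compute the decimal --block-size-mb value for 128 KiB blocks and pass it to azcopy
$blockSizeMB = (128 * 1024) / (1024 * 1024)   # = 0.125
.\azcopy.exe copy $source $destination --recursive --block-size-mb $blockSizeMB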

@zezha-msft
Contributor

This has been fixed in 10.3.0!

@hpaul-osi We are really sorry for the inconvenience and appreciate your patience.
