Data integrity and validation

John Rusk [MSFT] edited this page Jul 2, 2020 · 16 revisions

Network error prevention

We encourage the use of HTTPS for all AzCopy traffic. While the primary reason is security, HTTPS also has the side effect of providing another layer of "error detection" on top of TCP's built-in detection. HTTPS has this effect because it's a tamper-resistant protocol. By preventing anyone from deliberately changing your data as it crosses the network, HTTPS also prevents network errors from accidentally changing your data.

MD5 Hashes

AzCopyV10 supports MD5 hashes to validate the integrity of file contents. To opt in to this mechanism, include --put-md5 on the command line when uploading to Azure. NOTE that the actual check does not happen until the uploaded blob is later used (i.e. downloaded) by AzCopy or another MD5-aware tool.

The overall process looks like this:

  1. At upload time, the hash of the original disk file is computed and recorded against the blob. I.e. the hash of the source file is stored against the blob.
  2. At download time, when the file is written to disk, a new hash is computed. This new "download time" hash is compared to the original hash from the time of upload. If they match, that proves that the downloaded file, as written to disk, exactly matches the original file as read at upload time. By default, AzCopy will signal a failure if they don't match. This behavior can be configured with the --check-md5 flag; the default is to check hashes for all blobs that have them (i.e. all blobs that were uploaded with AzCopy's --put-md5, or with another tool that stores MD5s).
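The two-step process above can be sketched in Python. This is purely illustrative of the comparison logic, not AzCopy's actual implementation; the function names here are hypothetical, and the base64 encoding matches how Azure stores the hash in the blob's Content-MD5 property (see the note on hash formats below).

```python
import base64
import hashlib

def compute_md5_b64(path: str) -> str:
    # Hash the file in chunks so large files need not fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    # Azure stores Content-MD5 as the base64 encoding of the raw digest.
    return base64.b64encode(h.digest()).decode("ascii")

# Upload time: the computed hash is recorded against the blob.
# Download time: recompute over the file written to disk and compare.
def verify_download(path: str, stored_hash: str) -> bool:
    return compute_md5_b64(path) == stored_hash
```

A mismatch here would correspond to AzCopy reporting a failed MD5 check at download time.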

Checking lots of blobs

If you have uploaded a large amount of data, and want to check the MD5 hashes without the time and cost of downloading it, AzCopyV10 offers a shortcut that may help. Instead of downloading to your own premises, you can download to an Azure VM, and configure AzCopy to check the hashes but not actually save any data to the VM. Because it's not saving any data, it can run very fast, and you can check as much data as needed without needing to provision any disks.

To check MD5s in this way, use a command line like this:

azcopy copy <urlOfDataToBeChecked> NUL --check-md5 FailIfDifferentOrMissing

Use NUL on Windows and /dev/null on Linux and MacOS. The key points to note in this command line are that the destination is NUL (or /dev/null), so the data never gets saved, and that the --check-md5 flag is set to its strictest setting, which reports a failure for any blob where the hash doesn't match or where the blob has no stored hash. (That's stricter than the default, which does not report the absence of a hash as an error.)

The source, shown above as <urlOfDataToBeChecked>, is just the blob container or Azure Files URL of the data you want to check (do not include the < >).

If you have terabytes of data to check, you should use a relatively large VM (e.g. 16 cores) to maximize throughput. If the VM is in the same region as your Storage account, you won't be charged any data egress fees when running this check.

A note on hash formats

Note that, in Azure, the Content-MD5 blob property (where the hash is stored) is not just the raw bytes out of the MD5 algorithm. Instead, to produce a valid Azure Content-MD5, you must take the raw bytes returned by the MD5 algorithm and base64 encode them. AzCopy does this automatically. This paragraph is just for users who want to do their own hash computations and compare them to those produced by AzCopy.
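For users doing their own computations, the encoding described above can be reproduced in a few lines of Python. This is a minimal sketch, assuming the blob's content is available as bytes; the function name is illustrative.

```python
import base64
import hashlib

def content_md5(data: bytes) -> str:
    # Take the raw 16-byte MD5 digest, then base64 encode it,
    # which is the format Azure expects in the Content-MD5 property.
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

print(content_md5(b"hello"))  # XUFAKrxLKna5cZ2REBfFkg==
```

Note that the common hex representation of an MD5 (32 hex characters) is not a valid Content-MD5 value; the property must be the base64 form shown here.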