Overview
The Windows Azure Blob service provides mechanisms to ensure data integrity at both the application and transport layers. This post details these mechanisms from the service and client perspectives. MD5 checking is optional on both PUT and GET operations; however, it provides a convenient way to ensure data integrity across the network when using HTTP. Because HTTPS already provides transport layer security, additional MD5 checking is redundant when connecting over HTTPS.
To ensure data integrity, the Windows Azure Blob service uses MD5 hashes of the data in a couple of different ways. It is important to understand how these values are calculated, transmitted, stored, and eventually enforced in order to design your application to use them appropriately.
Please note that the Windows Azure Blob service provides a durable storage medium and uses its own integrity checking for stored data. The MD5 values used when interacting with an application are provided for checking the integrity of the data as it is transferred between the application and the service via HTTP. For more information regarding the durability of the storage system, please refer to the Windows Azure Storage Architecture Overview.
The following table shows the Windows Azure Blob service REST APIs and the MD5 checks provided for them:
REST API | Header | Value | Validated By | Notes
Put Blob | x-ms-blob-content-md5 | MD5 value of the blob's bits | Server | Full blob
Put Blob | Content-MD5 | MD5 value of the blob's bits | Server | Full blob. If x-ms-blob-content-md5 is present, Content-MD5 is ignored
Put Block | Content-MD5 | MD5 value of the block's bits | Server | Validated prior to storing the block
Put Page | Content-MD5 | MD5 value of the page's bits | Server | Validated prior to storing the page
Put Block List | x-ms-blob-content-md5 | MD5 value of the blob's bits | Client on subsequent download | Stored as the Content-MD5 blob property, to be downloaded with the blob for client-side checks
Set Blob Properties | x-ms-blob-content-md5 | MD5 value of the blob's bits | Client on subsequent download | Sets the blob Content-MD5 property
Get Blob | Content-MD5 | MD5 value of the blob's bits | Client | Returns the Content-MD5 property if one was stored/set with the blob
Get Blob (range) | Content-MD5 | MD5 value of the bits in the requested range | Client | If the client specifies x-ms-range-get-content-md5: true, the Content-MD5 header is dynamically calculated over the range of bytes requested. This is restricted to range requests of <= 4 MB
Get Blob Properties | Content-MD5 | MD5 value of the blob's bits | Client | Returns the Content-MD5 property if one was stored/set with the blob
Table 1 : REST API MD5 Compatibility
Service Perspective
From the Windows Azure Blob service perspective, the only MD5 values that are explicitly calculated and validated on each transaction are the transport layer (HTTP) MD5 values. MD5 checking is optional on both PUT and GET operations. Note that because HTTPS already provides transport layer security, any additional MD5 checking is redundant when using HTTPS. We will discuss two separate MD5 values, which provide checks at different layers:
- PUT with Content-MD5: When a Content-MD5 header is specified, the storage service calculates an MD5 of the data sent and compares it with the Content-MD5 value sent with the request. If the two hashes do not match, the operation fails with error code 400 (Bad Request). These values are transmitted via the Content-MD5 HTTP header. This validation is available for PutBlob, PutBlock and PutPage. Note that when uploading a block, page, or blob, the service returns the Content-MD5 HTTP header in the response, populated with the MD5 it calculated for the data received.
- PUT with x-ms-blob-content-md5: The application can also set the Content-MD5 property that is stored with a blob. The application passes this value in the x-ms-blob-content-md5 header, and the service stores it as the Content-MD5 value to be returned on subsequent GETs for the blob. This can be set when using PutBlob, PutBlockList or SetBlobProperties. If a user provides this value on upload, all subsequent GET operations return this header with the client-provided value. We introduced the x-ms-blob-content-md5 header for scenarios where the hash of the blob content needs to be specified but the HTTP request content is not fully indicative of the actual blob data, such as in PutBlockList. In a PutBlockList, the Content-MD5 header provides transactional integrity for the message contents (the block list in the request body), while the x-ms-blob-content-md5 header sets the service-side blob property. To reiterate, if an x-ms-blob-content-md5 header is provided on a PutBlob operation, it supersedes the Content-MD5 header; on a PutBlock or PutPage operation it is ignored.
- GET: On a subsequent GET operation, the service populates the Content-MD5 HTTP header if a value was previously stored with the blob via PutBlob, PutBlockList, or SetBlobProperties. For range GETs, an optional x-ms-range-get-content-md5 header can be added to the request. When this header is set to true and specified together with the Range header, the service dynamically calculates an MD5 for the range and returns it in the Content-MD5 header, as long as the range is less than or equal to 4 MB in size. If this header is specified without the Range header, or if it is set to true when the range exceeds 4 MB in size, the service returns status code 400 (Bad Request).
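The PUT and range GET rules above can be summarized in a short sketch. The class and method names below (Md5Rules, ComputeContentMd5, ValidatePut, ValidateRangeGet) are hypothetical and only model the behavior described in this post; they are not part of the service or of the Storage Client Library. The sketch assumes the Content-MD5 value is the Base64 encoding of the 16-byte MD5 digest, and that successful requests return 201 (Created) for a PUT and 206 (Partial Content) for a range GET.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical model of the checks described above; not a real service API.
public static class Md5Rules
{
    // Largest range for which the service calculates an MD5 on a range GET.
    public const long MaxRangeForMd5 = 4L * 1024 * 1024;

    // Base64-encoded MD5 digest, the format carried by the Content-MD5 header.
    public static string ComputeContentMd5(byte[] data)
    {
        using (MD5 md5 = MD5.Create())
        {
            return Convert.ToBase64String(md5.ComputeHash(data));
        }
    }

    // PUT check: a missing header means no validation; a mismatch yields 400.
    public static int ValidatePut(byte[] body, string contentMd5Header)
    {
        if (contentMd5Header == null)
        {
            return 201; // MD5 checking is optional
        }
        return ComputeContentMd5(body) == contentMd5Header ? 201 : 400;
    }

    // Range GET check. rangeGetContentMd5 is true when the request carries
    // x-ms-range-get-content-md5: true; rangeLength is null when no Range
    // header was sent.
    public static int ValidateRangeGet(bool rangeGetContentMd5, long? rangeLength)
    {
        if (!rangeGetContentMd5)
        {
            return rangeLength.HasValue ? 206 : 200;
        }
        if (!rangeLength.HasValue)
        {
            return 400; // header specified without a Range header
        }
        if (rangeLength.Value > MaxRangeForMd5)
        {
            return 400; // range larger than 4 MB
        }
        return 206;
    }
}
```

For example, Md5Rules.ValidatePut(body, Md5Rules.ComputeContentMd5(body)) models a PUT that passes validation, while changing a single byte of body after computing the header models the 400 (Bad Request) case.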
Client Perspective
We have already discussed how the Windows Azure Blob service can protect data at the transport layer via the Content-MD5 HTTP header or HTTPS. In addition to this, the client can store and manually validate MD5 hashes of the blob data at the application layer. The Windows Azure Storage Client Library provides this calculation functionality via the exposed object model and relevant abstractions such as BlobWriteStream.
Storing Application layer MD5 when Uploading Blobs via the Storage Client Library
When using the CloudBlob convenience layer methods, in most cases the library automatically calculates and transmits the application layer MD5 value. However, there are two exceptions to this behavior, which occur when a call to an upload method results in:
- A single PUT operation to the Blob service, which will occur when source data is less than CloudBlobClient.SingleBlobUploadThresholdInBytes.
- A parallel upload (length > CloudBlobClient.SingleBlobUploadThresholdInBytes and CloudBlobClient.ParallelOperationThreadCount > 1).
In both of the above cases, an MD5 value is not passed in to be checked, so if the client requires data integrity checking in these scenarios, it needs to use HTTPS. (HTTPS can be enabled when constructing a CloudStorageAccount via the constructor, or by specifying HTTPS as part of the baseAddress when manually constructing a CloudBlobClient.)
All other blob upload operations from the convenience layer in the SDK send MD5 values that are checked by the Blob service.
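The two exceptions above can be expressed as a small predicate. This is only a sketch of the rule as stated in this post; UploadMd5Policy and IsMd5Sent are hypothetical names, not members of the Storage Client Library, and the parameters simply mirror CloudBlobClient.SingleBlobUploadThresholdInBytes and CloudBlobClient.ParallelOperationThreadCount.

```csharp
using System;

// Hypothetical helper mirroring the two exceptions listed above.
public static class UploadMd5Policy
{
    public static bool IsMd5Sent(
        long length,
        long singleBlobUploadThresholdInBytes,
        int parallelOperationThreadCount)
    {
        // Exception 1: the source is small enough to go up as a single PUT.
        bool singlePut = length < singleBlobUploadThresholdInBytes;

        // Exception 2: the source is uploaded as blocks on several threads.
        bool parallelUpload = length > singleBlobUploadThresholdInBytes
            && parallelOperationThreadCount > 1;

        // The text does not say what happens when length equals the threshold;
        // this sketch applies the two stated conditions literally.
        return !singlePut && !parallelUpload;
    }
}
```

With a 4 MB threshold, an 8 MB upload on one thread gives IsMd5Sent(...) == true, while a 1 MB upload, or an 8 MB upload on four threads, gives false and should therefore go over HTTPS if integrity checking is required.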
In addition to the exposed object methods, you can also provide the x-ms-blob-content-md5 header via the Protocol layer on a PutBlob or PutBlockList request.
The table below lists the convenience functions used to upload blobs, which ones support sending MD5 checks, and when those checks are sent.
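Layer | Method | Notes
Convenience | CloudBlob.OpenWrite | MD5 is sent. Note that this function is not currently supported for PageBlob
Convenience | CloudBlob.UploadByteArray | MD5 is sent if neither of the two exceptions listed above applies
Convenience | CloudBlob.UploadFile | MD5 is sent if neither of the two exceptions listed above applies
Convenience | CloudBlob.UploadText | MD5 is sent if neither of the two exceptions listed above applies
Convenience | CloudBlob.UploadFromStream | MD5 is sent if neither of the two exceptions listed above applies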
Table 2 : Blob upload methods MD5 compatibility
Validating Application Layer MD5 when downloading Blobs via the Storage Client Library
The CloudBlob download methods do not provide application layer MD5 validation; as such, it is up to the application to verify the Content-MD5 returned against the data returned by the service. If the application layer MD5 value was specified on upload, the Windows Azure Storage Client Library populates it in CloudBlob.Properties.ContentMD5 on any download (i.e. DownloadText, DownloadByteArray, DownloadToFile, DownloadToStream, and OpenRead).
The example below shows how a client can validate the blob's MD5 hash once all the data is retrieved.
Example
    // Initialization
    string blobName = "md5test" + Guid.NewGuid().ToString();
    long blobSize = 8 * 1024 * 1024;
    StorageCredentialsAccountAndKey creds = new StorageCredentialsAccountAndKey(AccountName, AccountKey);
    CloudStorageAccount account = new CloudStorageAccount(creds, false);
    CloudBlobClient bClient = account.CreateCloudBlobClient();

    // Set CloudBlobClient.SingleBlobUploadThresholdInBytes, all blobs above this
    // length will be uploaded using blocks
    bClient.SingleBlobUploadThresholdInBytes = 4 * 1024 * 1024;

    // Create Blob Container
    CloudBlobContainer container = bClient.GetContainerReference("md5blobcontainer");
    Console.WriteLine("Validating the Container");
    container.CreateIfNotExist();

    // Populate Blob Data
    byte[] blobData = new byte[blobSize];
    Random rand = new Random();
    rand.NextBytes(blobData);
    MemoryStream retStream = new MemoryStream(blobData);

    // Upload Blob
    CloudBlob blobRef = container.GetBlobReference(blobName);

    // Any upload method will work here: byte array, file, text, stream
    blobRef.UploadByteArray(blobData);

    // Download will re-populate the client MD5 value from the server
    byte[] retrievedBuffer = blobRef.DownloadByteArray();

    // Validate MD5 Value
    var md5Check = System.Security.Cryptography.MD5.Create();
    md5Check.TransformBlock(retrievedBuffer, 0, retrievedBuffer.Length, null, 0);
    md5Check.TransformFinalBlock(new byte[0], 0, 0);

    // Get Hash Value
    byte[] hashBytes = md5Check.Hash;
    string hashVal = Convert.ToBase64String(hashBytes);

    if (hashVal != blobRef.Properties.ContentMD5)
    {
        throw new InvalidDataException("MD5 Mismatch, Data is corrupted!");
    }
Figure 1: Validating a Blob's MD5 value
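The example above holds the whole blob in memory before hashing it. For larger blobs, the same check can be done incrementally while reading from a stream. The following is a minimal sketch under stated assumptions: StreamMd5Validator and its methods are hypothetical helpers, the stream stands in for whatever the application reads the downloaded data from, and expectedContentMd5 stands in for the value the application obtained from CloudBlob.Properties.ContentMD5.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper: hashes a stream in chunks and compares the result
// with a Base64-encoded Content-MD5 value.
public static class StreamMd5Validator
{
    public static string ComputeBase64Md5(Stream source, int bufferSize)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] buffer = new byte[bufferSize];
            int read;
            // Feed each chunk into the running hash as it is read.
            while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
            {
                md5.TransformBlock(buffer, 0, read, null, 0);
            }
            // Finalize the hash with an empty last block.
            md5.TransformFinalBlock(new byte[0], 0, 0);
            return Convert.ToBase64String(md5.Hash);
        }
    }

    public static void Validate(Stream source, string expectedContentMd5)
    {
        if (string.IsNullOrEmpty(expectedContentMd5))
        {
            // No MD5 was stored with the blob, so there is nothing to compare.
            throw new InvalidOperationException("No Content-MD5 value is available.");
        }
        string actual = ComputeBase64Md5(source, 64 * 1024);
        if (actual != expectedContentMd5)
        {
            throw new InvalidDataException("MD5 Mismatch, Data is corrupted!");
        }
    }
}
```

Treating a missing Content-MD5 as an error, rather than silently skipping the check, is a design choice of this sketch; an application that accepts blobs uploaded without an MD5 may prefer to skip validation in that case.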
A note about Page Blobs
Page blobs are designed to provide a durable storage medium that can perform a high rate of IO. Data can be accessed in 512-byte pages, allowing a high rate of non-contiguous transactions to complete efficiently. If HTTP needs to be used with MD5 checks, the application should pass in the Content-MD5 header on PutPage, and then use the x-ms-range-get-content-md5 header on each subsequent GetBlob, using ranges less than or equal to 4 MB.
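Because the service only calculates a range MD5 for ranges of 4 MB or less, reading a larger page blob with MD5 checks means splitting it into several range GETs. The sketch below shows one way to plan those ranges; PageRangePlanner and PlanRanges are hypothetical names, and each returned pair is an (offset, count) that the application would pass to its own range GET.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that splits a page blob into ranges small enough
// for x-ms-range-get-content-md5.
public static class PageRangePlanner
{
    public const int PageSize = 512;
    public const long MaxMd5RangeBytes = 4L * 1024 * 1024;

    // Returns (offset, count) pairs that together cover [0, blobSize).
    public static List<KeyValuePair<long, long>> PlanRanges(long blobSize, long rangeSize)
    {
        if (blobSize < 0 || blobSize % PageSize != 0)
        {
            throw new ArgumentException("Page blob size must be a multiple of 512.", "blobSize");
        }
        if (rangeSize <= 0 || rangeSize > MaxMd5RangeBytes)
        {
            throw new ArgumentException("Range size must be between 1 byte and 4 MB.", "rangeSize");
        }

        var ranges = new List<KeyValuePair<long, long>>();
        for (long offset = 0; offset < blobSize; offset += rangeSize)
        {
            // The last range may be shorter than rangeSize.
            long count = Math.Min(rangeSize, blobSize - offset);
            ranges.Add(new KeyValuePair<long, long>(offset, count));
        }
        return ranges;
    }
}
```

For an 8 MB page blob and a 3 MB range size, this yields three ranges: two of 3 MB and a final one of 2 MB.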
Considerations
Currently the convenience layer of the Windows Azure Storage Client Library does not support passing in MD5 values for PageBlobs, nor returning Content-MD5 for getting PageBlob ranges. As such, if your scenario requires data integrity checking at the transport level it is recommended that you use HTTPS or utilize the Protocol Layer and add the additional Content-MD5 header.
In the following example we will show how to perform page blob range GETs with an optional x-ms-range-get-content-md5 via the protocol layer in order to provide transport layer security over HTTP.
Example
    // Initialization
    string blobName = "md5test" + Guid.NewGuid().ToString();
    long blobSize = 8 * 1024 * 1024; // Must be divisible by 512
    int writeSize = 1 * 1024 * 1024;
    StorageCredentialsAccountAndKey creds = new StorageCredentialsAccountAndKey(AccountName, AccountKey);
    CloudStorageAccount account = new CloudStorageAccount(creds, false);
    CloudBlobClient bClient = account.CreateCloudBlobClient();
    bClient.ParallelOperationThreadCount = 1;

    // Create Blob Container
    CloudBlobContainer container = bClient.GetContainerReference("md5blobcontainer");
    Console.WriteLine("Validating the Container");
    container.CreateIfNotExist();

    int uploadedBytes = 0;

    // Upload Blob
    CloudPageBlob blobRef = container.GetBlobReference(blobName).ToPageBlob;
    blobRef.Create(blobSize);

    // Populate Blob Data
    byte[] blobData = new byte[writeSize];
    Random rand = new Random();
    rand.NextBytes(blobData);
    MemoryStream retStream = new MemoryStream(blobData);

    // Write the same buffer repeatedly until the whole blob is filled
    while (uploadedBytes < blobSize)
    {
        blobRef.WritePages(retStream, uploadedBytes);
        uploadedBytes += writeSize;
        retStream.Position = 0;
    }

    // Build a range GET for 3 MB starting at offset 1 MB
    HttpWebRequest webRequest = BlobRequest.Get(
        blobRef.Uri,      // URI
        90,               // Timeout
        null,             // Snapshot (optional)
        1024 * 1024,      // Start Offset
        3 * 1024 * 1024,  // Count
        null);            // Lease ID (optional)

    // Ask the service to calculate an MD5 over the requested range
    webRequest.Headers.Add("x-ms-range-get-content-md5", "true");
    bClient.Credentials.SignRequest(webRequest);
    WebResponse resp = webRequest.GetResponse();

    // The MD5 calculated by the service for the range is returned in Content-MD5
    string rangeMd5 = resp.Headers["Content-MD5"];
Figure 2: Transport Layer security via optional x-ms-range-get-content-md5 header on a PageBlob
Summary
This article has detailed various strategies for using MD5 values to provide data integrity. As in many cases, the correct solution depends on your specific scenario.
We will be evaluating this topic in future releases of the Windows Azure Storage Client Library as we continue to improve the functionality offered. Please leave comments below if you have questions.
Joe Giardino
Links
- Storage Client Library Documentation - http://msdn.microsoft.com/en-us/library/dd179380.aspx
- Windows Azure SDK - http://msdn.microsoft.com/en-us/windowsazure/cc974146.aspx
- Windows Azure Storage Services REST API Reference - http://msdn.microsoft.com/en-us/library/dd179355.aspx