Building Big Data on Azure Table Storage

When I accepted the position of Director of Business Intelligence at 3xLOGIC in 2013, I was immediately tasked with creating a cloud-based exception reporting platform for the retail industry.  The hardest part about taking on that challenge was that the company was operating on a shoestring budget at the time, and there was almost no help available to me in the way of development resources.  I was going to have to build the entire app - front-end, back-end, database, cloud processes - almost entirely on my own.

Because of these constraints I knew I had to build a system that was extremely simple and didn't require day-to-day maintenance to keep running.  The platform I ended up creating (now called Vigil Trends) is actively serving over 2,000 retail locations and processes over 30 million unique pieces of data on a daily basis - with relative ease!

I attribute most of this success not to how intelligent I am (or am not), but rather to choosing the right tools for the job at hand.  In this article I'm going to explain how I used Azure Table Storage to house the 30 million pieces of data that flow into the cloud platform every day.

Microsoft Azure was in its early days when I started using it in 2013, and the number of options available then was far smaller than what it has grown to today.  I needed a database solution that was affordable, scaled well, and didn't take too long to learn.  I came from a SQL Server background, which is a great database in its own right, but I knew that, in the end, it was not a great choice for actual "big data" - the cost of storing this volume of data in SQL Server running on Azure would have sunk the project in short order.  I needed a different data solution if this system was going to last in the long run.

It was then that I decided to look into Azure Table Storage.  Table Storage is well suited to big data projects.  It's a semi-structured, schemaless data store that offered me three major advantages:

  1. It is very affordable - It prices out at around $0.07 per GB, with price breaks at high volume that can take it all the way below $0.05 per GB.
  2. It scales extremely well - With the way Table Storage is built and delivered, as a developer I don't really have to worry about how or where my data are being stored; it's all managed "behind the scenes" by Microsoft.  I never have to add more storage or pay for additional services - it works exactly the same at 2 GB or 20,000 GB.
  3. It's remarkably robust - Perhaps my favorite feature of Table Storage is that, because of the way the data are partitioned, my query times are very predictable.  Whether I'm querying against 10 rows of data or 10,000, the query execution time is almost identical (provided you have intelligently partitioned the table you are querying).
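To make the rest of the article more concrete, here is a minimal sketch of what working with Table Storage looks like.  The snippets in this article use the Python azure-data-tables SDK purely for illustration; the table, keys, and property names are hypothetical examples, not the actual Vigil Trends schema.

    import os
    from azure.data.tables import TableServiceClient

    # Assumes the storage account connection string is available in the environment.
    service = TableServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"])
    table = service.create_table_if_not_exists("Receipts")

    # Every entity needs a PartitionKey and a RowKey.  Beyond that, the "schema"
    # is simply whatever properties you attach to each row - two rows in the same
    # table can carry completely different properties.
    table.create_entity({
        "PartitionKey": "customer-1001",
        "RowKey": "receipt-000123",
        "Total": 42.17,
        "CashierID": "c-88",
    })
    table.create_entity({
        "PartitionKey": "customer-1001",
        "RowKey": "receipt-000124",
        "Total": 9.50,
        "Voided": True,  # this property doesn't exist on the row above
    })

    # A point lookup by PartitionKey + RowKey is the fastest query you can run.
    entity = table.get_entity(partition_key="customer-1001", row_key="receipt-000123")
    print(entity["Total"])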

How To Partition Well

RBA Consulting has published a great article on how partitioning works in Azure Table Storage: Designing a Scalable Partitioning Strategy for Azure Table Storage.  However, I have a few points of my own to add.

Each row of data in a table has a PartitionKey and a RowKey.  The way I explain it to people who are new to Table Storage is this - think of the PartitionKey as a zip code and the RowKey as a street address.  If I have both the zip code and the street address of a house, I can find it very quickly.  However, if I have only the street address (e.g. 123 Main Street), then I have to search EVERY zip code to see whether that street address exists in it.  That's very slow.  And if I have only a zip code, I can list all the street addresses contained within it.

It's important to understand this concept because there is a trade-off here: partitions with a small number of rows will require you to run many small queries to bring back a larger data set, while having too many rows in a partition can cause your query times to slow down (due to a full partition scan).  Microsoft recommends keeping fewer than 2,000 rows in a partition, but I've found that storing closer to 6,000 rows in a partition doesn't have much of an adverse effect.
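To put the analogy in code: a query that supplies the PartitionKey stays inside one "zip code", while a filter that omits it forces the service to walk every partition.  A small sketch, continuing with the hypothetical Receipts table from above:

    import os
    from azure.data.tables import TableClient

    table = TableClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"], table_name="Receipts")

    # Fast: the PartitionKey narrows the search to a single partition, so the
    # execution time stays flat no matter how large the table grows.
    in_one_partition = table.query_entities(
        query_filter="PartitionKey eq @pk",
        parameters={"pk": "customer-1001"})

    # Slow: no PartitionKey in the filter, so the service has to scan every
    # partition (the "search EVERY zip code" case).
    across_all_partitions = table.query_entities(
        query_filter="CashierID eq @cashier",
        parameters={"cashier": "c-88"})

    print(sum(1 for _ in in_one_partition))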

Compound Partition Keys

Another concept that helped me achieve consistent query execution times against an ever-growing mountain of data is creating compound partition keys.  All partition (and row) keys in Table Storage are stored as strings, and entities in a table are sorted alphabetically by partition key (and, within a partition, by row key).  So depending on the kind of data you are storing, you can get creative with the key format.  For example, one of the common formats I used when storing receipt data was this:

"CustomerNumber^StoreNumber^Year^Month^Day"

Creating partition keys in this format worked well in my situation because every query run by a user was always scoped to a specific customer, store number, and date.  Knowing that requirement ahead of time allowed me to narrow in on a specific partition key easily and reliably.  Please note that there is nothing special about the "^" character; it just conveniently served as a separator that was unlikely to show up in normal data.
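A small helper along these lines shows how such a key can be built and then queried.  The function and field names are my own illustration, not code from the platform; zero-padding the date parts is one way to keep the alphabetical string ordering in line with chronological order.

    import os
    from azure.data.tables import TableClient

    table = TableClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"], table_name="Receipts")

    def receipt_partition_key(customer, store, year, month, day):
        # Compound key in the CustomerNumber^StoreNumber^Year^Month^Day format.
        return f"{customer}^{store}^{year:04d}^{month:02d}^{day:02d}"

    # Because every user query specifies a customer, store, and date, the exact
    # partition can always be computed up front - no scanning required.
    pk = receipt_partition_key(customer=1001, store=42, year=2016, month=3, day=9)
    # -> "1001^42^2016^03^09"

    receipts_for_day = table.query_entities(
        query_filter="PartitionKey eq @pk", parameters={"pk": pk})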

Storing the Same Data Twice

One last concept that helped me achieve big data success was actually the hardest for me to learn, coming from a relational database world.  It involves storing the same row of data twice, either in the same table or in a separate table.  It is necessitated by the fact that in Table Storage there are no relationships between tables or rows of data - it's a non-relational database.  To work through this "problem" (not really a problem, but a different way of thinking), I got in the habit of storing the same row of data twice - each time with a different partition key.

For example, if I were to store receipt data using the partition key format mentioned in the previous section, I might also want to store the same data in a way that makes it quickly retrievable by cashier ID.  So my duplicate row's partition key might look like this:

"CustomerNumber^CashierID^Year^Month^Date"

You may even store the same row more than twice, depending on how many different ways you plan to query the data.  Again there's a trade-off here, this time between storage costs and query execution times.  Since storage is relatively cheap, it makes a lot of sense to duplicate the data in this way.
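In code, "storing the row twice" is literally just two writes of the same properties under two different partition keys.  Another hypothetical sketch:

    import os
    from azure.data.tables import TableClient

    table = TableClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"], table_name="Receipts")

    receipt = {"RowKey": "receipt-000123", "Total": 42.17,
               "CashierID": "c-88", "StoreNumber": 42}

    # Copy 1: partitioned for customer/store/date queries.
    table.create_entity({**receipt, "PartitionKey": "1001^42^2016^03^09"})

    # Copy 2: identical properties, partitioned for customer/cashier/date queries.
    table.create_entity({**receipt, "PartitionKey": "1001^c-88^2016^03^09"})

    # "Show me everything cashier c-88 rang up on March 9" is now a single,
    # fast partition query instead of a scan across every store partition.
    by_cashier = table.query_entities(
        query_filter="PartitionKey eq @pk",
        parameters={"pk": "1001^c-88^2016^03^09"})

One thing to keep in mind: because the two copies live in different partitions, they cannot be written in a single atomic batch (entity group transactions require a shared partition key), so your writer needs to tolerate one copy landing before the other.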

Conclusion

This article doesn't even scratch the surface of what's possible with Azure Table Storage, nor does it offer a full implementation of the concepts discussed here.  However, I believe the concepts covered in this article are important to grasp and are not widely circulated in tech circles at this time.  If you are interested in learning more about how I use Azure Table Storage to process and store big data in the cloud, please contact me today.