In October 2017 at the dataMinds conference (former SQL User Days) I gave a presentation about Data Lake Analytics and one of my slides was the following:
This picture was taken and I found that it was quite a nice topic to write something about. So here we are!
In Azure there are 3 different big data stores that you can use and in this blog post I want to explain the differences between them.
Azure Cosmos DB (former Document DB) is a multi-model database or NoSQL database. What does that mean and what are some features of Azure Cosmos DB?
Azure SQL Data Warehouse is the easiest to explain if you are already using SQL Server. It’s the same as an ordinary SQL Server database but developed for Petabyte scaled databases. By making use of Massively Parallel Processing it can quickly run complex queries across petabytes of data.
It stores relational tables with columnar storage. That way it can significantly reduce the data storage cost and improves query performance. Compared with traditional database systems, analysis queries finish in seconds instead of minutes, or hours instead of days.
More info can be found at: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/
What is a data lake? One of my colleagues explained it quite nicely. If you take a bottle of water, it’s all cleansed and ready to drink. That water can be poured back into a lake or a river, and nothing will happen. Now if we do that the other way around and we fill our bottle with water from the lake, the bottle might be contaminated with trash or diseases.
So in a data lake we can store any kind of data, unprocessed (river water) and processed (bottled water). And the nicest thing is that it does not have to be structured. Relational data in any sort is not necessary. You can add images, music, documents, databases, text files, … No size limit to files or total storage (That is not possible with Azure Blob Storage). That are the storage capabilities of Data Lake. But an extra resource that is available is Data Lake Analytics that has a query language called “U-SQL”. This so called U-SQL gives us the ability to query text-files but also documents (docx, pdf, …) and images.
How nice would it be if you have a collection of photos and you would want to see only the photos that have a horse on it. Or only with people who are smiling. Or what about executing Python or R code on data that is just there. It’s all possible.
When do we use what? If you need a multi-model database go to Cosmos DB. If you need a relational database make use of SQL Data Warehouse, but if your data exists of all kind of different types of files then make use of Data Lake to clean and transform it, so you can store it in one of the others if necessary.