In late April, Microsoft released the second version of Azure SQL Data Warehouse that provides significant performance...
improvements over the first version. Officially known as the Compute Optimized Gen2 service tier, the upgraded technology puts Azure SQL DW on par with the performance of cloud data warehouse services from other vendors.
Initially launched for beta testing in 2015 and made generally available the following year, Azure SQL Data Warehouse uses massively parallel processing functionality to run complex analytical queries across large data sets. That's similar to rival cloud data warehouse platforms like Amazon Redshift, Snowflake and IBM Db2 Warehouse on Cloud, formerly known as dashDB for Analytics.
What makes Azure SQL Data Warehouse stand out from those technologies is that it allows users to scale compute and storage resources independently of each other, as well as to pause their data warehouses and resume operations when needed. These scaling and pausing capabilities help companies to better control processing costs because they don't have to pay for resources they're not using. That gives SQL DW an important edge over its competitors.
However, the first incarnation of Azure SQL DW lagged behind the competition when it came to performance, particularly compared to Redshift. Microsoft sought to address the performance discrepancies with the release of the Gen2 tier, which promises to deliver fivefold increases in computing power and query performance and four times more query concurrency than the original service.
The first service is now dubbed Azure SQL Data Warehouse Compute Optimized Gen1 and it is still offered as a lower-end option.
To deliver these improved capabilities, Azure SQL DW Gen2 extends the cloud service's workload management capabilities and includes adaptive caching for its compute engine.
Gen2 enables more data, bigger workloads
The workload management side of the tool uses columnstore storage, the default storage type used by Azure SQL DW. With columnstore storage, table rows are divided into row groups and then compressed using a variety of advanced techniques. This approach not only reduces storage costs, but it can also improve query performance, according to Microsoft.
Prior to the introduction of the Gen2 tier, data warehouses were limited to 240 TB of on-disk compressed data. The Gen1 tier still adheres to that size limitation, but Azure SQL DW Gen2 supports unlimited columnstore data -- a change made possible by the expanded computing power the new version provides.
When configuring the cloud service, users choose the amount of computing power they need based on a number of data warehouse units (DWU). A DWU is a predefined measure of the CPU, memory and I/O resources available to Azure SQL Data Warehouse systems. Gen1 users can choose subscription bundles that range between 100 to 6,000 DWUs. By comparison, Gen2 offers bundles with between 1,000 and 30,000 DWUs.
The higher DWU numbers translate directly into greater power for processing analytical queries. Microsoft first raised the maximum performance level to 18,000 DWUs in a Gen2 preview release last year, but increased it even further in the final version. The company also extended Azure SQL DW's workload management functionality to support up to 128 concurrent queries in Gen2 -- the highest concurrency rate of any cloud data warehouse service currently available, Microsoft claims.
Adaptive caching keeps key data at hand
One of the biggest reasons that Redshift outperformed Azure SQL DW in the past is because of the cluster architecture employed by AWS, specifically with regard to storage. Under the Redshift model, each compute node has its own direct-attached storage, whereas the SQL DW service relies primarily on Azure Blob Storage, a remote cloud-based storage service.
Direct-attached storage has significant advantages over remote storage when it comes to query performance, which is why Microsoft introduced adaptive caching as part of Azure SQL DW Gen2. In the new version, a columnar storage cache resides next to the compute layer. Each compute node maintains its own cache of recently accessed storage segments, which helps to streamline access to Azure Blob Storage data.
The caching layer is made up of solid-state drives (SSDs) that leverage the nonvolatile memory express (NVMe) storage protocol. NVMe represents an important advancement in storage technologies because of its ability to better-utilize SSD throughput compared to traditional storage protocols, which were designed for hard disk drives and, as a result, can't keep up with SSD data loads.
But Microsoft didn't stop with NVMe. The Gen2 tier also incorporates intelligent caching technology that uses metrics on query frequencies and data access patterns to adapt to user workloads. The goal is to know what data needs to be cached and when in order to minimize remote calls to Azure Blob Storage, thereby accelerating data access and query performance.
An altered cloud data warehouse landscape
Azure SQL DW Gen2 represents an important milestone for Microsoft in delivering high-performance data warehousing in the cloud. SQL Data Warehouse already had the advantage of a more flexible delivery model than its rivals, as well as integration with other Microsoft cloud services and tools. In addition, Microsoft has extended the technology's reach to all 33 Azure regions worldwide, with Gen2 currently available in 23 of them after being launched in Australia and France in August.
Even so, it's the new performance capabilities that are causing people to take notice of Azure SQL DW. But before prospective users can draw any real-world conclusions, some serious benchmarking must be done to carefully compare Gen2 with other cloud data warehouse services to determine where the advantage really lies for performance.