Microsoft’s Parallel Data Warehouse, released last year for SQL Server 2008 R2, was designed to compete with products like Oracle Exadata and Teradata. It allows truly massive workloads on a system that scales data across multiple physical machines while still using the Microsoft SQL Server database engine we know and love.
Parallel Data Warehouse, or PDW, is not a software system that you can simply purchase and install on your own hardware. PDW starts as a two-rack configuration. One rack is for management machines, made up of servers functioning as the management nodes, control nodes, a landing zone node and a backup node.
The second rack houses so-called compute nodes, which do the actual data storage. This rack of compute nodes, called a data rack, holds 10 dual-socket servers (the count may vary by vendor and by when you order) and dedicated SAN storage, which is connected to the servers. As you need either more CPU power or more disks to store data, you simply add more data racks to expand the Parallel Data Warehouse.
The control nodes serve a variety of functions; most important, they let you query the Parallel Data Warehouse. You connect to the control nodes, not by using SQL Server Management Studio (SSMS), but with a third-party tool called Nexus Chameleon, which is included with the PDW. This third-party tool is needed because SSMS hasn’t been reworked to connect directly to the control node of the Parallel Data Warehouse.
Running on the control node is a specialized version of SQL Server that takes in Transact-SQL queries and passes them off to the PDW compute nodes for execution. No data from the data warehouse is actually stored on this instance. It instead stores metadata about the database configuration and provides scratch space by way of its tempdb database in case that space is needed during query execution.
In addition to handling your data query requests, the control nodes also handle your data loading process. Data loading can be done a few different ways. The simplest is probably to use Business Intelligence Development Studio (BIDS) with the new SQL Server Integration Services destination connection object, which was created specifically for use with the PDW. The second way is to load the data into what is called the landing zone and then use a new application called DWLOADER to bulk load the data into the Parallel Data Warehouse. DWLOADER works a lot like the bcp utility on a traditional SQL Server instance, loading data quickly into the needed tables.
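For the DWLOADER route, an invocation might look something like the sketch below. The server name, credentials, file path and target table are all placeholders, and the exact flag names can differ between PDW releases, so treat this as illustrative rather than definitive:

```
dwloader.exe -S pdw-control-node -U loaduser -P ********
    -i \\landingzone\loads\sales_20110601.txt
    -T SalesDW.dbo.FactSales
    -t "|" -r "\r\n" -M append
```

As with bcp, you point the tool at a delimited flat file, name the target table and describe the file's field and row terminators; a load mode such as append controls how the new rows are applied.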
Many people think the PDW is like partitioned tables in SQL Server or Oracle; at first the concepts sound similar, but they are actually very different. With partitioned tables you are isolating specific ranges of data down to a single partition, so you need to query only a single partition when you need to query data. The concept behind the Parallel Data Warehouse is the flip side of that. With the PDW, you instead want to push the most current data to every compute node in the system; that way, when you query the data from the system you get the highest number of compute nodes possible working on the data request.
By using all of the compute nodes, you increase the number of CPUs available to you, as well as ratchet up the amount of I/O you can generate at once. This massive addition of resources greatly reduces the time workloads take to complete. Workloads that would take days to process just a week’s worth of data can finish in minutes. That means you can run reports that could never have completed on a traditional SQL Server instance because of the time they would take.
The Parallel Data Warehouse doesn’t do everything well, however. For example, inserts will typically take longer on the PDW than they would on a traditional SQL Server instance. As data is inserted into the Parallel Data Warehouse, every row needs to be examined to determine which compute node the data should be stored on. When tables are created within the Parallel Data Warehouse, you specify whether the data should be replicated in whole to all the compute nodes; typically, a dimension table would be set up this way.
The second table configuration option would be to distribute the data between the compute nodes based on one of the columns within the table. When data is loaded, the column that was specified as the distribution key is hashed, and the hashed value is compared with an internal table to determine which compute node the data should be stored on. As the data is loaded into the system through the control node, all this hashing has to happen; then the rows are transferred from the control node to the compute node for storage.
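In the PDW dialect of Transact-SQL, this choice is made in the WITH clause of CREATE TABLE. The tables and columns below are invented for illustration, but the DISTRIBUTION syntax follows the PDW form:

```sql
-- A small dimension table, copied in full to every compute node
CREATE TABLE dbo.DimProduct
(
    ProductKey  int          NOT NULL,
    ProductName varchar(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);

-- A large fact table, spread across the compute nodes by
-- hashing the values in OrderKey
CREATE TABLE dbo.FactSales
(
    OrderKey    bigint NOT NULL,
    ProductKey  int    NOT NULL,
    SalesAmount money  NOT NULL
)
WITH (DISTRIBUTION = HASH (OrderKey));
```

Picking a distribution column with many distinct, evenly spread values matters here; a skewed column leaves some compute nodes holding far more rows than others.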
Another thing the Parallel Data Warehouse doesn’t always do well is aggregates. Depending on how a query is written, all the data being aggregated may need to be brought from the compute nodes to the control node, stored in the tempdb database and then aggregated once it has been received from every compute node. This can take a long time to complete, depending on how much data is being pulled from the various compute nodes, since that data has to travel over the network within the Parallel Data Warehouse to reach the control node. To some extent this can be avoided, depending on how your tables are designed and how your queries are written. If you intend on doing a lot of aggregates, be sure to thoroughly test all your queries to ensure you have correctly designed your tables to minimize runtimes.
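As a hypothetical illustration, suppose a fact table dbo.FactSales is distributed on its OrderKey column. Grouping on the distribution column can be aggregated locally, since all rows sharing an OrderKey value live on the same compute node; grouping on any other column may force rows to be moved before the final aggregation:

```sql
-- Grouping on the distribution key: each compute node can
-- aggregate its own rows and return only the subtotals
SELECT OrderKey, SUM(SalesAmount) AS TotalSales
FROM dbo.FactSales
GROUP BY OrderKey;

-- Grouping on a non-distribution column: rows for a given
-- ProductKey are scattered across nodes, so data may have to
-- move over the network before the final aggregation
SELECT ProductKey, SUM(SalesAmount) AS TotalSales
FROM dbo.FactSales
GROUP BY ProductKey;
```

This is why the article's advice to test aggregate-heavy queries against your chosen table design matters: the same SUM can be nearly free or very expensive depending on the distribution column.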
The infrastructure of the Parallel Data Warehouse is highly redundant. Almost everything within the PDW comes with a backup built into the rack. There are two management nodes, which function as the domain controllers for the Parallel Data Warehouse (the PDW has its own Active Directory to isolate the machines from any group policy settings that shouldn’t be applied to them) and allow the systems admins to do things like patch all the other machines in the configuration. The control nodes run as an active/passive Windows cluster. The compute nodes are configured as a single cluster with a passive server available so that if a physical machine were to fail, the PDW wouldn’t lose any CPU resources.
The Microsoft Parallel Data Warehouse is an extremely powerful tool that companies can use to their advantage. However, it can’t be thought of as just another SQL Server instance; it has some very specific configurations and some very specific data loading procedures. These tools and requirements must be understood before a PDW is put into production. Once it is, you can throw every report idea that you’ve ever been able to dream up at the thing and quickly get a result.
ABOUT THE AUTHOR
Denny Cherry has more than a decade of experience managing SQL Server, including MySpace.com's over 175-million-user installation. Cherry specializes in system architecture, performance tuning, replication and troubleshooting. Check out his blog at SQL Server with Mr. Denny.