One of the more intriguing features included with SQL Server 2016 is PolyBase, a transparent access layer that facilitates connectivity between the database engine and external data sources containing unstructured or semi-structured data. PolyBase is optimized for data warehouse workloads and analytical query processing, making it easier than ever to merge big data into the SQL Server universe.
PolyBase is not new. It has been part of the Microsoft Analytics Platform System since it was still called Parallel Data Warehouse. PolyBase seamlessly bridges SQL Server with external data sources, allowing you to use familiar T-SQL constructs to create connections and query the data, just like you can traditional database tables.
Introducing PolyBase in SQL Server 2016
With PolyBase, you can access data in Hadoop clusters or Azure Blob Storage, using the same SQL Server features you use with relational data, whether creating statistics, implementing columnstore technologies, leveraging business intelligence (BI) tools or using other SQL Server components. You can even create queries that join semi-structured data with SQL Server relational data sets.
PolyBase is built into SQL Server 2016 and does not require special software or tools, nor does it require an understanding of Java, Map/Reduce, Hive or other Hadoop-related concepts. You can leverage existing SQL Server tools and use your T-SQL know-how to explore unstructured and semi-structured data, all within the familiar SQL Server environment.
When you use PolyBase to query Hadoop data, SQL Server pushes much of the processing to the Hadoop cluster, where the data resides. The PolyBase engine determines when to generate map jobs on an as-needed basis. In addition, the engine parallelizes operations when importing data from Hadoop or Azure.
The PolyBase operations remain transparent to the querying application because the external data source is incorporated into the database schema. You can query and integrate data from different systems, without having to customize your existing applications.
Getting started with PolyBase
To set up PolyBase, you must select the PolyBase feature PolyBase Query Service for External Data as part of your SQL Server installation. You can install PolyBase on only one SQL Server instance per machine. The installation process adds three user databases (DWConfiguration, DWDiagnostics and DWQueue) and two services (SQL Server PolyBase Engine and SQL Server PolyBase Data Movement Service).
Currently, you can use PolyBase to retrieve external data from the Cloudera and Hortonworks versions of Hadoop, as well as from Azure Blob Storage. Microsoft plans to add other data source types in future SQL Server versions.
After you install SQL Server with the PolyBase feature, you must run the sp configure system stored procedure to enable PolyBase for one of the supported data source types. From there, you can take the following steps to add the necessary schema objects to your database:
- Create an external data source object to define a connection to the Hadoop File System or to Azure Blob Storage. To create a data source object, use a CREATE EXTERNAL DATA SOURCE statement.
- Create an external file format object to specify the format of the Hadoop or Azure data. Currently, PolyBase supports three formats: delimited text, Hive RCFile and Hive ORC. To define a file format object, use a CREATE EXTERNAL FILE FORMAT statement.
- Create one or more external tables based on the data source and file format objects. The tables provide a schema-based structure for referencing the external data, and can be referenced just like relational SQL Server tables. To create a table, use a CREATE EXTERNAL TABLE statement.
If you view your database in SQL Server Management Studio, you'll find a couple of new folders in Object Explorer at the database level. The first is the External Resources subfolder beneath the database. This contains any defined data sources and file formats. In addition, the Tables folder will now include the External Tables subfolder, which lists any defined external tables.
Moving ahead with PolyBase
After you create your external tables, you can use the CREATE STATISTICS statement to define query optimization statistics for those tables. When you create statistics on an external table, the database engine imports the data into a temporary table in SQL Server.
Microsoft also has added a number of dynamic management views to SQL Server 2016 for troubleshooting PolyBase queries. In addition, you'll find several catalog views for retrieving information about the external data sources, file formats and tables that have been defined.
By including PolyBase directly in SQL Server, Microsoft has taken an important step in bringing together structured, unstructured and semi-structured data. That data can now be made available to Microsoft BI tools such as Power Pivot and Power Query, as well as third-party BI products such as Tableau and Cognos, providing more flexibility for conducting analytics and running comprehensive reports. Now when you use SQL Server 2016 as your data source, you have easy access to Hadoop and Azure Blob Storage data from whatever environment you're working in.
Learn more about Microsoft Analytics Platform System
Get to know StretchDB, another feature in SQL Server 2016
Listen to our podcast about SQL Server 2016 features