Microsoft has made little secret of its move into the realm of “big data” -- particularly the massive sets of unstructured
files that companies like Google and Yahoo have been grappling with since their inceptions. Google, for instance, processes 20 petabytes (20,000 terabytes) of data each day, much of that in the form of text-based index files. Yet big data is hardly limited to indexes. Corporations regularly manage large volumes of emails, documents, Web server logs, social networking feeds and an assortment of other unstructured information.
To get a handle on all their data, companies such as Autodesk, IBM and Facebook -- along with Google and Yahoo, not surprisingly -- have implemented Apache Hadoop, an open source cloud platform designed to manage unstructured data sets too large for traditional tools. Microsoft has taken notice of all this Hadoop hoopla and forged its own SQL Server-Hadoop connection.
Recently, the company released connectors that let organizations move large amounts of data between Hadoop clusters and SQL Server 2008 R2, Parallel Data Warehouse or the upcoming release SQL Server 2012, formerly known as Denali. Because the connectors allow data to move in both directions, businesses can take advantage of the storage and data processing power of SQL Server and still leverage Hadoop’s ability to manage large, unstructured data sets.
Yet the SQL Server-Hadoop connectors aren’t quite what SQL Server users are accustomed to. The connectors, it turns out, are command-line tools implemented in a Linux environment. As a result, SQL Server users planning to implement one of the connectors will benefit from having a conceptual overview of how they fit into the Hadoop environment.
The Apache Hadoop cluster
Hadoop is a master-slave architecture implemented on a cluster of Linux computers. To process such massive volumes of data, the Hadoop environment must include the following components:
- The master node manages the slave nodes and the tasks related to processing, managing and accessing the data files. The master node also serves as the primary access point for outside applications making job requests into the Hadoop environment. The node is also referred to as the master server.
- The name mode runs the NameNode daemon, which manages the Hadoop Distributed File System (HDFS) namespace and regulates access to the data files. The node supports operations such as opening, closing and renaming files as well as determining how to map blocks of data on the slave nodes. In smaller environments, the name mode can be implemented on the same server as the master node.
- Each slave node runs the DataNode daemon, which manages the storage of data files and processes read and write requests to those files. The slave nodes are made up of commodity hardware components that are relatively inexpensive and readily available, making it possible to run parallel operations on thousands of computers.
Figure 1 illustrates how the various components interact within the Hadoop environment. Notice that the master node runs the JobTracker daemon and each slave node runs the TaskTracker daemon. JobTracker processes requests from client applications and assigns those tasks to the various instances of TaskTracker. When it receives instructions from JobTracker, TaskTracker works with the DataNode daemon to run the assigned tasks and handle the movement of data during each phase of the operation.
Figure 1. You must implement the SQL Server-Hadoop connector within the Hadoop cluster.
The MapReduce framework
As Figure 1 also illustrates, the master node supports the framework necessary to run MapReduce operations. MapReduce technology lies at the core of the Hadoop environment. In fact, you can think of Hadoop as a MapReduce framework, with components such as JobTracker and TaskTracker playing critical roles within that framework.
MapReduce breaks large data sets into small, manageable chunks that can be spread across thousands of computers. It also provides the mechanisms necessary to perform numerous parallel operations, search petabytes of data, manage complex client requests and perform in-depth analyses on the data. In addition, the MapReduce structure provides the load balancing and fault tolerance needed to ensure that operations are completed quickly and accurately.
The MapReduce framework works hand-in-hand with the HDFS framework, which stores each file as a sequence of blocks. Blocks are replicated across the cluster for fault tolerance, and except for the last block, all blocks within a file are the same size. The DataNode daemon on each slave node works with HDFS to create, delete and replicate blocks. However, an HDFS file can be written to only once and by only one writer at a time.
SQL Server-Hadoop connector implementation
You implement the SQL Server-Hadoop connector on the master node in the Hadoop cluster. However, the master node must also be configured with Sqoop and Microsoft’s Java Database Connectivity driver. Sqoop is an open source, command-line tool used to import data from a relational database, transform the data using the Hadoop MapReduce framework and then export the data back into the database.
When the SQL Server-Hadoop connector is also installed on the master node, you can use Sqoop to import or export SQL Server data. Note, however, that Sqoop and the connector are operating within a Hadoop-centric view of your data. That means when you use Sqoop to import data, you’re retrieving data from a SQL Server database and adding it to the Hadoop environment, but when you export data, you’re retrieving data from Hadoop and sending it to SQL Server.
Sqoop lets you import SQL Server data into or export Hadoop out of one of the following three storage types:
- Text files: Basic text files delimited with commas, tabs or other supported characters.
- SequenceFiles: Binary files that contain serialized record data.
- Hive tables: Tables in a Hive data warehouse, a special warehousing infrastructure built on top of Hadoop.
Together, SQL Server and the Hadoop environment (MapReduce and HDFS) enable users to process large amounts of unstructured data and integrate that data into a structured environment that supports reporting, analysis and business intelligence.
Microsoft doesn’t stop there
The SQL Server-Hadoop connectors represent significant inroads into Microsoft’s quest to manage big data. As a result, Microsoft has had to embrace the open source world in a big way, given that Hadoop, Linux and Sqoop are all open source technologies. Yet Microsoft is looking even farther down the road. Later this year, the company is expected to debut a Hadoop derivative solution that will run as a service on the Windows Azure cloud platform.
Next year the company plans to release the same type of service for Windows Server. However, releasing the Hadoop connectors has in itself been a significant step forward. Businesses can immediately start incorporating big data into their SQL Server environments, and Microsoft can keep looking for better ways to incorporate that data.
ABOUT THE AUTHOR
Robert Sheldon is a technical consultant and the author of numerous books, articles and training materials related to Microsoft Windows, various relational database management systems, and business intelligence design and implementation. Find more information at http://rhsheldon.com.