SQL Server-Hadoop highway to 'big data' ventures into new territory
Robert Sheldon, Contributor
Microsoft has made little secret of its move into the realm of “big data” --
particularly the massive sets of unstructured files that companies like Google and Yahoo have been
grappling with since their inceptions. Google, for instance, processes 20 petabytes (20,000 terabytes) of
data each day, much of that in the form of text-based index files. Yet big data is hardly limited
to indexes. Corporations regularly manage large volumes of emails, documents, Web server logs,
social networking feeds and an assortment of other unstructured information.
To get a handle on all their data, companies such as Autodesk, IBM
Premium Access
Register now for unlimited access to our premium content across our network of over 70 information Technology web sites.
By submitting you agree to receive email from TechTarget and its partners. If you reside outside of the United States, you consent to having your personal data transferred to and processed in the United States.
Privacy
Dig Deeper
-
People who read this also read...
-
This was first published in November 2011
and Facebook -- along with Google and
Yahoo,
not surprisingly -- have implemented
Apache Hadoop, an open
source cloud platform designed to manage unstructured data sets too large for traditional tools.
Microsoft has taken notice of all this Hadoop hoopla and forged its own SQL Server-Hadoop
connection.
Recently, the company released connectors that let organizations move large amounts of data
between Hadoop clusters and SQL Server 2008 R2, Parallel
Data Warehouse or the upcoming release SQL
Server 2012, formerly known as Denali. Because the connectors allow data to move in both
directions, businesses can take advantage of the storage and data processing power of SQL Server and still
leverage Hadoop’s ability to manage large, unstructured data sets.
Yet the SQL Server-Hadoop connectors aren’t quite what SQL Server users are accustomed to. The
connectors, it turns out, are command-line
tools implemented in a Linux environment. As a result, SQL Server users planning to implement
one of the connectors will benefit from having a conceptual overview of how they fit into the Hadoop
environment.
The Apache Hadoop cluster
Hadoop is a master-slave architecture
implemented on a cluster of Linux computers. To process such massive volumes of data, the Hadoop
environment must include the following components:
- The master node manages the slave nodes and the tasks related to processing, managing
and accessing the data files. The master node also serves as the primary access point for outside
applications making job requests into the Hadoop environment. The node is also referred to as the
master server.
- The name mode runs the NameNode daemon, which manages the Hadoop
Distributed File System (HDFS) namespace and regulates access to the data files. The node
supports operations such as opening, closing and renaming files as well as determining how to map
blocks of data on the slave nodes. In smaller environments, the name mode can be implemented on the
same server as the master node.
- Each slave node runs the DataNode daemon, which manages the storage of data files and
processes read and write requests to those files. The slave nodes are made up of commodity hardware
components that are relatively inexpensive and readily available, making it possible to run
parallel operations on thousands of computers.
Figure 1 illustrates how the various components interact within the Hadoop environment. Notice
that the master node runs the JobTracker daemon and each slave node runs the TaskTracker daemon.
JobTracker processes requests from client applications and assigns those tasks to the various
instances of TaskTracker. When it receives instructions from JobTracker, TaskTracker works with the
DataNode daemon to run the assigned tasks and handle the movement of data during each phase of the
operation.
Figure 1. You must implement the SQL Server-Hadoop connector within the Hadoop
cluster.
The MapReduce framework
As Figure 1 also illustrates, the master node supports the framework necessary to run MapReduce operations.
MapReduce technology lies at the core of the Hadoop environment. In fact, you can think of Hadoop
as a MapReduce
framework, with components such as JobTracker and TaskTracker playing critical roles within
that framework.
MapReduce breaks large data sets into small, manageable chunks that can be spread across
thousands of computers. It also provides the mechanisms necessary to perform numerous parallel
operations, search petabytes of data, manage complex client requests and perform in-depth analyses
on the data. In addition, the MapReduce structure provides the load balancing and fault tolerance
needed to ensure that operations are completed quickly and accurately.
The MapReduce framework works hand-in-hand with the HDFS framework, which stores each file as a
sequence of blocks. Blocks are replicated across the cluster for fault tolerance, and except for
the last block, all blocks within a file are the same size. The DataNode daemon on each slave node
works with HDFS to create, delete and replicate blocks. However, an HDFS file can be written to
only once and by only one writer at a time.
SQL Server-Hadoop connector implementation
You implement the SQL Server-Hadoop connector on the master node in the Hadoop cluster.
However, the master node must also be configured with Sqoop
and Microsoft’s Java
Database Connectivity driver. Sqoop is an open source, command-line tool used to import data
from a relational database, transform the data using the Hadoop MapReduce framework and then export
the data back into the database.
When the SQL Server-Hadoop connector is also installed on the master node, you can use Sqoop to
import or export SQL Server data. Note, however, that Sqoop and the connector are operating within
a Hadoop-centric view of your data. That means when you use Sqoop to import data, you’re retrieving
data from a SQL Server database and adding it to the Hadoop environment, but when you export data,
you’re retrieving data from Hadoop and sending it to SQL Server.
Sqoop lets you import SQL Server data into or export Hadoop out of one of the following three
storage types:
- Text files: Basic text files delimited with commas, tabs or other supported
characters.
- SequenceFiles: Binary files that contain serialized record data.
- Hive tables: Tables in a Hive data warehouse, a special
warehousing infrastructure built on top of Hadoop.
Together, SQL Server and the Hadoop environment (MapReduce and HDFS) enable users to process
large amounts of unstructured data and integrate that data into a structured environment that
supports reporting, analysis and business intelligence.
Microsoft doesn’t stop there
The SQL Server-Hadoop connectors represent significant inroads into Microsoft’s
quest to manage big data. As a result, Microsoft has had to embrace the open source world in a
big way, given that Hadoop, Linux and Sqoop are all open source technologies. Yet Microsoft is
looking even farther down the road. Later this year, the company is expected to debut a Hadoop
derivative solution that will run as a service on the Windows Azure cloud
platform.
Next year the company plans to release the same type of service for Windows Server. However,
releasing the Hadoop connectors has in itself been a significant step forward. Businesses can
immediately start incorporating big data into their SQL Server environments, and Microsoft can keep
looking for better ways to incorporate that data.
ABOUT THE AUTHOR
Robert Sheldon is a technical consultant and the author of numerous books, articles and training
materials related to Microsoft Windows, various relational database management systems, and
business intelligence design and implementation. Find more information at http://rhsheldon.com.
Disclaimer:
Our Tips Exchange is a forum for you to share technical advice and expertise with your peers and to learn from other enterprise IT professionals. TechTarget provides the infrastructure to facilitate this sharing of information. However, we cannot guarantee the accuracy or validity of the material submitted. You agree that your use of the Ask The Expert services and your reliance on any questions, answers, information or other materials received through this Web site is at your own risk.
Join the conversationComment
Share
Comments
Results
Contribute to the conversation