Microsoft SQL Server, PDW reach out to Hadoop for big data processing
Jason Sparapani, Associate Editor
“Big data” is weighing on a lot of minds lately, including those at Microsoft. Last month, the
company released a community technology preview for two connectors to the open source distributed
computing framework Hadoop for big
data processing, one for SQL Server and one for SQL
Server Parallel Data Warehouse (PDW).
In this month’s “SQL in Five,” Microsoft database platform specialist Mark Kromer sheds light
on what the company has called its
“first step” into the vast new world of big data. Kromer also
addresses Microsoft’s switch to
Open Database
Connectivity (ODBC) support for relational data access, what benefits and challenges the change
presents for developers and how it fits into the company’s push toward the cloud.
What big-data-processing capabilities is Microsoft hoping to deliver to customers with its
SQL Server connectors for Hadoop?
Mark Kromer: One use case that I’m familiar with for these adapters would be for a
business with big data requirements that is using Hadoop, perhaps with data stored in a scale-out
file system. [The business] can leverage its SQL Server investments by using SQL Server PDW and SQL
Server BI [business intelligence] to provide analytical insights into its big data. These are
two-way connectors, allowing you to move data between SQL Server and HDFS [Hadoop Distributed File
System] so that you can move large amounts of SQL Server data, say, from a large PDW-distributed
data warehouse, into Hadoop and likewise use SQL Server’s BI capabilities by analyzing Hadoop data
in SQL Server.
What challenges does processing such huge amounts of data hold for Microsoft customers?
Kromer: Businesses that have big data requirements like search engines (think Google,
Yahoo, Bing) or large social networking sites have a need to process super-large (aka “big”) data
sets very, very quickly. In these cases, it may be beneficial to utilize a distributed NoSQL
approach with tools like Hadoop and MapReduce, where the database schema is minimized, with classic
SQL constructs like ACID [atomicity, consistency, isolation, durability] and referential integrity
put aside in favor of speed and easy data access. Microsoft is supporting our customers with big
data requirements with these connectors. There are also some very exciting projects coming out of
Microsoft Research and [Windows] Azure around distributed processing and big data. There is a white
paper that [Microsoft watcher] Andrew Brust published for Microsoft, talking about using
existing capabilities in Windows Azure, such as Azure Table Storage, for storing schema-lite
structured data in key [or] value pairs for easy and quick access.
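For readers unfamiliar with the MapReduce model Kromer mentions, the following is a minimal, single-process Python sketch of the idea, not Hadoop's actual API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen for illustration; a real Hadoop job would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(records: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on an in-memory data set."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Calling `word_count(["big data", "Big Hadoop data"])` counts each word across both records. Note that no schema is declared anywhere: the records are plain text and the only structure is the key and value pairs the map step emits.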
Microsoft has called the release of the Hadoop connectors a “first step” on its big data
journey. What will the next step be?
Kromer: With these “beta” connectors, it’s too early to comment on the roadmap at this
point. Once we see a little more from the SQL Server community in terms of feedback and testing
with Hadoop and SQL Server, then we can have a clearer picture of the needs that businesses will
have. That feedback will help determine what the next steps should look like. While Hadoop
and MapReduce are currently very popular with businesses that have big data requirements, look for
continued Microsoft investments in big data and distributed programming. SQL Server PDW is the
first fully distributed database from Microsoft, albeit meant as an on-premises data warehouse. SQL
Azure is rolling out SQL
Federations soon, which will allow you to distribute OLTP [online transaction processing]
database workloads. You could then use this feature as a way to distribute unstructured big
data, though with an associated database schema. And along those same lines, in terms of
distributed computing, the Windows HPC [High-Performance Computing] team just released LINQ to HPC for processing
big data sets by distributing LINQ operations across the nodes of an HPC cluster.
Microsoft recently announced it will focus on ODBC for SQL Server application programming,
ending support for Object Linking
and Embedding Database (OLE DB) after the release of the upcoming
version of SQL Server, code-named Denali. What drove this change?
Kromer: Some aspects of this decision likely are due to feedback from the SQL Server
community indicating that cross-platform support is a very important requirement, which would lead
one to develop solutions using ODBC instead of OLE DB. Another factor is Microsoft’s continued
investment in the Microsoft cloud platform, Windows Azure and the cloud database, SQL Azure. SQL
Azure is supported by the SQL Server ODBC driver, not by OLE DB. OLE DB support in SQL Server is
“deprecated” as of the Denali release, which means that it will be supported as a driver in SQL
Server Denali, but ODBC appears to be the recommended mechanism to use going forward.
What benefits will the ODBC shift bring for developers? What challenges does it
present?
Kromer: The benefits are primarily around cross-platform (non-Windows) support, since ODBC is not
a Microsoft-only technology, and around cloud connectivity into SQL Azure. The challenges will be for
application developers and software vendors that do not support ODBC today to migrate their
applications, which is obviously much more of a challenge if you are not familiar with the ODBC
data access model. There is an FAQ
available with more on the shift to ODBC for SQL Server.
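For developers new to the ODBC data access model, a connection is typically described by a string of `keyword=value` pairs separated by semicolons. The Python sketch below assembles such a string under stated assumptions: the helper names are hypothetical, and the default driver name is only an example that must match a driver actually installed on the client machine.

```python
def quote_value(value: str) -> str:
    """Wrap a value in braces when it contains characters ODBC treats specially."""
    if any(ch in value for ch in ";{}") or value != value.strip():
        # Inside a braced value, a closing brace is escaped by doubling it.
        return "{" + value.replace("}", "}}") + "}"
    return value


def build_connection_string(
    server: str,
    database: str,
    user: str,
    password: str,
    driver: str = "ODBC Driver 17 for SQL Server",  # assumption: installed driver name
) -> str:
    """Assemble a semicolon-separated ODBC connection string."""
    parts = [
        ("Driver", "{" + driver + "}"),
        ("Server", quote_value(server)),
        ("Database", quote_value(database)),
        ("Uid", quote_value(user)),
        ("Pwd", quote_value(password)),
        ("Encrypt", "yes"),  # request an encrypted connection, as cloud databases usually expect
    ]
    return ";".join(f"{key}={value}" for key, value in parts)
```

The resulting string could then be handed to an ODBC client library; in Python, the third-party `pyodbc` package accepts one through `pyodbc.connect(connection_string)`. The server, database and credentials are placeholders to be replaced with real values.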
Mark Kromer has more than 16 years of experience in IT and software engineering and is
well-known in the business intelligence (BI), data warehouse and database communities. He is the
Microsoft data platform technology specialist for the mid-Atlantic region. Check out his blog at MSSQLDUDE.