“Big data” is weighing on a lot of minds lately, including those at Microsoft. Last month, the company released a community technology preview for two connectors to the open source distributed computing framework Hadoop for big data processing, one for SQL Server and one for SQL Server Parallel Data Warehouse (PDW).
In this month’s “SQL in Five,” Microsoft database platform specialist Mark Kromer sheds light on what the company has called its “first step” into the vast new world of big data. Kromer also addresses Microsoft’s switch to Open Database Connectivity (ODBC) support for relational data access, what benefits and challenges the change presents for developers and how it fits into the company’s push toward the cloud.
What big-data-processing capabilities is Microsoft hoping to deliver to customers with its SQL Server connectors for Hadoop?
Mark Kromer: One use case that I’m familiar with for these adapters would be for a business with big data requirements that is using Hadoop, perhaps with data stored in a scale-out file system. [The business] can leverage its SQL Server investments by using SQL Server PDW and SQL Server BI [business intelligence] to provide analytical insights into its big data. These are two-way connectors, allowing you to move data between SQL Server and HDFS [Hadoop Distributed File System] so that you can move large amounts of SQL Server data, say, from a large PDW-distributed data warehouse, into Hadoop and likewise use SQL Server’s BI capabilities by analyzing Hadoop data in SQL Server.
What challenges do processing such huge amounts of data hold for Microsoft customers?
Kromer: Businesses that have big data requirements like search engines (think Google, Yahoo, Bing) or large social networking sites have a need to process super-large (aka “big”) data sets very, very quickly. In these cases, it may be beneficial to utilize a distributed NoSQL approach with tools like Hadoop and MapReduce, where the database schema is minimized with classic SQL constructs like ACID [atomicity, consistency, isolation, durability] and referential integrity put aside in favor of speed and easy data access. Microsoft is supporting our customers with big data requirements with these connectors. There are also some very exciting projects coming out of Microsoft Research and [Windows] Azure around distributed processing and big data. There is a white paper that [Microsoft watcher] Andrew Brust published for Microsoft, talking about using existing capabilities in Windows Azure, such as Azure Table Storage, for storing schema-lite structured data in key [or] value pairs for easy and quick access.
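For readers unfamiliar with the MapReduce model Kromer mentions, the idea can be illustrated with a minimal, single-process sketch in Python. This is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical, chosen only to show how schema-free records are turned into key/value pairs, grouped by key and then aggregated. In a real cluster, each phase would run in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(record):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse the values for one key into a single total."""
    return key, sum(values)


def word_count(records):
    """Run map, shuffle and reduce in sequence on an iterable of text lines."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical sample input; no schema is declared for these records.
    lines = ["big data big clusters", "data moves between clusters"]
    print(word_count(lines))
```

Note that nothing here enforces a schema, transactions or referential integrity; the records are plain text and the only structure is the key/value pair, which is the trade-off Kromer describes.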
Microsoft has called the release of the Hadoop connectors a “first step” on its big data journey. What will the next step be?
Kromer: With these “beta” connectors, it’s too early to comment on the roadmap at this point. Once we see a little more from the SQL Server community in terms of feedback and testing with Hadoop and SQL Server, then we can have a clearer picture of the needs that businesses will have. That feedback will help determine what the next steps should look like. While Hadoop and MapReduce are currently very popular with businesses that have big data requirements, look for continued Microsoft investments in big data and distributed programming. SQL Server PDW is the first fully distributed database from Microsoft, albeit meant as an on-premises data warehouse. SQL Azure is rolling out SQL Federations soon, which will allow you to distribute OLTP [online transaction processing] database workloads, and then you could use this feature as a way to distribute unstructured big data, though with an associated database schema. And along those same lines, in terms of distributed computing, the Windows HPC [High-Performance Computing] team just released LINQ to HPC for processing big data sets by distributing LINQ operations across the nodes of an HPC cluster.
Microsoft recently announced it will focus on ODBC for SQL Server application programming, ending support for Object Linking and Embedding Database (OLE DB) after the release of the upcoming version of SQL Server, code-named Denali. What drove this change?
Kromer: Some aspects of this decision likely are due to feedback from the SQL Server community indicating that cross-platform support is a very important requirement, which would lead one to develop solutions using ODBC instead of OLE DB. Another factor is Microsoft’s continued investment in the Microsoft cloud platform, Windows Azure and the cloud database, SQL Azure. SQL Azure is supported by the SQL Server ODBC driver, not by OLE DB. OLE DB support in SQL Server is “deprecated” as of the Denali release, which means that it will be supported as a driver in SQL Server Denali, but ODBC appears to be the recommended mechanism to use going forward.
What benefits will the ODBC shift bring for developers? What challenges does it present?
Kromer: The benefits are primarily around cross-platform (non-Windows) support, since ODBC is not a Microsoft-only technology, and around cloud connectivity into SQL Azure. The challenges will be for application developers and software vendors that do not support ODBC today to migrate their applications, which is obviously much more of a challenge if you are not familiar with the ODBC data access model. There is an FAQ available with more on the shift to ODBC for SQL Server.
Mark Kromer has more than 16 years of experience in IT and software engineering and is well-known in the business intelligence (BI), data warehouse and database communities. He is the Microsoft data platform technology specialist for the mid-Atlantic region. Check out his blog at MSSQLDUDE.