When David DeWitt suggested in October an interest in contributing to a superior database management system, it seemed that a breakthrough was under way. The occasion was the Professional Association for SQL Server’s PASS Summit in Seattle, where DeWitt is a fan favorite and perennial keynote speaker. His central theme: The galaxy of data has come to feature huge swaths of unstructured data, most of which land in the world of NoSQL databases, but the SQL camp should not feel overshadowed. As a veteran SQL expert and Microsoft technical fellow, DeWitt predictably effused about the pivotal role of SQL in the handling of the two types of data.
While this is not a knockdown battle for supremacy, the SQL camp can feel marginalized when discussion turns to improved management of unstructured data. “In recent years data people felt that they needed to move away from SQL to be effective,” said David Menninger, an analyst at Ventana Research in San Ramon, Calif. “However, integrating the two provides the most opportunities.”
NoSQL, shorthand for “not only SQL,” was popularized by Google’s open source approach, which emerged from the need to absorb data, such as clickstream, with exceptional scalability. It has grown with the unremitting, cascading presence of unstructured data. SQL is not without some answers for unstructured data, but the flexibility of NoSQL makes it a natural fit for the bulk of unstructured data.
For the record, DeWitt made no product announcements during his speech at the conference. What’s more, he insisted that no database paradigm shift was imminent -- developments in data management systems would not be revolutionary in the way that network and hierarchic models gave way to the relational database.
Still, DeWitt said mounting quantities of unstructured data amid heightened demand for speed and refined analytics expose the inherent inefficiencies of NoSQL and Apache Hadoop, an open source distributed computing framework. The use of the Hadoop Sqoop import tool to move data from relational databases to NoSQL databases requires excessive scanning and, in the end, delivers limited performance. Microsoft recently plugged Hadoop connectors into SQL Server 2008 R2 and will package them with the upcoming database release, SQL Server 2012, due for a “virtual launch” March 7.
“There has to be a better way that is more efficient and more powerful than just a bridge,” DeWitt said. His project, dubbed “Enterprise Data Manager,” is based on SQL Server parallel database technology and will boast improved scalability, fault tolerance and the ability to analyze large quantities of unstructured data. “We are going to try to build one, so stay tuned,” he said.
At the time of this writing Microsoft management has muzzled DeWitt, who balances the persona of the academic with that of the tractable employee. Details of the “Enterprise Data Manager” are not readily available, except that it is being carried out at Microsoft’s lab near the University of Wisconsin-Madison, where DeWitt is professor emeritus. Consider Microsoft’s official word on the subject: “SQL Server Parallel Data Warehouse offers customers high scalability to hundreds of terabytes, scalable performance and complete data warehouse platform thanks to integration with Microsoft business intelligence tools as well as complementary tools for master data management and streaming data.”
The open source arena is increasingly commanding the thrust of innovation. Google, Facebook and Amazon rely heavily on open source, while Oracle, Microsoft and IBM find themselves grudgingly nudged into it. So the next-generation database management systems could continue to gain traction outside the confines of proprietary environments.
Michael Stonebraker, another heavy hitter in SQL circles, is banking on open source in a big way. Two years ago he co-founded VoltDB, a Massachusetts-based tech company that’s putting forward a highly scalable, OLTP-streamlined SQL database that is open source. “High performance at low cost is a great way to go,” he said. “Open source, over time, will take over everything.”
If Stonebraker’s forecast is extreme, he would appear to be riding the predominant trend. The flowering of the Hadoop ecosystem in recent years has placed open source models -- and NoSQL -- in pole position for handling unstructured data, Stonebraker said. The bedrock of SQL has been the schema. With the capability to extract, transform and load data and with database attributes known as ACID, the SQL model powers forward with ironclad consistency. The enormous volume of unstructured data, much of it amorphous and of minimal value, has allowed the NoSQL realm to explore and expand. Yet the current galaxy of “big data” is anything but static. At the PASS Summit, DeWitt said these times represent a “golden age for database people.” That’s because of the wealth of prime-time opportunities surfacing across the spectrum.
Such opportunities spring from deficiencies of the Hadoop ecosystem, said Mark Kromer, database platform specialist at Microsoft. “A lot of projects trying to improve on parts of Hadoop are in the early stages, so it’s hard to say what changes will take place.”
Given their limited resources and the formidable challenge of handling data of unprecedented scale, the Hadoop pioneers achieved greatness. But clearly, strong performance has demanded higher priority.
Adapting to prevailing systems is the most common approach to innovation. Companies like San Francisco-based Splunk have developed solutions based on the need for speed and advanced analytics applied to unstructured data, or “machine data,” as the company calls it. The software is essentially an upgraded proprietary version of MapReduce, which is the software computing framework for Hadoop.
Synthesys, by Digital Reasoning in Franklin, Tenn., also hangs its hat on speed -- as in real-time advanced analytics, efficiency and flexibility -- in the context of oceans of unstructured text. Synthesys uncovers buried entities, or people, places, things and events (see “Unstructured data, structured data -- what does it all mean?”) in unstructured data and shows relationships among the entities. Innovations like these show the acceleration of improved handling of unstructured data, Menninger said. “The broader market can benefit from advanced analytics and the better speed and accuracy of the results,” he said.
As unstructured data becomes more tangible, more manageable and more valuable, the spheres of SQL and NoSQL will necessarily expand their common ground. The Hadoop ecosystem, as DeWitt gleefully observed, incorporates substantial SQL components. Both Hive, a Hadoop-based data warehouse developed by Facebook, and Pig, a Hadoop-based language developed by Yahoo, are semi-declarative and are distinctively “SQL-like.” This is no idle distinction, DeWitt said; of Facebook’s 150,000 daily jobs, only 500 run on MapReduce. The remainder runs on Hive. Another notch in the belt for SQL.
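The contrast between a declarative, “SQL-like” job and a hand-written MapReduce job can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SQLite stands in for a SQL-like engine (it is not Hive), the map and reduce functions run in a single process rather than on a Hadoop cluster, and the clickstream records and function names are invented for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical clickstream records: (visitor, page). Invented for illustration.
CLICKS = [
    ("ann", "/home"),
    ("bob", "/home"),
    ("ann", "/cart"),
    ("cy", "/home"),
    ("bob", "/cart"),
    ("ann", "/home"),
]


def count_declarative(clicks):
    """Declarative style: state the result wanted and let the engine plan it."""
    conn = sqlite3.connect(":memory:")  # SQLite is a stand-in here, not Hive
    conn.execute("CREATE TABLE clicks (visitor TEXT, page TEXT)")
    conn.executemany("INSERT INTO clicks VALUES (?, ?)", clicks)
    rows = conn.execute(
        "SELECT page, COUNT(*) FROM clicks GROUP BY page"
    ).fetchall()
    conn.close()
    return dict(rows)


def map_phase(record):
    """Map step: emit a (key, value) pair for each input record."""
    _visitor, page = record
    yield page, 1


def reduce_phase(page, counts):
    """Reduce step: combine all values that share one key."""
    return page, sum(counts)


def count_mapreduce(clicks):
    """Procedural style: the programmer spells out map, shuffle and reduce."""
    grouped = defaultdict(list)
    for record in clicks:
        for key, value in map_phase(record):
            grouped[key].append(value)  # "shuffle": group values by key
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(count_declarative(CLICKS))
    print(count_mapreduce(CLICKS))
```

Both functions produce the same page counts. The declarative version is one query, while the MapReduce-style version requires the author to write each phase by hand, which may help explain the preference for SQL-like tools that DeWitt described.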
The gulf between SQL and NoSQL can resemble the divide between liberals and conservatives. The mindsets are different, the priorities near opposite. SQL is the conservative father, meticulous and reliable. NoSQL is the young son, carefree, fast and flexible. Rick Sherman, founder of Athena IT Solutions, said the two sides typically don’t enjoy each other’s type of work. “Mutual co-existence is the way it is,” he said. Perhaps this divide will impede innovation in next-generation databases. Perhaps NoSQL’s dominance of unstructured data will pose a high hurdle to DeWitt and his “Enterprise Data Manager.”
Can DeWitt, in one master stroke, create a SQL database with excellent scalability and fault tolerance? Can his next-generation database management system also expertly handle unstructured data?
DeWitt told his audience to “stay tuned” for news of his project, suggesting progress sooner rather than later. As for a suggested name of the project, “The Golden Bear” rolls off the tongue more easily than “Enterprise Data Manager.”
Unstructured data, structured data -- what does it all mean?
As unstructured data increasingly represents opportunities for profitable innovation, it is seeping into the popular lexicon. The term is bandied about in conversations about “big data,” which grows day by day in interest and relevance. It is now common for the precise meaning of unstructured data to fall by the wayside. That’s because it sometimes becomes confused with semi-structured data. Further clouding the terminology, unstructured data is also being used to refer to both unstructured and semi-structured data.
What is structured data?
Structured data is organized in semantic pieces, called entities. Similar entities are grouped together. Entities in the same group -- or schema -- share the same attributes. The fields of the attributes are rigidly fixed in a file, as in a spreadsheet. All structured data share defined formats and follow the same order.
What is semi-structured data?
Semi-structured data has no strict formatting and no specific database engine. Attributes are less predictable as data is organized in semantic entities. The order of attributes of semi-structured data is frequently unimportant. Furthermore, the size and type of same attributes in a group can be different. Most examples of unstructured data derive from the Internet, such as Web pages and email.
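The distinction drawn in this sidebar can be made concrete with a small Python sketch. It is a minimal illustration, not a formal definition: the schema, the sample rows, the JSON documents and the helper names are all hypothetical, chosen only to show fixed attributes on one side and varying attributes on the other.

```python
import json

# Structured: one fixed schema; every row has the same fields in the same order.
SCHEMA = ("id", "name", "city")
ROWS = [
    (1, "Ann", "Seattle"),
    (2, "Bob", "Madison"),
]

# Semi-structured: self-describing records. Attribute order differs, some
# attributes appear in only one record, and "id" is a number in one record
# and a string in the other.
DOCS = [
    '{"id": 1, "name": "Ann", "tags": ["sql", "hadoop"]}',
    '{"name": "Bob", "id": "2", "email": "bob@example.com"}',
]


def fits_schema(row, schema=SCHEMA):
    """A structured row must supply exactly one value per schema field."""
    return len(row) == len(schema)


def attribute_sets(docs):
    """Return the set of attribute names carried by each document."""
    return [set(json.loads(doc)) for doc in docs]


def shared_attributes(docs):
    """Attributes present in every document; the rest vary record to record."""
    return set.intersection(*attribute_sets(docs))


if __name__ == "__main__":
    print(all(fits_schema(row) for row in ROWS))
    print(sorted(shared_attributes(DOCS)))
    print([type(json.loads(doc)["id"]).__name__ for doc in DOCS])
```

In the structured rows, position alone identifies each value. In the documents, each record names its own attributes, so order does not matter and the type of the same attribute can differ between records.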
Roger du Mars is a freelance writer based in Redmond, Wash. He has written for publications such as Time magazine, USA Today and The Boston Globe, and he was the Seoul, South Korea, bureau chief of Asiaweek and the South China Morning Post.