Cask framework aims to speed Azure HDInsight data pipeline builds

A link between Cask Data's CDAP application and integration environment and Azure HDInsight, Microsoft's Hadoop cloud service, is meant to cut development time on big data applications.

Clouds like Microsoft's Azure strive to simplify deployment, but it can be as hard to get end-to-end big data analytics apps up and running in the cloud as it is in the data center. An application data framework from Cask Data Inc. is meant to speed such implementations, and Azure HDInsight is one of the intended targets.

The framework looks to address difficulties that arise as big data analytics is turned into big data workflows and applications.

"From a big data pipeline perspective, a lot goes into doing something like clickstream analysis. Customers want to be able to build these applications much faster," said Pranav Rastogi, principal program manager at Microsoft.

With the Cask Data Application Platform (CDAP), users can create an end-to-end big data pipeline, he said, and creating such workflows, with multiple big data components, is a core problem with big data today.

Cask CTO Nitin Motgi said the company's software helps to break down flows of data into logical processing pieces that run in, for example, MapReduce or Spark, a type of work that can be very time-consuming for even the best of Java developers.

"People are having skills challenges, especially with low-level APIs," he said. The Cask environment seeks to move development to a higher level of abstraction via a model-oriented interface. Meanwhile, he said, CDAP uses underlying container architecture. It organizes data, applications and programs to run on Hadoop.

The Microsoft and Cask technologists discussed CDAP and its role in speeding deployment of big data linchpins like Apache Hadoop and Apache Spark in a recent webcast.

That followed word at the recent Strata + Hadoop World in New York that CDAP has been certified for Azure and is now available to run on Azure HDInsight, the mainstay of Microsoft's Hadoop and Spark efforts.

Extract the value

Forrester Research analyst Mike Gualtieri said combinations like CDAP and Azure HDInsight can address issues that hold back wider use of big data analytics. He said adoption of data and analytics technologies has been slow because it has been hard for businesses to extract value. Big data technologies, he insisted, only become business enablers when they become part of actual applications.

"People think of Hadoop as a data lake or Spark as a data analytics system, but working applications are key," he said. "What Cask does is it lets you work in terms of applications." The system should be thought of more as an application platform than an analytical platform, he suggested.

He likened the effect to early application servers like WebLogic, which brought pieces of middleware together to create applications beginning in the 1990s. "Instead of an application server, you now have an application cluster which acts as your analytic system."

Like the WebLogic application server, which became a prominent hub for development and is now owned by Oracle, CDAP provides a useful level of abstraction for building applications, he said.

Beyond plain, vanilla Hadoop

Microsoft has been steadily working to create its own tools for big data pipeline development and management for Hadoop and Spark on the Azure cloud. But the company has shown some interest in getting some outside help, as seen in its deal with Cask.

In fact, much of Microsoft's HDInsight work was accomplished together with Hortonworks, one of the leading Hadoop distributors and -- not incidentally -- the beneficiary of Microsoft investment over the years. Also, earlier this year, big data analytics platform maker Datameer introduced a version of its Datameer Cloud running on Microsoft's Azure HDInsight.

For its part, Microsoft offers Azure Data Factory as a big data integration service on the cloud. Microsoft's Rastogi admitted "at a high level, Data Factory and CDAP are solving the same problem." There are scenarios, however, where one may have more sources than the other, he said.

There is work ahead on some roadmap alignment that would see use of Azure Data Factory as part of CDAP pipelines, said Cask's Motgi. Clearly, there is a lot of building ahead. A release due later this year, CDAP 4, will include prebuilt pipelines such as Amazon Simple Storage Service (S3) to Azure Storage and SQL Server to HBase for HDInsight, again with the goal of vastly speeding big data application deployment.

Getting more data on Azure is a goal for Microsoft, and the CDAP certification helps in that. The hope is, once the data is onboard, familiar Microsoft analytics tools can be put to work on the data.

Forrester's Gualtieri marked analytics as an area where Microsoft Azure was notably competitive with Amazon Web Services. While he pointed out Amazon's overall lead in cloud, he noted Azure HDInsight scored somewhat ahead of Amazon's comparable Hadoop services in the Forrester Wave study on big data cloud services conducted in the second quarter of this year. The advantage, he said, derived from the analytics tools that Microsoft has.

"Hadoop by itself is plain old vanilla," he said. "So, you look for the 'value add.' Amazon's approach is to build the basic capabilities, while Microsoft's approach is to bring BI [business intelligence] tools to bear on the problem. This is reflected in the investments they have made."
