Data + AI  |  Vaibhav Nivargi  |  January 20, 2013

Big Data: From Batch Processing to Interactive Analysis

‘Big Data’ is either very popular these days, or infamous, depending on who you ask, but it certainly has everyone’s attention. For the most part, it has also become synonymous with Hadoop, and for good reason. Hadoop and its primary programming model, MapReduce, are great for batch-oriented processing of huge amounts of data. With growing data, Hadoop enables you to horizontally scale your cluster by adding commodity nodes and thus keep up with query workloads.

But MapReduce’s batch-oriented processing can only go so far. High execution latency, weak support for real-time and streaming data, an API that is too low-level and involved for mass adoption, and other issues have driven a dramatic evolution of the platform itself. This evolution has brought support for higher-level languages (Pig and Hive), a real-time storage engine (HBase), a utility for writing jobs in languages other than Java (Hadoop Streaming), and, most recently, Impala from Cloudera.
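To make the "too low-level" complaint concrete, here is a minimal in-memory sketch of the MapReduce pattern in plain Python. It is not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and everything runs in a single process. It only illustrates how much explicit plumbing even a word count needs when written as separate map, shuffle, and reduce steps.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key.

    A real cluster does this across machines; here it is a local dict.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three steps in sequence over an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data big", "data"]))
```

In a higher-level language such as Hive, the same computation is roughly a single GROUP BY query, which is the kind of gap that pushed the platform toward those layers.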

(As a digression, it is interesting how many of these transformations and alternatives mirror Google’s own evolution beyond the MapReduce paradigm, with systems like Dremel, Tenzing, PowerDrill, and F1.)

Today Hadoop can be evaluated in a new light. It is moving beyond MapReduce’s initial success as a better, batch-oriented way to support ETL. With enterprise architectures awash in Hadoop-to-database connectors, technologists realize that a new approach is necessary. New Apache™ projects are focused on evolving Hadoop beyond MapReduce and into a formidable Big Data stack.

There has also been a flurry of alternatives introduced to address one or more weaknesses of Hadoop MapReduce. These include the Dremel-inspired Drill project (or Google BigQuery), Spark and Shark from the Berkeley AMPLab BDAS stack, and Storm from Twitter. I’ll get deeper into my experience with these, and others, in upcoming blog posts.

There are many other promising projects beyond the ones mentioned above, each with its own strengths, which is why Hadoop MapReduce is only the beginning of Big Data analytics. But these alternatives don’t quite deliver a big data solution for the way business users want to work. Business users want to derive actionable, trustworthy insights from diverse data sources, rapidly and in a repeatable fashion.

At the core of such a solution is a data processing engine which has a flexible data model that can work with diverse data sources at scale; an expressive execution engine that subsumes both batch processing and real-time, interactive processing; and one that supports advanced operators without overwhelming users with too much complexity. At ClearStory Data, we are building such a solution, and are excited to start sharing the details. Watch this space as we continue to reveal more about what we do and how we do it.
