Big Data: From Batch Processing to Interactive Analysis

‘Big Data’ is either very popular these days, or infamous, depending on whom you ask, but it certainly has everyone’s attention. For the most part, it has also become synonymous with Hadoop, and for good reason. Hadoop and its primary programming model, MapReduce, are great for batch-oriented processing of huge amounts of data. As data grows, Hadoop lets you scale your cluster horizontally by adding commodity nodes and thus keep up with query workloads.

But MapReduce’s batch-oriented processing can only go so far. High execution latency, weak support for real-time and streaming data, an API that is too low-level and involved for mass adoption, and other issues have driven a dramatic evolution of the platform itself. That evolution has added support for higher-level languages (Pig & Hive), a real-time storage engine (HBase), a way to write jobs in any language that reads standard input and writes standard output (Hadoop Streaming, which despite its name is not a stream-processing system), and, most recently, Impala, an interactive SQL query engine from Cloudera.
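To make the "too low-level" complaint concrete, here is a minimal sketch of a word count written in the MapReduce style, next to the same computation as a single higher-level expression. This is plain Python on one machine, not Hadoop's actual Java API; the function names (map_phase, shuffle_phase, reduce_phase and the two word_count_* helpers) are my own, chosen only for illustration.

```python
from collections import Counter, defaultdict


def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, the step a framework does for you."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducer: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count_mapreduce(lines):
    """Chain the three phases, as a batch job would."""
    return reduce_phase(shuffle_phase(map_phase(lines)))


def word_count_high_level(lines):
    """The same result stated declaratively, closer to what Pig or Hive aim for."""
    return dict(Counter(word for line in lines for word in line.lower().split()))


if __name__ == "__main__":
    sample = ["big data big cluster", "data at scale"]
    print(word_count_mapreduce(sample))
    print(word_count_high_level(sample))
```

Both functions return the same counts. The difference is how much plumbing the author has to write by hand, which is roughly the gap the higher-level languages were added to close.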

(As a digression, it is interesting how many of these transformations and alternatives mirror Google’s own evolution beyond the MapReduce paradigm, with systems like Dremel, Tenzing, PowerDrill and F1.)

Today Hadoop can be evaluated in a new light. It is moving beyond MapReduce’s initial success as a better, batch-oriented way to support ETL. In enterprise architectures crowded with Hadoop-to-database connectors, technologists realize that a new approach is necessary. New Apache™ projects are focused on evolving Hadoop beyond MapReduce and into a formidable Big Data stack.

There’s also a flurry of alternatives introduced to address one or more weaknesses of Hadoop MapReduce. These include the Dremel-inspired Drill project (and Google BigQuery, Google’s hosted service built on Dremel), Spark & Shark from the Berkeley AMPLab BDAS stack, and Storm from Twitter. I’ll get deeper into my experience with these, and others, in upcoming blog posts.

There are many other promising projects beyond the ones mentioned above, each with its own strengths, which is why Hadoop MapReduce is only the beginning of Big Data analytics. But these alternatives still fall short of a Big Data solution that fits the way business users want to work. Business users want to derive actionable, trustworthy insights from diverse data sources, rapidly and repeatably.

At the core of such a solution is a data processing engine with three properties: a flexible data model that can work with diverse data sources at scale; an expressive execution engine that subsumes both batch processing and real-time, interactive processing; and support for advanced operators that does not overwhelm users with complexity. At ClearStory Data, we are building such a solution, and are excited to start sharing the details. Watch this space as we continue to reveal more about what we do and how we do it.