A New Analytic Technology Stack for Scalable, Interactive Analysis

I was thrilled to see the public announcement of Databricks last week and Spark taking off with strong support from Andreessen Horowitz. Spark, for those who haven’t heard of it yet, is an open source cluster computing framework designed to make data analytics fast, with reported performance up to 100x faster than traditional MapReduce in some benchmarks. The technology was started by the Berkeley AMPLab and is now taking off in large production implementations, including places like Yahoo and Airbnb that run analysis at massive scale and speed. Read more about where and why Spark came about.

This move by Matei Zaharia and Ion Stoica to establish Databricks as a well-funded company is an excellent development for Spark, Shark and the rest of the components in the AMPLab BDAS stack. It also signifies the stack’s importance to the future of fast, scalable data processing across big data platforms, from Hadoop to large volumes of data in traditional relational structures and streaming sources.

Here at ClearStory, we’ve been big fans of the AMPLab BDAS stack and this technology for quite some time. We recognize that the future of data processing at scale needs a new processing model, along with a whole new user experience and application model for accessing it. Simply put, Spark is a framework that makes working with large data sets in memory easy. It is expressive, and it offers the right level of abstraction for iterative workloads on large data sets. ClearStory’s underlying platform uses Spark as one component, around which we’ve done a lot of work on optimizers, APIs, job scheduling, query management and such.
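To make the point about iterative workloads concrete, here is a toy sketch in plain Python. It is not Spark’s API and not ClearStory’s code; the names `make_loader` and `fit_mean` are hypothetical and exist only for illustration. It runs the same iterative computation twice, once re-reading the source on every pass and once over a copy held in memory, and counts how often the source is read.

```python
# Toy illustration only: plain Python, not Spark's API.
from typing import Callable, Dict, List, Tuple


def make_loader(values: List[float]) -> Tuple[Callable[[], List[float]], Dict[str, int]]:
    """Return a hypothetical loader plus a counter of how often the source is read."""
    stats = {"reads": 0}

    def load() -> List[float]:
        stats["reads"] += 1  # stands in for an expensive disk or network scan
        return list(values)

    return load, stats


def fit_mean(get_data: Callable[[], List[float]], steps: int = 50, lr: float = 0.5) -> float:
    """Gradient descent on squared error; every step makes one full pass over the data."""
    estimate = 0.0
    for _ in range(steps):
        data = get_data()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= lr * gradient
    return estimate


source = [2.0, 4.0, 6.0, 8.0]

# Re-read style: every iteration goes back to the source.
load, reread_stats = make_loader(source)
reread_result = fit_mean(load)

# In-memory style: read once, then iterate over the cached copy.
load, cached_stats = make_loader(source)
cached = load()
cached_result = fit_mean(lambda: cached)

print(reread_stats["reads"], cached_stats["reads"])
```

Both runs arrive at the same estimate, but the second touches the source only once. On a real cluster the read is the expensive part, which is why keeping the working set in memory matters so much for iterative analysis.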

But for us, Spark is just a starting point. Our users need to bring a multitude of disparate sources together with lightning speed for fast-cycle analysis across lots of diverse data, in a system that can automatically harmonize data for interaction, free of the traditional “business intelligence” pre-modeling approach. That approach just doesn’t allow for fast, iterative analysis across many disparate data sources, and it is fraught with latency in the data pipeline. It turns the path from a “business question” to an “answer” into a long, cumbersome series of data handling steps, and as a result the answer arrives too late.

Spark helps us attain raw speed when it comes to iterative analysis, but it is the components that we have built around Spark that suit it to this new model of analytic interactivity. Perhaps most interestingly, what really differentiates our approach is a unique user application and user model that lets a non-technical user navigate through this analysis, on top of some very powerful platform technology components, of which Spark is one piece.

The platform underlying this new model also has to help the end user navigate their data intelligently, presenting the most relevant data in the best possible format so that end users get accurate insights. For this, we need to completely understand the shape, structure and semantics of the data, and feed that understanding back into the analytic engine for optimization. Here we have developed significant components around another open source project, Storm.

Storm was born out of Twitter for processing massive amounts of streaming data. Storm’s distributed stream processing allows us to easily access many disparate data sources of all types: databases, Hadoop clusters and streaming APIs. We continuously manage these sources so we can make relevant optimizations to query execution and updates at query time. It is just another example of how we’ve rethought what end users really need when accessing and analyzing data that is often fast-changing at the source, derived from many disparate sources, and almost never predictable in advance.

We’re looking forward to sharing more about our technology, application and users in the coming weeks.