ClearStory + Spark = Data Exploration Freedom

The release of Spark 1.0 marks a significant step in the move away from MapReduce-based big data processing. In-memory. Distributed. Scale out. Machine Learning. 100X faster on initial benchmarks, and our Spark-inside solution is evidence of the blazing speed. Data scientists and data engineers are rejoicing, and drooling. For some users and use cases, though, Spark doesn’t yet possess all of the functionality ultimately required. That’s only to be expected. Even relatively distant future versions of Spark are not likely to be a complete solution that many would-be users can pick up and use on its own. Don’t confuse a processing framework with an end-user, enterprise-ready analysis application: with the 1.0 release, and likely for some time beyond, the core Spark technology is far more an open toolkit than an end-use product.

Spark 1.0 and Spark-based Solutions:

Spark 1.0 comes with some noteworthy improvements to its monitoring and management capabilities, configuration and administration scripts, and so on. Even so, any open, distributed big data computing framework is unavoidably going to remain a complex beast. Creating and maintaining a complete, working solution to any particular set of data analytics problems requires a broad range of skills, a number of crucial decisions, and a significant amount of detailed, hands-on work.

Few vendors have fully adopted Spark and invested deeply in IP and applications around it. For those just starting to consider it, there is a typical set of questions to ask, including:

  • How large a cluster do you need, and how should each node be configured?
  • Should you install and run Spark on your own hardware, or should you use a cloud-based system such as EC2?
  • Should you use Mesos or YARN for cluster resource management, or should you use Spark’s standalone mode?
  • Scala, Java or Python?
  • How will your jobs be best expressed using Spark’s API?
  • How will you get the relevant data into your Spark cluster?

And on, and on. There are already a number of options available to ease the use of Spark, and we can certainly expect more in the future. These range from scripts packaged with Spark that make creating and running a basic Spark cluster on EC2 relatively quick and painless, to Spark packaged for deployment and management in the distributions of established Hadoop vendors, to support and training options offered by those vendors or by newer third parties, to some Spark functionality made accessible through legacy data analytics products and services.

ClearStory’s Data Intelligence solution, while deeply integrating and leveraging Spark, insulates the user from the demanding details of installing, running and maintaining a Spark cluster. However, this is just the beginning. From the start, the power of Spark satisfied our requirements for crunching large data sets. This allowed us to focus on our main goal: Intelligent and Automated Disparate Data Harmonization. Intelligent Data Harmonization is how we bring together disparate data sources automatically, without the need for any data modeling, sampling, or tedious, costly rounds of data preparation. With this we open up a completely new user experience for fast data exploration and discovery-based analysis. To make this harmonization intelligent and seamless to the business user, we infer and collect metadata from these data sources. All of this metadata is used to optimize the harmonization and to display the results. And with Spark under the hood, it’s blazingly fast.

Users can now try combinations of data sets that were never imagined before, simply because the traditional method (ETL through IT) takes too long and gives you only so many chances to get it right, after which you hit a wall and accept the limitations. That should not be the case. Now we have Data Exploration Freedom.

The number of Data Islands in enterprises keeps growing, and this growth will accelerate as companies continue to adopt the cloud. Once users could instantly harmonize multiple sources, we needed to build in an intelligent recommendation and data scoring system to guide them to the sources most likely to be useful for a particular analysis. Again, we leverage the rich set of metadata in this process. We also chose to innovate heavily in the area of Data-Aware Collaboration, so you can track how a data exploration unfolded with your colleagues, which leads to telling a fast, interactive story with your data.

Traditionally, data analysts used tools to explain an insight or a set of metrics they had already thought of, presented on a dashboard view. Now we are seeing users collaborate to pick more relevant data from a variety of internal and externally available sources, creating totally new data stories as they explore and discover new insights.
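
To make the idea of metadata-driven harmonization a little more concrete, here is a minimal sketch in plain Python. It is not ClearStory’s implementation, and it runs on small in-memory lists rather than on Spark. The function names (infer_type, column_metadata, score_join_keys, harmonize), the sample data sets, and the value-overlap scoring rule are all assumptions made up for illustration. The sketch infers a coarse type for each column, scores pairs of columns by how much their values overlap, and joins two sources on the best-scoring pair without anyone modeling the data first.

```python
from itertools import product


def infer_type(values):
    """Guess a coarse column type from sample values (hypothetical rule)."""
    def is_number(v):
        try:
            float(v)
            return True
        except (TypeError, ValueError):
            return False
    return "number" if all(is_number(v) for v in values) else "text"


def column_metadata(rows):
    """Collect simple per-column metadata from a list of dict records."""
    meta = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        meta[name] = {"type": infer_type(values), "values": set(values)}
    return meta


def score_join_keys(left_meta, right_meta):
    """Rank same-type column pairs by value overlap (Jaccard similarity)."""
    scores = []
    for (l_name, l), (r_name, r) in product(left_meta.items(), right_meta.items()):
        if l["type"] != r["type"]:
            continue
        union = l["values"] | r["values"]
        overlap = len(l["values"] & r["values"]) / len(union) if union else 0.0
        scores.append((overlap, l_name, r_name))
    return sorted(scores, reverse=True)


def harmonize(left, right):
    """Join two record lists on the best-scoring inferred key pair."""
    ranked = score_join_keys(column_metadata(left), column_metadata(right))
    if not ranked or ranked[0][0] == 0.0:
        return []  # no plausible shared key was found
    _, l_key, r_key = ranked[0]
    index = {}
    for row in right:
        index.setdefault(row[r_key], []).append(row)
    return [{**l_row, **r_row}
            for l_row in left
            for r_row in index.get(l_row[l_key], [])]


# Two made-up sources that share a key under different column names.
sales = [
    {"store": "S1", "revenue": "1200"},
    {"store": "S2", "revenue": "950"},
]
weather = [
    {"location": "S1", "rain_mm": "3.5"},
    {"location": "S2", "rain_mm": "0.0"},
]
combined = harmonize(sales, weather)
```

In this toy case the store and location columns overlap completely, so they are picked as the join key and each sales record gains its matching weather fields. A real system would need far richer metadata and scoring than a single overlap measure.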

Spark is here to stay, and we can expect great things of it now and in the future. However, that doesn’t mean that every question in big data analytics is now, or inevitably will be, answered by Spark processing alone. It’s exciting to be on the cutting edge of this new era and to be building the next generation of analysis technology to bring data exploration freedom to everyone.