Spark Speeds Towards the Next Data Processing Revolution

A mighty flame followeth a tiny spark. — Dante Alighieri

If you know anything about Apache Spark, you know that its chief claim to fame is speed. With in-memory processing, Spark promises ten- or hundred-fold improvements or more in data processing times over traditional MapReduce. Spark is also more flexible, supporting a wider variety of workloads than traditional systems. But “faster” and “more flexible” don’t do justice to the data processing sea change enabled by Spark.

First, let’s examine speed. When I first got on the Internet it was through a 300-baud modem. When 1200 baud came along, life got better, and when I upgraded to 2400 baud I thought I had died and gone to heaven. But those changes were incremental. The emails I was sending and files I was transferring went faster, but my horizons had not broadened much. When broadband arrived, it was a different story. The quantum improvement in speed it brought, along with the fact that it was always on, enabled vast new application areas that simply did not exist before. Video, gaming, streaming music, real-time communication, video conferencing and more suddenly took the Internet by storm. These applications in turn enabled and attracted a new class of Internet user, and the modern Internet era was born. Nothing as significant followed until the recent mobile explosion.

We stand on the brink of a similar revolution in data-processing applications, all driven by the quantum improvement in speed enabled by Spark. Suddenly the world of data analytics can be interactive, rather than batch-oriented. Combine interactivity with more intelligent software and more user-friendly tools, and a new class of front-office business users is coming to data analytics. Brand managers, ad campaign managers, marketing managers, and executives of all types can now get fast insights into their business directly.

It’s hard to overstate the importance of this change. Before Spark, business users had to rely on IT for their insights. They didn’t have the skills required and, just as important, they didn’t have the time to spend wrangling data, writing queries and waiting for results. The typical time to answer a question was measured in days, and only after all the required systems were set up. That first-time setup could easily take months. While Spark does not solve all the problems of data setup, it does enable business users to ask questions and get answers interactively, without days of waiting and people-wrangling. This is huge. For the first time, business users have the power to answer their own questions based on data analysis.

Spark fundamentally changes the types of questions that can be asked. As the speed of business keeps accelerating, the value of knowing what happened pales in comparison to the value of knowing what is happening. Right now. Imagine the power of knowing how your ad campaign is performing this very minute; of knowing where sales are spiking or tanking every day; of knowing where inventory is low or high as sales are made; of knowing which patients are responding to care and which are not. All right now.

Spark enables these kinds of insights through the combination of its in-memory, lightning-fast processing and its ability to stream in new data in real time. Spark Streaming enables a new class of real-time data applications that open up a whole new world of possibilities. Real-time sales data, Internet of Things data, campaign-performance data and more, all streamed into Spark and to the screens of business users at the speed of Spark.

It’s clear Spark is aptly named, as it is sparking the next revolution in data analytics and fast-cycle big data processing. Like bandwidth, Spark is an enabler of great things to come. It makes the things we’re already doing faster, but more importantly it opens up new vistas that were only imagined before. It will take a lot of work and application development to get there, but the spark has been set. The fire is sure to follow.

Read the original guest post on IBM Big Data Hub, and follow @ibmanalytics on Twitter.