On the Meaning of Spark 1.0

The Spark 1.0 release marks some significant changes, and it is worth taking some time to consider just what this release means for users and developers of Spark and what they can now expect from the project.

The answer to that question has clearly changed over time, and not all of that change is completely new with Spark 1.0. Rather, the project has evolved through several important stages in its short but very active history. Spark has steadily progressed from its beginnings as a project internal to the AMPLab, to an open-source project with a rapidly growing number of contributors, through a successful transition to an Apache Incubator project while retaining the vibrancy of its developer community, and finally to a top-level Apache project after quickly demonstrating sufficient maturity in its development process to graduate.

Along the way, the larger Spark project has gained a meetup group that meets regularly in the San Francisco Bay Area. That group has become so popular that it may soon need to be split into more than one group to accommodate everyone who wants to attend, and similar user groups are forming in several other cities across the globe. Spark has established a presence at the Strata conferences with well-attended presentations and tutorial sessions, and it has been a key element of multiple AMP Camps that introduced large numbers of developers to Spark and other work being done at the AMPLab. It has produced its own dedicated technical conference, Spark Summit, which will be in its second iteration this week and now has the support of Databricks, the company created by the founders of Spark and faculty from the AMPLab who have been involved with Spark from its earliest days. In short, even before any of the specific technical developments that are new in its 1.0 release, Spark has matured into an established and reliable software platform with a strong development community and fast-growing recognition from enterprise users.

Now with the 1.0 release, Spark’s developers have committed to uphold important new expectations. Throughout its history, Spark’s developers have demonstrated admirable discipline in maintaining a consistent API.  As of 1.0, Spark’s core API is no longer just a matter of good developer intentions and practices, but instead is now formally guaranteed to adhere to semantic versioning. With the embrace of this established practice in version numbering, developers building on a particular major version of the Spark platform can reliably know that their code will continue to run unchanged with any subsequent maintenance/bugfix releases or with new minor releases that introduce new API calls in addition to bug fixes. While these are presently source-code level guarantees, Spark’s developers are also committed to strive for binary-level compatibility among future versions of Spark’s core.
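To make the versioning promise concrete, here is a minimal sketch of the rule that semantic versioning implies for code built against a given release. It is an illustration only, not anything shipped with Spark: the names `Version`, `parse_version` and `is_compatible` are hypothetical, and the sketch assumes plain MAJOR.MINOR.PATCH version strings with no pre-release suffixes.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """A MAJOR.MINOR.PATCH version number (hypothetical helper type)."""
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Parse a string such as "1.0.2" into a Version.

    Assumes exactly three dot-separated integers; suffixes such as
    "-SNAPSHOT" are not handled in this sketch.
    """
    major, minor, patch = (int(part) for part in text.split("."))
    return Version(major, minor, patch)


def is_compatible(built_against: str, running: str) -> bool:
    """Return True if code built against one release should run on another.

    Under semantic versioning, later minor and patch releases within the
    same major version only add to the API or fix bugs, so the running
    version must share the major number and must not be older than the
    version the code was built against.
    """
    built = parse_version(built_against)
    target = parse_version(running)
    # NamedTuple instances compare field by field, like ordinary tuples.
    return target.major == built.major and target >= built
```

Under these assumptions, code written against 1.0.0 would be expected to keep working on 1.0.1 or 1.1.0, while a move to 2.0.0 carries no such promise. The check is also one-directional: code written against 1.1.0 may rely on API calls that a 1.0.x release does not have.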

There are large areas of code outside of Spark’s core platform. While these are now clearly marked as also being outside of the new compatibility guarantees, their presence in the 1.0 release not only gives users an early opportunity to try them in their present state, but also provides a broad indication of what new functionality to expect under Spark’s semantic versioning guarantees in future releases: SQL-based querying of structured data using Spark SQL, graph-based analytics using GraphX, and a large standard library of machine-learning algorithms and utilities in MLlib.

It is easy to enumerate specific technical improvements present in Spark 1.0 or on its roadmap; but more than any of those specifics, it is the robust maturity of Spark’s development process and infrastructure that is most noteworthy and gives new meaning to the project with this release. That evident maturity comes at an inflection point in Spark’s evolution where the key questions to be asked are now clearly different than they were only a short time ago. For example, where not too long ago commentators on the data analytics space could sound reasonably well-informed by posing the question, ‘Does Spark have a future?’, the more informed question now must be at least some form of, ‘Just how big will Spark’s future be?’

Many of the questions about Spark have changed, but there are still questions that must be answered to make effective use of Spark. At ClearStory we have already spent a considerable amount of time and effort answering those questions while deeply integrating Spark into our Data Intelligence solution with great performance results. So while this point in Spark’s progress has been an appropriate one at which to pause and reflect on what has changed for Spark at its 1.0 release, we’ll also have a lot more to say in future posts about how we use Spark.

Please join us this Monday and Tuesday (June 30th & July 1st, 2014) at the Spark Summit in San Francisco. ClearStory Data will be at Booth #8. Come on by and see more.