This article was originally posted by Edd Dumbill of Silicon Valley Data Science and re-posted here with Edd’s permission. We thought it would be a great complement to these other StampedeCon presentations:
- Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Engine by Richard Williamson (Silicon Valley Data Science)
- Lifting the hood on Spark Streaming by Andrew Psaltis (Shutterstock)
- How Cisco Migrated from MapReduce Jobs to Spark Jobs by Ken Owens (Cisco)
- Workshop: Deep Dive into Apache Cassandra & Apache Spark by Jon Haddad (Datastax)
The Apache Spark big data processing platform has been making waves in the data world, and for good reason. Building on the progress made by Hadoop, Spark brings interactive performance, streaming analytics, and machine learning capabilities to a wide audience. Spark also offers a more developer-friendly and integrated platform to a field in which self-assembly of components has been the norm.
As an upcoming contender for big data computation, Spark has many advantages, notably speed and developer convenience. But how do you know whether Spark can help you? Here are some common use cases that we’ve seen in the field at Silicon Valley Data Science, working hands-on with Spark.
Streaming ingest and analytics
Spark isn’t the first big data tool for handling streaming ingest, but it is the first to integrate streaming with the rest of the analytic environment. You can use the same code for streaming analytics as for batch, and use Spark to compute over both the live stream and historical data. This increases the productivity, consistency, and maintainability of analytic procedures. Spark also plays well with the rest of the streaming-data ecosystem, supporting data sources including Flume, Kafka, ZeroMQ, and HDFS.
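As a minimal sketch of this shared-code idea (assuming a 2015-era PySpark deployment; the log format, HDFS path, and socket source are hypothetical, invented for illustration), the same counting function can drive both a batch job and a streaming job:

```python
from operator import add

def level_of(line):
    """Extract the severity token from a log line, e.g. 'ERROR disk full' -> 'ERROR'."""
    return line.split(" ", 1)[0]

def count_by_level(lines):
    """Count lines per severity level. Works identically on a batch RDD of
    historical data or on each micro-batch RDD of a stream."""
    return lines.map(lambda l: (level_of(l), 1)).reduceByKey(add)

def main():
    # pyspark is imported here so the sketch can be read without a cluster
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="shared-batch-and-stream")

    # Batch: apply the counting logic to historical data at rest.
    historical = count_by_level(sc.textFile("hdfs:///logs/events.log"))
    print(historical.collect())

    # Streaming: apply the *same* function to each 10-second micro-batch.
    ssc = StreamingContext(sc, 10)
    live = ssc.socketTextStream("localhost", 9999).transform(count_by_level)
    live.pprint()
    ssc.start()
    ssc.awaitTermination()

# In a real deployment, a spark-submit driver script would call main().
```

The point is that `count_by_level` is written once: streaming is not a separate codebase bolted onto the side, but the same logic applied on a different cadence.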
Exploratory analytics
One of the headline benefits of using Spark is that you no longer need to maintain separate environments for exploratory and production work. The relatively long execution times of a Hadoop MapReduce job make hands-on exploration of data difficult: data scientists typically still must sample data if they want to move quickly. Thanks to the speed of Spark’s in-memory capabilities, interactive exploration can now happen entirely within Spark, with no need for Java engineering or sampling of the data. Whether the development language of choice is SQL, R, Python, or Scala, Spark provides an interface for each. Platforms such as Databricks Cloud and Apache Zeppelin aim to bring this power to a browser-based interface, similar to the IPython Notebook beloved of data scientists. Spark will even run on a data scientist’s laptop, so the same tooling applies to those “small data” use cases as well.
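To make this concrete, here is a sketch of what such an interactive session might look like in the pyspark shell. The weblog format, field names, and path are invented for illustration; only the shape of the workflow is the point:

```python
def parse_visit(line):
    """Parse a hypothetical comma-separated 'user,page,seconds' weblog record."""
    user, page, seconds = line.split(",")
    return (user, page, int(seconds))

# In the pyspark shell, the SparkContext `sc` is predefined, so exploration
# over the *full* dataset is just a few lines -- no sampling, no Java:
#
#   visits = sc.textFile("hdfs:///weblogs/2015/*.log").map(parse_visit)
#   visits.count()                                  # whole dataset
#   visits.filter(lambda v: v[2] > 300).take(5)     # inspect long visits inline
#
# The same lines run unchanged on a laptop against a local file path.
```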
Model building and machine learning
Spark’s status as a big data tool that data scientists find easy to use makes it ideal for building models for analytical purposes. In a pre-Spark world, big data modelers typically built their models in a language such as R or SAS, then handed them over to data engineers to re-implement in Java for production on Hadoop. That workflow limits efficiency, lengthens iteration time, and invites translation errors. With Spark, the same platform serves for both model building and deployment, making the process far more efficient and giving data scientists hands-on insight into model performance.
Through its fast in-memory computing and its incorporation of MLlib, Spark is ideally suited to machine learning applications. Beyond raw performance, Spark brings a reliable library of algorithms, consistent access to data in a unified platform, and the ability to work interactively through the Spark shell.
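As an illustration of training a model where the data already lives, here is a sketch using MLlib’s RDD-based clustering API (current as of Spark 1.x). The data path and the flat CSV-of-numbers feature format are assumptions for the example:

```python
def to_features(line):
    """Turn a CSV row of numbers into a feature vector (list of floats)."""
    return [float(x) for x in line.split(",")]

def main():
    # pyspark is imported here so the sketch can be read without a cluster
    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext(appName="mllib-sketch")
    points = sc.textFile("hdfs:///data/points.csv").map(to_features)

    # Train inside the same platform that holds the data --
    # no hand-off to a separate production re-implementation.
    model = KMeans.train(points, k=3, maxIterations=20)
    print(model.clusterCenters)

# In a real deployment, a spark-submit driver script would call main().
```

Because the trained model lives in the same environment as the data pipeline, scoring new records is just another Spark transformation rather than a second system.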
Graph computation
By incorporating the GraphX component, Spark brings all the benefits of its environment to graph computation, enabling use cases such as social network analysis, fraud detection, and recommendations. While there are dedicated graph databases and processing systems to choose from, building out a complete computation pipeline with them typically involves stitching together multiple systems. Spark’s integrated platform brings flexibility and resilience to graph computing, along with the ability to work with graph and non-graph data sources alike.
Simpler, faster ETL
Though less glamorous than the analytical applications, ETL is often the lion’s share of data workloads. The speed and concise-code advantages of Spark apply to this domain as well, eliminating the need for multiple Hadoop MapReduce jobs that entail a large amount of slow disk access. If the rest of your data pipeline is based on Spark, then the benefits of using Spark for ETL are obvious, with consequent increases in maintainability and code-reuse.
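A sketch of such an ETL job (the paths and the user/amount record format are invented for illustration) shows how extraction, cleaning, and loading collapse into one concise program:

```python
def clean(record):
    """Normalize a raw (user, amount) pair: trimmed lowercase user,
    amount converted from a dollar string to integer cents."""
    user, amount = record
    return (user.strip().lower(), int(round(float(amount) * 100)))

def main():
    # pyspark is imported here so the sketch can be read without a cluster
    from pyspark import SparkContext
    sc = SparkContext(appName="etl-sketch")

    raw = sc.textFile("hdfs:///raw/transactions/*.csv")
    cleaned = (raw.map(lambda line: tuple(line.split(",")[:2]))
                  .map(clean))
    cleaned.saveAsTextFile("hdfs:///clean/transactions")

# In a real deployment, a spark-submit driver script would call main().
```

Because the whole pipeline is a single Spark job, the intermediate results of each transformation can stay in memory rather than being spilled to disk between chained MapReduce stages.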
Apache Spark is a significant step forward for big data technology. While maintaining compatibility with the extended big data universe, it successfully integrates the essential technologies required in a data analytics pipeline. This means we can do away with the complexity of multiple specialized systems with brittle interconnects and little reusability. In addition to this integration, Spark’s developer-friendly nature, interactive performance, and fault tolerance give it key advantages in productivity, maintainability, and operational expense.
While Spark is not a panacea—as a young technology, it is evolving quickly and has rough edges and areas of incompleteness—it is the obvious candidate for an industry-standard big data platform.
My thanks to John Akred for collaborating in writing this article.