What is Apache Spark Used For?

June 18, 2015


This article was originally posted by Edd Dumbill of Silicon Valley Data Science and re-posted here with Edd’s permission.  We thought it would be a great complement to these other StampedeCon presentations:

The Apache Spark big data processing platform has been making waves in the data world, and for good reason. Building on the progress made by Hadoop, Spark brings interactive performance, streaming analytics, and machine learning capabilities to a wide audience. Spark also offers a more developer-friendly and integrated platform to a field in which self-assembly of components has been the norm.

As an upcoming contender for big data computation, Spark has many advantages, notably speed and developer convenience. But how do you know whether Spark can help you? Here are some common use cases that we’ve seen in the field at Silicon Valley Data Science, working hands-on with Spark.

Solving Large-scale Offline Data Ingestion Challenges

May 14, 2014

New mobile, social, sensor and click-stream data from consumers, the Internet of Things and even the “Enterprise of Things” is creating opportunities but also challenges for organizations that have decided to try to harness this data’s potential. One such challenge is large-scale offline data ingestion.

At StampedeCon 2014, Murtaza Doctor, Principal Architect at RichRelevance, will describe a self-service data ingestion platform his team has built, which provides schema discovery, validation and error reporting as well as integration with Hive for ELT processes.

Will big data allow the right to be forgotten?

May 13, 2014

In a decision that will have implications for big data usage, the European Union Court of Justice has ruled that users have the right to request that Google remove search results that contain private or otherwise sensitive data. The search engine giant will be obliged to remove the data, “…unless there are particular reasons not to, such as the result and its data is in the public interest”; however, how that would be determined was not addressed.

Scaling R with Hive via Pluggable Query Generation

May 12, 2014

Apache Hive is a good tool for performing ETL and basic analytics but is limited in statistical analysis and data exploration capabilities. R, on the other hand, has become a preferred language for analytics, as it offers a wide variety of statistical and graphical packages. The downside is that R is single-threaded and memory-intensive, which makes it difficult to work with data at scale.

At StampedeCon 2014 on May 29-30 in St. Louis, RichRelevance’s Senior Software Engineer Sukhendu Chakraborty will discuss the R to Hive connector package they’ve developed to meet their need to use R feasibly on terabytes of data while keeping the R code independent of the data source.

A number of existing open-source packages, such as RHive, RHadoop and RHIPE, address R’s scalability to different extents. And, with some work, there are also ways to link R to optimized multi-threaded libraries. In a detailed overview of RichRelevance’s approach, Chakraborty cites several advantages and differentiators of having an R to Hive connector.

Spark: The Leading Candidate to Replace Hadoop MapReduce?

May 12, 2014

Is Apache Spark the next big thing? According to Stephen Borrelli, Founder of Asteris (a next-gen infrastructure startup built on Spark+Mesos+Docker), Spark has been called the leading candidate to replace Hadoop MapReduce.

“Apache Spark uses fast in-memory processing and a simpler programming model to speed up analytics and has become one of the hottest technologies in Big Data,” says Borrelli.

In his talk at StampedeCon 2014 (May 29-30 in St. Louis), he’ll discuss:

  • What is Apache Spark and what is it good for?
  • Spark’s Resilient Distributed Datasets
  • Spark integration with Hadoop, Hive and other tools
  • Real-time processing using Spark Streaming
  • The Spark shell and API
  • Machine Learning and Graph processing on Spark

Check out the StampedeCon 2014 presentations for more details on Stephen Borrelli’s Spark presentation as well as talks on Storm streaming data analytics, Apache Crunch, Cloudera’s Impala and more.

Piloting Big Data: Where To Start?

May 12, 2014

You know you have data, perhaps in various data silos. You may even feel that there is more useful data that you could be collecting. You know there are problems it can solve. But how do you bridge the gaps in between? Even after experimenting with Big Data tools, it may still be unclear how to map your current data architecture and analytics processes to the capabilities available in current Big Data tools.

At StampedeCon 2014 in St. Louis, John Akred, CTO of Silicon Valley Data Science and former Big Data R&D Lead at Accenture, will describe a road-tested strategy for moving methodically from a business problem, through an understanding of data requirements and technical capabilities, to a targeted roadmap for your Big Data pilot project.

Join us at StampedeCon 2014 to hear John’s talk and a number of others designed to help you determine how to best leverage Big Data within your organization.

Opening Image Source: https://www.flickr.com/photos/opensourceway/5265955179/