StampedeCon 2014 Presentations

StampedeCon 2014 was May 29-30.

View some of the StampedeCon 2014 slides at

Piloting Big Data: Where To Start?
John Akred
You know you have data. You know there are problems it can solve. But how do you bridge the gaps in between? We’ll describe a road-tested strategy for methodically moving from business problem, through an understanding of data requirements and technical capabilities, through to creating a targeted roadmap for your pilot project.

Apache Drill: Self Service SQL for Big Data
Jim Bates
SQL is one of the most widely used languages to access, analyze, and manipulate structured data. As Hadoop gains traction within enterprise data architectures across industries, the need for SQL for both structured and loosely-structured data on Hadoop is growing rapidly. Apache Drill was started with the audacious goal of delivering consistent, millisecond ANSI SQL query capability across a wide range of data formats, without the DBA ever having to build or maintain schemas. This session provides a quick introduction to Drill and showcases some interesting Big Data queries possible only with Drill.

Apache Spark: the next big thing?
Steven Borrelli
It’s been called the leading candidate to replace Hadoop MapReduce. Apache Spark uses fast in-memory processing and a simpler programming model to speed up analytics and has become one of the hottest technologies in Big Data.

In this talk we’ll discuss:

  • What is Apache Spark and what is it good for?
  • Spark’s Resilient Distributed Datasets
  • Spark integration with Hadoop, Hive and other tools
  • Real-time processing using Spark Streaming
  • The Spark shell and API
  • Machine Learning and Graph processing on Spark

Big Data: Infrastructure Implications for “The Enterprise of Things”
Jean-Luc Chatelain
The amount of data in our world has been exploding, and storing and analyzing large data sets—so-called big data—will become a key basis of competition for the new “Enterprise of Things”, underpinning fresh waves of productivity growth, innovation, and consumer surplus. Leaders in every sector – from government to healthcare to finance – will have to grapple with the implications of big data, as data growth continues unabated for the foreseeable future. The quest to make sense of all this big data begins with breaking down data silos within organizations using the cost appropriate, shared infrastructure to ensure optimal extraction and analysis of data, knowledge and insight.

As the leading global e-commerce service, PayPal has transformed the way the company leverages big data storage and hyper-scale analytics to help improve both the safety and purchasing experiences of its online customers. In this discussion, using real-world customer examples such as PayPal, we will explore what Big Data Storage is from high performance file sharing to long-term archiving, as well as ways to break down data silos to reduce the cost and storage complexity of managing demanding workflows and data environments. We will demonstrate how hyperscale storage can enable near-real-time, stream analytics processing for behavioral and situational modeling, as well as for fraud detection, marketing and systems intelligence. We will ask what the greatest barriers to effective business analytics are and how today’s data analytics platforms, including Hadoop, Vertica, Python and Java, can be optimized to enable machine learning, event streaming, forecasting, and reduce overhead associated with human intervention. You’ll come away from this session understanding the infrastructure implications and options for organizations looking to maximize their big data for competitive advantage.

Big Data Analytics made easy using Apache Hive to R Connector
Sukhendu Chakraborty
As the leading omni-channel personalization provider, RichRelevance fully harnesses the power of Hadoop to handle petabytes of data coming from both online (clickstream) and offline (e.g. in-store) sources. Given this wealth of customer data at RichRelevance, omni-channel data integration and analytics are critical. One of the major challenges is to consolidate online, mobile, social, and other data sources to create a single view of users for making more insightful decisions.

Our use cases require clickstream analytics that leverage Apache Hive & R. Apache Hive is a good tool for performing ELT and basic analytics, but is limited in statistical analysis and data exploration capabilities. R, on the other hand, has become a preferred language for analytics, as it offers a wide variety of statistical and graphical packages. The downside is that R is single-threaded and memory-intensive, making it impossible to work with data at scale.

Through a series of use cases, we will present how our version of the R to Hive connector allows us to bridge the gap between R and Hive and make big data analysis using R on terabytes of data feasible. This framework takes us a step closer to the notion of a “one solution fits all” principle where we are no longer restricted by a single compute mechanism. It is our attempt to bring the two worlds closer, such that the data source is agnostic to the tools which are used to access it.

How we solved large-scale offline data ingestion problem to power Omni-channel use-cases for Retailers
Murtaza Doctor
Offline data ingestion is an essential ingredient of every platform and the first step in solving a use case. By simplifying the process of ingesting data feeds (ranging from product catalogs, point of sale, in-store sensor, social and many more forms of structured & semi-structured data) into the platform, we bring retailers one step closer to fulfilling their omni-channel vision.

We built our component with self-service in mind, in order to give control to the user. Our data ingestion component performs schema discovery, validation and error reporting, so that a user can push any file to the platform without defining a schema. The entire process is handled by launching custom workflows on the platform. We don’t stop there: we can convert the file into serialization formats such as Avro, then create a Hive table where data can be further cleansed and transformed (ELT) using HiveQL.

We’ll discuss the entire architecture, design and technology choices, roadblocks, and lessons learned along the way: how the entire platform was built on open source technologies in the big data ecosystem, and how it has accelerated the business.
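To make the schema discovery, validation and error reporting step concrete, here is a minimal sketch in Python. It is not the presenters' implementation: the function names, the three-type hierarchy, and the sample feed are all assumptions made for illustration, and it handles only CSV text with a header row.

```python
import csv
import io

def infer_type(value):
    """Return the narrowest type name that can represent a single cell."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return name
        except ValueError:
            pass
    return "string"

# Widening order: a column is only as narrow as its widest cell.
_RANK = {"int": 0, "float": 1, "string": 2}

def discover_schema(text):
    """Infer column types from CSV text that starts with a header row.

    Returns (schema, errors): schema maps column name -> type name,
    errors lists rows whose field count does not match the header.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    schema = {name: "int" for name in header}
    errors = []
    for line_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            errors.append(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}"
            )
            continue  # malformed rows are reported, not used for inference
        for name, cell in zip(header, row):
            found = infer_type(cell)
            if _RANK[found] > _RANK[schema[name]]:
                schema[name] = found
    return schema, errors

# Hypothetical product feed; the last row is missing a field.
sample = "sku,price,qty\nA1,9.99,3\nB2,12,1\nC3,4.50\n"
schema, errors = discover_schema(sample)
print(schema)
print(errors)
```

In a pipeline like the one described, the inferred schema would then drive the conversion to a serialization format and the Hive table definition; that part is left out here.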

The Emergence of Smart Cities & Infinite Data
Paul Doherty
Cities are in the midst of a seismic shift that is staring right at them, yet the subtlety of this shift is very powerful due to its transparency. This is all about the “Internet of Things,” where trillions of connected devices will bring significant change and provide untold opportunities and tremendous challenges for business, government and you. Join us for an exploration of the latest implementations and cutting edge technologies that are generating an infinite amount of data in our urban environments and how Smart Cities are successfully capitalizing on this opportunity. Examples from Shanghai, Singapore, Abu Dhabi, Doha, Dubai, Jeddah, London and Kansas City will be showcased to provide a roadmap for you to move forward in this greatest of ages, the age of the Smart City.

HBase architecture and use cases
Sameer Farooqui
This is a fast-paced, vendor-agnostic technical overview of HBase, the Hadoop database. The talk is targeted towards technical people with previous relational database experience who want to understand the fundamentals of data modeling, querying and scaling a NoSQL database. The talk will cover the original use case at Google for BigTable, the read and write paths for HBase and how HBase scales using regions and region servers. Additionally, overview concepts for data modeling and ideal use cases will also be discussed.

Enabling Key Business Advantage from Big Data through Advanced Ingest Processing
Ronald Indeck
All too often we see critical data dumped into a “Data Lake” causing the data waters to stagnate and become a “Data Swamp”. We have found that many data transformation, quality, and security processes can be addressed a priori on ingest to enhance goodness and improve accessibility to the data. Data can still be stored in raw form if desired but this processing on ingest can unlock operational effectiveness and competitive advantage by integrating fresh and historical data and enable the full potential of the data. We will discuss the underpinnings of stream processing engines, review several relevant business use cases, and discuss future applications.

Big Data Governance: A Foundation for Moving Big Data Projects from Experiment to Production
Kenneth Jacquier
This session is geared towards attendees who want stronger Line of Business support to optimize Big Data in their company. Business interest in leveraging all your Data has increased through a few successful pilots. Let’s discuss how to build a culture that infuses analytics everywhere. We will explore how to leverage Big Data platforms to deliver business outcomes and spawn new business models. A strong focus will be placed on the role of Information Governance in delivering Big Data solutions that business leaders trust. What is required from your unstructured data governance investments to provide context and trust in your expanding Big Data ecosystem? In addition, what new Big Data governance capabilities need to be added to enable sustainable Business Agility while Mitigating Risk? We will explore these critical areas and review how other organizations have successfully made this transition.

Which database can handle a 3TB multidimensional array?
Alice Liang
At the Climate Corporation, we have a great demand for storing large amounts of raster-based data, and an even greater demand to retrieve small amounts of it quickly. We have a lot of weather data, elevation data, satellite imagery, and many other kinds of data. Doc Brown is our distributed, immutable, versioned database for storing big multidimensional data, and it is one of the core systems in production.

In this talk, we’ll take a look at the library design and features and at the engineering challenges of managing versions, indexes, and caching while keeping data fast to ingest and to query. We will also walk through an example of how to do quick development and validation of data.

Managing Genomes At Scale: What We Learned
Rob Long
Monsanto generates large amounts of genomic sequence data every year. Agronomists and other scientists use this data as input for predictive analytics to aid breeding and the discovery of new traits such as disease or drought resistance. In order to enable the broadest use possible of this valuable data, scientists would like to query genomic data by species, chromosome, position, and myriad other categories. We present our solutions to these problems, as realized on top of HBase here at Monsanto. We will be discussing our particular learnings around: flat/wide vs tall/narrow HBase schema design, preprocessing and caching windows of data for use in web based visualizations, approaches to complex multi-join queries across deep data sets, and distributed indexing via SolrCloud.
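The flat/wide versus tall/narrow trade-off can be sketched without an HBase cluster. The Python below only builds the (row key, column, value) triples each layout would store; the key format, the `d` column family, and the sample observations are assumptions for illustration, not the schema presented in the talk. Positions are zero-padded because HBase orders row keys lexicographically by bytes.

```python
def tall_narrow_cell(species, chromosome, position, base):
    """One row per position: every coordinate lives in the row key."""
    row_key = f"{species}|{chromosome}|{position:010d}"
    return row_key, "d:base", base

def flat_wide_cell(species, chromosome, position, base):
    """One row per chromosome: the position becomes the column qualifier."""
    row_key = f"{species}|{chromosome}"
    return row_key, f"d:{position:010d}", base

# Made-up observations: (species, chromosome, position, base).
observations = [
    ("maize", "chr1", 1500, "A"),
    ("maize", "chr1", 20, "G"),
    ("maize", "chr2", 7, "T"),
]

# Sorting the triples mimics the order in which a scan would return them.
tall = sorted(tall_narrow_cell(*o) for o in observations)
wide = sorted(flat_wide_cell(*o) for o in observations)

for cell in tall:
    print("tall:", cell)
for cell in wide:
    print("wide:", cell)
```

In the tall/narrow layout a range of positions is a contiguous row scan, while the flat/wide layout fetches a whole chromosome with a single row read at the cost of very wide rows.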

A Picture of Cassandra in the Real World
Nate McCall
In this presentation, we’ll look past the media and marketing hype surrounding Cassandra to provide useful information taken directly from a number of real world use cases.

We’ll discuss:

  • systems architectures and methods of deployment
  • performance characteristics of common workloads
  • operations tasks required for keeping clusters healthy
  • solutions to backup and recovery
  • how to fit Cassandra into common software development processes

Presented by a long-time Cassandra user and community member who led the development of the original Java driver back in the fall of 2009, this talk will be thorough, accurate, and informative. Attendees will come away with a better understanding of how to best leverage the power and operational characteristics of Cassandra in their architectures.

Datomic – A Modern Database
Alex Miller
Datomic is a distributed database designed to run on next-generation cloud architectures. Datomic stores facts and retractions using a flexible schema, consistent transactions, and a logic-based query language. The focus on facts over time gives you the ability to look at the state of the database at any point in time and traverse your transactional data in many ways.

We’ll take a tour of the Datomic data model, transactions, query language, and architecture to highlight some of the unique attributes of Datomic and why it is an ideal modern database.
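The idea of storing facts and retractions, and reading the database as of a point in time, can be illustrated with a toy model in Python. This is not Datomic's API or storage format: `Datom`, `FactLog`, `transact`, and `as_of` are hypothetical names for this sketch, and it assumes every attribute holds a single value per entity.

```python
from collections import namedtuple

# One datom: an assertion (added=True) or retraction (added=False) of an
# attribute value for an entity, stamped with a transaction number.
Datom = namedtuple("Datom", "entity attribute value tx added")

class FactLog:
    """Append-only log of facts; nothing is ever updated in place."""

    def __init__(self):
        self._log = []
        self._tx = 0

    def transact(self, changes):
        """Record a list of (op, entity, attribute, value) as one transaction."""
        self._tx += 1
        for op, entity, attribute, value in changes:
            self._log.append(
                Datom(entity, attribute, value, self._tx, op == "add")
            )
        return self._tx

    def as_of(self, tx):
        """Return {(entity, attribute): value} as it stood after transaction tx."""
        state = {}
        for d in self._log:
            if d.tx > tx:
                break  # the log is ordered by transaction number
            key = (d.entity, d.attribute)
            if d.added:
                state[key] = d.value
            elif state.get(key) == d.value:
                del state[key]
        return state

log = FactLog()
t1 = log.transact([("add", "alice", "email", "a@old.example")])
t2 = log.transact([
    ("retract", "alice", "email", "a@old.example"),
    ("add", "alice", "email", "a@new.example"),
])
print(log.as_of(t1))  # the old address is still visible at t1
print(log.as_of(t2))
```

Because the log is never rewritten, the earlier state remains queryable after the change, which is the property the abstract describes.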

Beyond a Big Data Pilot: Building a Production Data Infrastructure
Stephen O’Sullivan
Creating a data architecture involves many moving parts. By examining the data value chain, from ingestion through to analytics, we will explain how the various parts of the Hadoop and big data ecosystem fit together to support batch, interactive and realtime analytical workloads.

By tracing the flow of data from source to output, we’ll explore the options and considerations for components, including data acquisition, ingestion, storage, data services, analytics and data management. Most importantly, we’ll leave you with a framework for understanding these options and making choices.

Big Data Past, Present and Future – Where are we Headed?
Rob Peglar
Rob Peglar was one of the speakers at the very first StampedeCon. Following that talk two years ago, Rob will present an overview of and insight into the technologies and system approaches to computing, transport and storage of big data – where we’ve been, are now and are headed. There is a major ‘fork in the road’ upcoming in the treatment and business application of big data and the technology that surrounds it, one that is important enough to change the course of the methodologies and approaches used by large and small business alike, especially for the infrastructure required either on premise or in the cloud.

Big Data Ethics
Neil M. Richards
Jonathan King (co-presenter)
Big data, broadly defined, is producing increased powers of institutional awareness and power that require the development of a Big Data Ethics. We are building a new digital society, and the values we build or fail to build into our new digital structures will define us. Critically, if we fail to balance the human values that we care about, like privacy, confidentiality, transparency, identity and free choice with the compelling uses of big data, our Big Data Society risks abandoning these values for the sake of innovation and expediency. First, we’ll trace the origins and rapid growth of the Information Revolution. Second, we’ll call for the development of a “Big Data Ethics,” a set of four related principles that should govern data flows in our information society, and inform the establishment of big data norms such as confidentiality, transparency, and the protection of identity. Finally, we’ll suggest how we might integrate big data ethics into our society. Law will be an important part of Big Data Ethics, but so too must the establishment of ethical principles and best practices that guide government, corporations, and users. We must all be part of the conversation, and part of the solution. Big Data Ethics are for everyone.

Storm – Streaming Data Analytics at Scale
Scott Shaw
Storm’s primary purpose is to provide real-time analytics against fast-moving data before it’s stored. The use cases range from fraud detection and machine learning to ETL.
Storm has been clocked at over 1 million tuples processed per second per node. It’s fast, scalable, and language agnostic. This session provides an architecture overview as well as a real-world discussion of its use and implementation at Enterprise Holdings.

Intel’s Big Data and Hadoop Security Initiatives
Todd Speck
In this talk, we will cover various aspects of software and hardware initiatives that Intel is contributing to Hadoop as well as other aspects of our involvement in solutions for Big Data and Hadoop, with a special focus on security. We will discuss specific security initiatives as well as our recent partnership with Cloudera. You should leave the session with a clear understanding of Intel’s involvement and contributions to Hadoop today and coming in the near future.

GPUs in Big Data
John Tran
Modern graphics processing units (GPUs) are massively parallel general-purpose processors that are taking Big Data by storm. In terms of power efficiency, compute density, and scalability, it is clear now that commodity GPUs are the future of parallel computing. In this talk, we will cover diverse examples of how GPUs are revolutionizing Big Data in fields such as machine learning, databases, genomics, and other computational sciences.

Securing Big Data: Instrumentation, visibility and control
Karl Wehden
Big data presents new security challenges to organizations, new data users, and new technologies. How does this affect my security framework and strategy? How do I build an understanding of the threat model? What is the best method to measure associated risk? These are questions that need to be answered as the volume, variety and velocity of data continue to increase. In this session, IBM will talk through the changing landscape of big data security, including its key threats and common controls, and share lessons learned in securing and managing risk associated with big data.

Making Machine Learning work in Practice
Kilian Q. Weinberger
Here I will go over common pitfalls and tricks on how to make machine learning work.



The Evolution of Data Analysis with Hadoop
Tom Wheeler
This session will lead the audience through the evolution of data analysis in Hadoop to illustrate its progression from the original low-level, batch-oriented MapReduce approach to today’s higher-level interactive tools that require very little technical knowledge.  We’ll discuss Apache Crunch, Hive, Impala and Solr.

While the nature of this talk is somewhat technical, no prior knowledge of Hadoop or any specific programming language is required. Frequent live demonstrations of the tools discussed will emphasize that analyzing data in Hadoop can be as easy as using a relational database or Internet search engine.

Big Data Panel

  • Ed Domain – Founder and COO at Techli
  • Brian Schwartz – CTO at BarrelFish
  • Dheeraj Patri – Chief Technology Officer at FoodEssentials
  • Bob Ward – CTO at Juristat

Theme: Big Data is great, but creating smart data can help you generate results quickly. Having a clear understanding of your goals and objectives when you start a project that could involve Big Data will help reduce the amount of Big Data that you use or need. Approaching Big Data correctly will help you access only the data that you need, which will allow you to move faster and, if necessary, scale your architecture.