StampedeCon 2014 Training

StampedeCon 2014 training information

May 27, 2014

May 28, 2014


Workshop #1

Understanding the NoSQL Landscape and Architecture
(May 27, 8:00 AM – 12:00 PM), $299

This is a fast-paced, technical overview of the NoSQL landscape. The objectives for this training include the following:

  • Introduce students to the core technical concepts of Big Data
  • Provide a general overview of the most common NoSQL stores
  • Explain how to choose the correct NoSQL database for specific use cases from a technical perspective
  • Cover the fundamentals of the architecture of Cassandra, HBase, MongoDB and Neo4J
  • Give a quick overview of Hadoop (HDFS/MapReduce) within the context of NoSQL
  • Familiarize students with the emerging architectures in the world of NoSQL: Impala, Drill, the Stinger initiative and Storm
  • Demonstrate the column-family databases Cassandra and HBase live. Both databases will be running in the cloud, and students will be given the URL to log into the management framework (web UI) for both databases.

Who Should Attend:
This workshop is targeted at both technical and non-technical professionals who want to understand the emerging world of Big Data. No prior knowledge of databases or programming is assumed. Engineers, Programmers, Networking specialists, Managers and Executives should plan on attending.


Training Session Outline:

  • The data deluge: the generators of Big Data (structured vs. unstructured data)
  • The limitations of SQL/RDBMS
  • The white papers that started it all: GFS, MapReduce, BigTable, Dynamo
  • CAP Theorem: Consistency, Availability, Partition Tolerance
  • NoSQL flavors: Key-Value stores (Voldemort, Dynamo, Riak), Data-structure stores (Redis), Document stores (CouchDB, MongoDB), Column Family stores (BigTable, HBase, Cassandra), Graph stores (Neo4J, HyperGraphDB)
  • Quick Comparison of licenses for different open source NoSQL databases
  • How to pick a NoSQL database
  • Use cases for various NoSQL databases
  • Recommendations on resources for continued learning about NoSQL after the workshop


Workshop #2

Hadoop Distributed File System (HDFS) and MapReduce
(May 27, 1:00 PM – 5:00 PM), $299

In this workshop you will learn the essentials of the Hadoop Distributed File System (HDFS), which provides high-throughput access to application data, and MapReduce, the system used for parallel computational processing of large data sets. This will be a fast-paced technical deep dive into HDFS and MapReduce. No programming code review will be done. Instead, the focus is on understanding the architecture of the Hadoop JVMs (NameNode, DataNode, JobTracker, TaskTracker), how they interact with each other, what happens when failures arise, and recommendations for performance optimization. By the end of the class, engineers will have a clear, technical idea of where Hadoop fits nicely into a project and where it is a bad idea.

A live Hadoop cluster in the cloud will be used to demo writes into HDFS and reads using MapReduce. Attendees will be provided a link to Cloudera’s Hadoop Manager web UI to interact with the Hadoop cluster via a browser while the demos are running. Many books, links, blog posts and YouTube videos will be referenced so attendees have a clear path of resources for continuing to learn about Hadoop after the workshop.

Who Should Attend:
Engineers, Programmers and Networking specialists who want to use Hadoop for developing, deploying and managing Big Data applications, solutions and products. This technical session will get engineers started on their path to Hadoop; however, managers and executives are welcome to sit in and learn some of the inner workings of this emerging technology.

No previous knowledge of Hadoop is assumed.

Training Session Outline:


  • Linux File system options
  • NameNode & DataNode architecture
  • Write Pipeline
  • Read Pipeline
  • HDFS Shell Commands
  • Hadoop’s Network Topology
  • Administration fundamentals: Heartbeats, Block Reports, Rack Awareness, Block Scanner, Balancer, Health Check, the hdfs-site.xml file
  • Exploring the HDFS Web UI
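The NameNode/DataNode split and the write pipeline covered above can be sketched in miniature. The following is a toy Python model, not Hadoop's actual API: the block size, node names and round-robin placement are all illustrative (real HDFS of this era defaults to 64 MB blocks, replication factor 3 and rack-aware placement).

```python
# Toy model of how HDFS splits a file into blocks and replicates each
# block across DataNodes. Sizes and node names are illustrative only.

BLOCK_SIZE = 4          # bytes, tiny for demonstration (HDFS default: 64 MB)
REPLICATION = 3
DATANODES = ["dn1", "dn2", "dn3", "dn4", "dn5"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Chop a file's bytes into fixed-size blocks, as the HDFS client does."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(block_id, nodes=DATANODES, replication=REPLICATION):
    """Pick `replication` distinct DataNodes for one block.
    Real HDFS uses rack awareness; simple round-robin stands in for it here."""
    start = block_id % len(nodes)
    return [nodes[(start + r) % len(nodes)] for r in range(replication)]

data = b"0123456789"                      # a 10-byte "file"
blocks = split_into_blocks(data)          # [b"0123", b"4567", b"89"]
block_map = {i: place_replicas(i) for i in range(len(blocks))}
# The NameNode holds only this metadata; the block bytes live on DataNodes.
```

This also hints at why the NameNode is a single point of metadata: losing `block_map` makes the blocks on the DataNodes unrecoverable, which is why Heartbeats and Block Reports matter.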


  • MapReduce Architecture: JobTracker/TaskTracker
  • Combiner
  • Partitioner
  • Shuffle and Sort
  • Speculative Execution
  • Job Scheduling
  • Input/Output formats
  • Thinking in the MapReduce way
  • Exploring the MapReduce Web UI
  • Intro to Monitoring and Debugging on a production cluster
  • Classic Use case: Word Count in MapReduce
  • Structured Data Use Case: Analyzing web traffic logs with MapReduce
  • Unstructured Data Use Case: Facial Recognition against CCTV video files using MapReduce
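The classic word-count use case in the outline above can be previewed without a cluster. The following is a minimal Python simulation of the map, shuffle-and-sort and reduce phases; it is a conceptual model of the data flow, not Hadoop's Java MapReduce API.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in an input line."""
    return [(word, 1) for word in line.split()]

def shuffle_and_sort(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reducer: sum the counts for one word."""
    return (key, sum(values))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, vs) for k, vs in shuffle_and_sort(mapped))
# result["the"] == 3 and result["fox"] == 2
```

On a real cluster the mappers run in parallel on the DataNodes holding each input block, and a Combiner could pre-sum counts locally before the shuffle, which is exactly the architecture the session walks through.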


Workshop #3

Apache Hive
(May 28, 8:00 AM – 12:00 PM), $299

While MapReduce is powerful, it is complex and not necessarily intuitive for all audiences. It requires Java programming experience and a fundamental understanding of distributed systems. As former SQL DBAs enter the world of Big Data, Hive is typically their first step. Hive converts SQL-like commands into MapReduce jobs that are executed on a massive Hadoop cluster. Hive was originally started at Facebook about five years ago to help Facebook’s data analysts quickly query structured data sets, and almost 90% of the queries into the Hadoop cluster at Facebook now arise out of Hive.

Hive uses certain architectural tricks to give Hadoop a SQL-like wrapper. In this workshop you will learn about these tricks and spend one hour on a hands-on Hive lab. In the lab, students will get access to a live Hadoop cluster in the cloud. The lab will walk you through loading one million movie reviews into Hive and analyzing them with joins, counts and various SELECT queries. By the end of the workshop, you will see that using Hadoop doesn’t necessarily require a deep understanding of HDFS or MapReduce internals, but just some familiarity with SQL.

Who Should Attend:
Engineers, Programmers and Networking specialists with SQL experience who want to use Hadoop for developing, deploying and managing Big Data applications, but who don’t have any previous Hadoop programming experience. Managers and Executives could also pick up some interesting takeaways about how Hive makes Hadoop adoption painless; however, the material covered will be technical and geared more towards engineers who want some hands-on experience with Hive.

No previous knowledge of Hadoop is assumed.

Training Session Outline:

  • Hive philosophy and architecture
  • Hive vs. RDBMS
  • HiveQL and Hive Shell
  • Managing tables
  • Data types and schemas
  • Querying data
  • Partitions and Buckets
  • Intro to User Defined Functions
  • Lab: Analyzing movie reviews with Hive
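To preview the kind of queries the lab covers, here is a small sketch that uses Python’s built-in sqlite3 as a stand-in for Hive. The table and column names are invented for illustration, and HiveQL differs from SQLite in places (table creation, partitioning, data loading), but the joins, counts and SELECTs are the same idea.

```python
import sqlite3

# SQLite stands in for Hive here: both accept SQL-style joins, counts
# and SELECT queries. Table and column names are invented for this sketch.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE movies (movie_id INTEGER, title TEXT)")
cur.execute("CREATE TABLE reviews (movie_id INTEGER, rating INTEGER)")
cur.executemany("INSERT INTO movies VALUES (?, ?)",
                [(1, "Vertigo"), (2, "Jaws")])
cur.executemany("INSERT INTO reviews VALUES (?, ?)",
                [(1, 5), (1, 4), (2, 3)])

# Count reviews per title with a join: the bread and butter of the Hive lab.
cur.execute("""
    SELECT m.title, COUNT(*) AS n_reviews
    FROM reviews r JOIN movies m ON r.movie_id = m.movie_id
    GROUP BY m.title
    ORDER BY n_reviews DESC
""")
rows = cur.fetchall()   # [("Vertigo", 2), ("Jaws", 1)]
```

The difference in the lab is scale and execution: Hive compiles a query like this into MapReduce jobs that scan the one-million-review data set in parallel across the cluster.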


Workshop #4

Apache Cassandra
(May 28, 1:00 PM – 5:00 PM), $299

Cassandra was open sourced by Facebook in 2008 after their engineers married concepts from Amazon’s Dynamo database with Google’s BigTable database. Since then, Cassandra has become the leading column-family database (ahead of HBase) and was perhaps the fastest-moving open source project of 2013. This technical workshop will cover the inner workings of this popular NoSQL database and jumpstart attendees on the fundamental concepts in Cassandra. A sample 3-node Cassandra cluster running in Rackspace’s cloud will be used for live demos: writing data to Cassandra using CQL, flushing the memtables, and using OpsCenter will all be demonstrated. Additionally, the instructor will provide links to further training materials that attendees can use to hone their skills.

Who Should Attend:
Engineers, Programmers and Networking specialists who are interested in the inner workings of Cassandra’s architecture. Managers and Executives could also pick up some interesting takeaways about where Cassandra is a good fit and where it is not; however, the material covered will be technical and geared more towards engineers.


Training Session Outline:

  • Traditional Ring vs. VNodes
  • Partitioners: Murmur3 vs. MD5
  • Column Family data modeling
  • Write Path: Memtables, Commit Log and SSTables
  • Read Path: Memtables, Caches, Bloom Filters and SSTables
  • Data Replication Strategies
  • Internals: Hinted handoff, Tombstones
  • The cassandra.yaml configuration file
  • Nodetool utility
  • CQL
  • DataStax OpsCenter
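The partitioner bullet above can be illustrated in a few lines. Below is a toy Python model of how an MD5-based partitioner (the idea behind Cassandra’s RandomPartitioner) hashes a row key to a token and walks the ring to find the owning node. The node names and the tiny 0–99 token space are invented for readability; real RandomPartitioner tokens span 0 to 2**127.

```python
import hashlib

# Toy ring: each node owns the token range up to (and including) its token.
# Real Cassandra tokens span 0 .. 2**127; we shrink the space to 0..99 so
# the example stays readable. Node names are invented for this sketch.
RING = [(24, "node-A"), (49, "node-B"), (74, "node-C"), (99, "node-D")]

def token_for(row_key):
    """MD5 the row key and fold the digest into our tiny 0..99 token space."""
    digest = hashlib.md5(row_key.encode()).hexdigest()
    return int(digest, 16) % 100

def node_for(row_key):
    """Walk the ring clockwise to the first node whose token covers the key."""
    token = token_for(row_key)
    for node_token, name in RING:
        if token <= node_token:
            return name
    return RING[0][1]  # wrap around past the highest token

# Every node computes the same token for a key, so any node can route
# a request: there is no master in Cassandra's peer-to-peer design.
owner = node_for("user:42")
```

Replication strategies then extend this: with replication factor 3, the two nodes following the owner on the ring would also store copies of the row.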


About the Instructor

Sameer Farooqui
Sameer Farooqui is a freelance big data consultant and trainer with deep industry expertise in the Hadoop and NoSQL domains. For the past five years, he has deployed various clustering software packages internationally to clients including Fortune 500 companies, governments, hospitals and banks. In the past year, he has taught over 50 courses, of which about 30 were on-site. All of the curriculum Sameer teaches, along with the lectures and labs, was custom developed by him.

Most recently he was a Systems Architect at Hortonworks, where he specialized in designing Hadoop prototypes and proof-of-concept use cases. While at Hortonworks, Sameer also taught Hadoop Developer classes and visited various customers as a sales engineer to brainstorm use cases. The core Hadoop products he specializes in are HDFS, MapReduce, HCatalog, Pig, Hive, HBase and ZooKeeper.

Previously, Sameer worked at Accenture’s Silicon Valley R&D lab, where he was responsible for studying NoSQL databases, Cloud Computing and MapReduce for their commercial applicability to emerging big data problems. At Accenture Tech Labs, Sameer was the lead engineer for creating a 32-node prototype using Cassandra and AWS to host 10 TB of Smart Grid data. He also worked on a 30+ person team in the design phase of a multi-environment Hadoop cluster pilot project at NetApp.

Before Hortonworks and Accenture, Sameer spent five years at Symantec, where he deployed Veritas Clustering and Storage Foundation solutions (VCS, VVR, SF-HA) to Fortune 500 and government clients throughout North America.

Sameer is a regular speaker at Big Data conferences and meetups.