StampedeCon 2016 Technical Workshops

Below you will find information on the technical workshops we are offering as part of the StampedeCon 2016 Big Data Conference: Analysis with Big Data Tools and Build a Real-Time Streaming IoT Solution.

Analysis with Big Data Tools

Throughout the workshop, hands-on exercises reinforce the topics being discussed. These exercises will be performed in a virtual machine provided in VMware format. Students can download VMware Player for Windows or Linux; Mac users can run the virtual machine using VMware Fusion. The VM requires at least 3 GB of RAM and 1 CPU core.

Date: July 26th, 2016
Time: 8:00am to 5:00pm
Location: City View Ballroom at the St. Louis City Center Hotel, 400 S 14th St, St. Louis, MO 63103
Presenter: Tom Wheeler

Cloudera University StampedeCon 2016 Big Data Conference Technical Workshop

About the Workshop

For several decades, traditional relational databases have provided a convenient way to analyze data. Newer tools, offering increased scalability and greater flexibility, have helped meet the demand arising from fundamental changes in the amount and type of data that organizations now produce. During this one-day workshop, you will learn about the core components of Hadoop, how to exchange data between traditional databases and a Hadoop cluster, how to leverage existing SQL experience with Big Data analysis tools like Hive and Impala, how organizations are using machine learning techniques, and how to make sure you choose the right analysis tool for the job.

About the Presenter

Tom Wheeler’s career spans more than twenty years in the communications, biotech, financial, healthcare, aerospace, and defense industries. He is currently Principal Curriculum Developer at Cloudera, where he works on material used to train the world’s next generation of data professionals. Before joining Cloudera in 2011, he developed engineering software at Boeing, helped to design and implement a high-volume data processing system for WebMD, and created financial applications for brokerage firm A.G. Edwards.

Prerequisites

This course is designed for data analysts, business intelligence specialists, developers, system architects, and database administrators. Basic knowledge of SQL is recommended, but prior experience with Apache Hadoop or related tools is not required. The workshop exercises will be performed in the VMware virtual machine described above.

Detailed Course Agenda

1. Introduction

2. Hadoop Fundamentals

  • The Motivation for Hadoop
  • Hadoop Overview
  • Data Storage: HDFS
  • Data Processing: MapReduce, YARN, and Spark (see the sketch after this list)
  • Data Analysis: Pig, Hive, and Impala
  • Database Integration: Apache Sqoop
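
To give a feel for the data processing layer, here is a minimal word-count sketch using Spark's Java API (assuming Spark 2.x); the class name and the HDFS paths are hypothetical.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class WordCount {
        public static void main(String[] args) {
            // local[*] runs Spark in-process; on a real cluster YARN would manage resources
            SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                // Hypothetical HDFS input path
                JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input.txt");
                JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey((a, b) -> a + b);
                // Hypothetical HDFS output directory
                counts.saveAsTextFile("hdfs:///user/demo/wordcounts");
            }
        }
    }

The equivalent job written against the raw MapReduce API would take considerably more code, which is part of the motivation for Spark covered in this module.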

3. Introduction to Impala and Hive

  • What is Hive?
  • What is Impala?
  • Comparing Hive and Impala to Traditional Databases
  • Use Cases

4. Querying Data with Hive and Impala

  • Using the Impala and Hive Shells
  • Databases, Tables, and Metadata
  • Basic Query Language Syntax
  • Joining Datasets (see the example after this list)
  • Common Built-in Functions
  • Using Hue to Execute Queries
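
For a flavor of the query syntax, here is a minimal sketch that runs a join with common built-in functions against HiveServer2 over JDBC; the shells and Hue submit the same SQL interactively. The endpoint, credentials, and the customers/orders tables are assumptions for illustration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuerySketch {
        public static void main(String[] args) throws Exception {
            // Load the HiveServer2 JDBC driver (the hive-jdbc jar must be on the classpath)
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // Hypothetical HiveServer2 endpoint and credentials
            String url = "jdbc:hive2://localhost:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {
                // Hypothetical tables; standard joins and built-ins work much as in SQL
                ResultSet rs = stmt.executeQuery(
                    "SELECT c.name, COUNT(*) AS orders, ROUND(SUM(o.total), 2) AS revenue " +
                    "FROM customers c JOIN orders o ON c.id = o.customer_id " +
                    "GROUP BY c.name ORDER BY revenue DESC LIMIT 10");
                while (rs.next()) {
                    System.out.printf("%s\t%d\t%.2f%n",
                        rs.getString(1), rs.getLong(2), rs.getDouble(3));
                }
            }
        }
    }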

5. Data Management for Hive and Impala

  • Data Storage
  • Choosing a File Format (see the sketch after this list)
  • Creating Databases and Tables
  • Data Types
  • Loading Data
  • Storing Query Results
  • Understanding Query Performance
  • Extension Points for Hive and Impala (Overview)
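
As one illustration of file-format choice, this sketch (again over JDBC, with a hypothetical endpoint and table names) creates a Parquet-backed table and populates it from a text-format staging table; columnar formats such as Parquet typically scan faster for analytic queries.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class ParquetTableSketch {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            String url = "jdbc:hive2://localhost:10000/default"; // hypothetical endpoint
            try (Connection conn = DriverManager.getConnection(url, "hive", "");
                 Statement stmt = conn.createStatement()) {
                // Columnar storage suits analytic scans better than delimited text
                stmt.execute("CREATE TABLE IF NOT EXISTS web_logs " +
                             "(ip STRING, ts TIMESTAMP, url STRING, status INT) " +
                             "STORED AS PARQUET");
                // Populate from a hypothetical text-format staging table
                stmt.execute("INSERT OVERWRITE TABLE web_logs " +
                             "SELECT ip, ts, url, status FROM web_logs_staging");
            }
        }
    }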

6. Overview of Machine Learning

  • Introduction: The Three C’s of Machine Learning
  • Collaborative Filtering Use Cases
  • Clustering Use Cases (see the sketch after this list)
  • Classification Use Cases
  • Relationship of Algorithms and Data Volume
  • Status of Apache Mahout, Spark MLlib, and Spark ML
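
To make one of these use cases concrete, here is a minimal k-means clustering sketch using the DataFrame-based Spark ML API (assuming Spark 2.x); the tiny in-memory dataset stands in for real feature vectors that would normally be loaded from HDFS.

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.ml.clustering.KMeans;
    import org.apache.spark.ml.clustering.KMeansModel;
    import org.apache.spark.ml.linalg.VectorUDT;
    import org.apache.spark.ml.linalg.Vectors;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.types.Metadata;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    public class ClusteringSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("ClusteringSketch").master("local[*]").getOrCreate();
            // Toy 2-D points forming two obvious clusters
            List<Row> data = Arrays.asList(
                RowFactory.create(Vectors.dense(0.0, 0.1)),
                RowFactory.create(Vectors.dense(0.2, 0.0)),
                RowFactory.create(Vectors.dense(9.0, 9.1)),
                RowFactory.create(Vectors.dense(9.2, 9.0)));
            StructType schema = new StructType(new StructField[]{
                new StructField("features", new VectorUDT(), false, Metadata.empty())});
            Dataset<Row> df = spark.createDataFrame(data, schema);
            // Fit k-means with k=2 and assign each point to a cluster
            KMeansModel model = new KMeans().setK(2).setSeed(1L).fit(df);
            model.transform(df).show();
            spark.stop();
        }
    }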

7. Choosing the Best Tool for the Job

  • Comparing MapReduce, Spark, Hive, Impala, and Relational Databases
  • Which to Choose?

8. Conclusion

Build a Real-Time Streaming IoT Solution — From Sensor to Dashboard in a Day

In this workshop you will learn how to use Apache NiFi, Kafka, and Storm to create real-time IoT data pipelines. The deployment of IoT sensor networks and the availability of distributed data processing systems mean that companies are becoming more data-centric as they realize the value in their data.

Date: July 26th, 2016
Time: 8:00am to 5:00pm
Location: City View Ballroom at the St. Louis City Center Hotel, 400 S 14th St, St. Louis, MO 63103
Presenters: Joey Frazee and David Kjerrumgaard

Hortonworks University StampedeCon 2016 Big Data Conference Technical Workshop

More About this Workshop

Unfortunately, much of that data still isn't, or can't be, used. For example, according to the McKinsey Global Institute, often 1% or less of the data from oil rigs is ever examined, and similar patterns are occurring in most industries. To realize the full potential of your IoT applications, you need to collect and analyze all of your data, and you have to do it in real time. This course is an introduction to dataflow and real-time streaming technologies. We will discuss the state of the industry, provide overviews of NiFi, Kafka, and Storm, and get hands-on with all three, building a fully functional IoT sensor ingestion and analytics pipeline.
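
As a preview of the Kafka portion, here is a minimal Java producer that publishes simulated sensor readings; the broker address and the iot-sensors topic are assumptions for illustration, and in the workshop pipeline NiFi performs this kind of publishing.

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SensorProducer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 100; i++) {
                    String sensorId = "s-" + (i % 10);
                    // Simulated temperature reading as a small JSON document
                    String reading = String.format(
                        "{\"sensorId\":\"%s\",\"tempF\":%.1f}",
                        sensorId, 60 + Math.random() * 40);
                    // Keying by sensor ID keeps each sensor's readings in order
                    producer.send(new ProducerRecord<>("iot-sensors", sensorId, reading));
                    Thread.sleep(250);
                }
            }
        }
    }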

Skills Taught

  1. An understanding of dataflow and streaming technology
  2. How to use and operate Apache NiFi, Kafka, and Storm together
  3. When to use NiFi, Kafka, and Storm, and for what purposes
  4. How to build real-time data flows for IoT sensor data ingestion and analytics

Requirements

The hands-on labs will be conducted on a VM that participants will run on their own laptops.

Course Objectives
  1. Discuss what data flow is, why it’s important and what tools are available
  2. Apache NiFi
    • Describe NiFi architecture and components (FlowFiles, processors, relationships, connections, process groups, remote process groups)
    • Learn how to build data flows using NiFi
    • Understand how to configure and operate NiFi data flows
    • Become familiar with common NiFi use cases and design patterns
  3. Apache Kafka
    • Describe Kafka architecture (topics, partitions, producers, consumers, brokers)
    • Know how to read from and write to Kafka using NiFi
  4. Apache Storm
    • Describe Storm architecture and topology concepts (tuples, streams, spouts, bolts, Nimbus, supervisors, workers)
    • Know how to build Storm topologies (see the sketch after this list)
    • Know how to deliver data to Storm with NiFi
    • Understand when to use Storm vs. NiFi
  5. Develop a complete data flow using NiFi, Kafka, and Storm
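
To make the topology concepts concrete, here is a minimal Storm topology sketch in Java (assuming Storm 1.x): a spout emits simulated sensor readings and a bolt flags high temperatures. The class names and threshold are illustrative assumptions, and in the full pipeline a Kafka spout fed by NiFi would replace the random spout.

    import java.util.Map;
    import java.util.Random;

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.utils.Utils;

    public class SensorTopology {

        // Stand-in for a Kafka spout: emits one simulated reading per call
        public static class RandomSensorSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;
            private Random random;

            public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
                this.collector = collector;
                this.random = new Random();
            }

            public void nextTuple() {
                Utils.sleep(250);
                collector.emit(new Values("s-" + random.nextInt(10),
                                          60 + random.nextDouble() * 40));
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("sensorId", "tempF"));
            }
        }

        // Flags readings above a hypothetical alert threshold
        public static class AlertBolt extends BaseBasicBolt {
            public void execute(Tuple tuple, BasicOutputCollector collector) {
                double temp = tuple.getDoubleByField("tempF");
                if (temp > 95.0) {
                    System.out.println("ALERT " + tuple.getStringByField("sensorId")
                                       + ": " + temp);
                }
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) { }
        }

        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("sensors", new RandomSensorSpout());
            builder.setBolt("alerts", new AlertBolt()).shuffleGrouping("sensors");
            // Local mode for experimentation; production topologies go to a cluster
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("iot-demo", new Config(), builder.createTopology());
            Utils.sleep(10000);
            cluster.shutdown();
        }
    }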

Hands-On Labs
  • Building a NiFi data flow
  • Building a Storm topology
  • Creating a real-time IoT data pipeline using NiFi, Kafka, and Storm