Piloting Big Data: Where To Start?
You know you have data. You know there are problems it can solve. But how do you bridge the gap between the two? We’ll describe a road-tested strategy for moving methodically from a business problem, through an understanding of data requirements and technical capabilities, to a targeted roadmap for your pilot project.
Apache Spark: the next big thing?
It’s been called the leading candidate to replace Hadoop MapReduce. Apache Spark uses fast in-memory processing and a simpler programming model to speed up analytics and has become one of the hottest technologies in Big Data.
In this talk we’ll discuss:
- What is Apache Spark and what is it good for?
- Spark’s Resilient Distributed Datasets
- Spark integration with Hadoop, Hive and other tools
- Real-time processing using Spark Streaming
- The Spark shell and API
- Machine Learning and Graph processing on Spark
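To give a feel for the Resilient Distributed Dataset idea listed above, here is a toy, single-process sketch in plain Python. It is not the Spark API: the `ToyRDD` class and its methods are hypothetical names invented for illustration. It only mimics three properties commonly associated with RDDs: datasets are immutable, transformations are lazy, and each dataset remembers the lineage of steps that produced it so it can be recomputed from the source.

```python
from functools import reduce


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark): immutable, lazy, lineage-tracked."""

    def __init__(self, source, lineage=("source",)):
        self._source = source    # zero-argument callable that yields the records
        self.lineage = lineage   # names of the steps that produce this dataset

    @classmethod
    def from_list(cls, items):
        data = list(items)
        return cls(lambda: iter(data))

    def map(self, fn):
        # Transformation: returns a new dataset; nothing is computed yet.
        return ToyRDD(lambda: (fn(x) for x in self._source()),
                      self.lineage + ("map",))

    def filter(self, pred):
        # Transformation: also lazy, also returns a new dataset.
        return ToyRDD(lambda: (x for x in self._source() if pred(x)),
                      self.lineage + ("filter",))

    def collect(self):
        # Action: replays the lineage from the source and materializes the result.
        return list(self._source())

    def reduce(self, fn):
        # Action: folds the recomputed records into a single value.
        return reduce(fn, self._source())


numbers = ToyRDD.from_list(range(1, 11))
even_squares = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)

print(even_squares.lineage)    # ('source', 'filter', 'map')
print(even_squares.collect())  # [4, 16, 36, 64, 100]
```

Because every action replays the recorded steps from the source, a lost result can simply be recomputed. In this toy version that is the whole story; a real engine would also partition the data and distribute the work.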
Big Data: Infrastructure Implications for “The Enterprise of Things”
The amount of data in our world has been exploding, and storing and analyzing large data sets (so-called big data) will become a key basis of competition for the new “Enterprise of Things”, underpinning fresh waves of productivity growth, innovation, and consumer surplus. Leaders in every sector, from government to healthcare to finance, will have to grapple with the implications of big data, as data growth continues unabated for the foreseeable future. The quest to make sense of all this big data begins with breaking down data silos within organizations, using cost-appropriate shared infrastructure to ensure optimal extraction and analysis of data, knowledge and insight.
As the leading global e-commerce service, PayPal has transformed the way the company leverages big data storage and hyper-scale analytics to help improve both the safety and purchasing experiences of its online customers. In this discussion, using real-world customer examples such as PayPal, we will explore what Big Data Storage is from high performance file sharing to long-term archiving, as well as ways to break down data silos to reduce the cost and storage complexity of managing demanding workflows and data environments. We will demonstrate how hyperscale storage can enable near-real-time, stream analytics processing for behavioral and situational modeling, as well as for fraud detection, marketing and systems intelligence. We will ask what the greatest barriers to effective business analytics are and how today’s data analytics platforms, including Hadoop, Vertica, Python and Java, can be optimized to enable machine learning, event streaming, forecasting, and reduce overhead associated with human intervention. You’ll come away from this session understanding the infrastructure implications and options for organizations looking to maximize their big data for competitive advantage.
How we solved the large-scale offline data ingestion problem to power Omni-channel use-cases for Retailers
Offline data ingestion is an essential ingredient of every platform and the first step toward solving any use case. By simplifying the process of ingesting data feeds (ranging from product catalogs, point of sale, in-store sensor, social and many more forms of structured & semi-structured data) into the platform, we bring retailers one step closer to fulfilling their omni-channel vision.
We built our component with self-service in mind, in order to give control to the user. Our data ingestion component performs schema discovery, validation and error reporting, so that a user can push any file to the platform without first defining a schema. The entire process is handled by launching custom workflows on the platform. And we don’t stop there: we can convert the file into serialization formats such as Avro, then create a Hive table where the data can be further cleansed and transformed (ELT) using HiveQL.
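As a rough illustration of what schema discovery with validation and error reporting can look like, here is a minimal Python sketch for a CSV feed. It is not the platform's implementation: the function names, the three-type lattice (`int`, `float`, `string`) and the sample feed are all assumptions made for this example.

```python
import csv
import io

# Hypothetical widening order: a column's type only ever moves to the right.
WIDENING = {"int": 0, "float": 1, "string": 2}


def infer_type(value):
    """Return the narrowest of 'int', 'float', 'string' that fits one cell."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return name
        except ValueError:
            continue
    return "string"


def discover_schema(text):
    """Infer a column -> type mapping from CSV text and collect row-level errors."""
    reader = csv.DictReader(io.StringIO(text))
    schema, errors = {}, []
    for row in reader:
        # DictReader stores surplus fields under the key None and fills
        # missing fields with the value None, so both cases are detectable.
        if None in row or None in row.values():
            errors.append(f"line {reader.line_num}: wrong number of fields")
            continue
        for column, value in row.items():
            if value == "":
                continue  # empty cells do not influence the inferred type
            seen = infer_type(value)
            current = schema.get(column, "int")
            schema[column] = max(current, seen, key=WIDENING.get)
    for column in reader.fieldnames or []:
        schema.setdefault(column, "string")  # columns with no usable values
    return schema, errors


sample = "\n".join([
    "sku,price,qty",
    "A1,9.99,3",
    "B2,12,x",
    "C3,5.00",
    "D4,1.50,7,extra",
])
schema, errors = discover_schema(sample)
print(schema)  # {'sku': 'string', 'price': 'float', 'qty': 'string'}
print(errors)  # ['line 4: wrong number of fields', 'line 5: wrong number of fields']
```

Malformed rows are reported rather than silently dropped, and the inferred types could then drive the choice of an Avro schema or Hive column types downstream.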
We’ll discuss the architecture, design and technology choices, the roadblocks we hit and the lessons we learned while building the entire platform on open source technologies from the big data ecosystem, and how it has accelerated the business.
The Emergence of Smart Cities & Infinite Data
Cities are in the midst of a seismic shift that is staring right at them, yet the subtlety of this shift is very powerful due to its transparency. This is all about the “Internet of Things,” where trillions of connected devices will bring significant change and provide untold opportunities and tremendous challenges for business, government and you. Join us for an exploration of the latest implementations and cutting edge technologies that are generating an infinite amount of data in our urban environments and how Smart Cities are successfully capitalizing on this opportunity. Examples from Shanghai, Singapore, Abu Dhabi, Doha, Dubai, Jeddah, London and Kansas City will be showcased to provide a roadmap for you to move forward in this greatest of ages, the age of the Smart City.
Enabling Key Business Advantage from Big Data through Advanced Ingest Processing
All too often we see critical data dumped into a “Data Lake”, causing the data waters to stagnate and become a “Data Swamp”. We have found that many data transformation, quality, and security processes can be addressed a priori on ingest to improve the quality and accessibility of the data. Data can still be stored in raw form if desired, but this processing on ingest can unlock operational effectiveness and competitive advantage by integrating fresh and historical data and realizing the data’s full potential. We will discuss the underpinnings of stream processing engines, review several relevant business use cases, and discuss future applications.
Which database can handle a 3TB multidimensional array?
At the Climate Corporation, we have a great demand for storing large amounts of raster-based data, and an even greater demand to retrieve small amounts of it quickly. We have a lot of weather data, elevation data, satellite imagery, and many other kinds of data. Doc Brown is our distributed, immutable, versioned database for storing big multidimensional data, and it is one of the core systems in production.
In this talk, we’ll take a look at the library design and features, and at the engineering challenges of managing versions, indexes, and caching while keeping data fast to ingest and to query. We will also walk through an example of how to do quick development and validation of data.
Managing Genomes At Scale: What We Learned
Monsanto generates large amounts of genomic sequence data every year. Agronomists and other scientists use this data as input for predictive analytics to aid breeding and the discovery of new traits such as disease or drought resistance. To enable the broadest possible use of this valuable data, scientists would like to query genomic data by species, chromosome, position, and myriad other categories. We present our solutions to these problems, as realized on top of HBase here at Monsanto. We will discuss what we learned about flat/wide vs. tall/narrow HBase schema design, preprocessing and caching windows of data for use in web-based visualizations, approaches to complex multi-join queries across deep data sets, and distributed indexing via SolrCloud.
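The flat/wide versus tall/narrow trade-off mentioned above can be sketched without an HBase client. The Python below is an illustration under stated assumptions, not Monsanto's schema: the key layout, the `maize` and `chr01` values and all function names are hypothetical. It shows the two ways of placing a genomic position (in the row key, or in the column qualifier) and why zero-padding matters when keys are sorted as bytes.

```python
from bisect import bisect_left


def tall_row_key(species, chromosome, position, width=10):
    """Tall/narrow: one row per position. Zero-padding keeps string order equal to numeric order."""
    return f"{species}|{chromosome}|{position:0{width}d}"


def wide_cell(species, chromosome, position, width=10):
    """Flat/wide: one row per chromosome; the position becomes the column qualifier."""
    return f"{species}|{chromosome}", f"pos:{position:0{width}d}"


def scan(sorted_keys, start_key, stop_key):
    """Half-open range scan [start_key, stop_key) over lexicographically sorted keys."""
    lo = bisect_left(sorted_keys, start_key)
    hi = bisect_left(sorted_keys, stop_key)
    return sorted_keys[lo:hi]


positions = [5, 42, 1000, 250000]
keys = sorted(tall_row_key("maize", "chr01", p) for p in positions)

# Fetch the window of positions 40..1000 with a single contiguous scan.
window = scan(keys,
              tall_row_key("maize", "chr01", 40),
              tall_row_key("maize", "chr01", 1001))
print(window)  # ['maize|chr01|0000000042', 'maize|chr01|0000001000']
print(wide_cell("maize", "chr01", 42))  # ('maize|chr01', 'pos:0000000042')
```

In the tall layout a window of positions is a contiguous key range, which suits the windowed reads behind a visualization; in the wide layout the same window is a slice of columns within one row.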
Datomic – A Modern Database
Datomic is a distributed database designed to run on next-generation cloud architectures. Datomic stores facts and retractions using a flexible schema, consistent transactions, and a logic-based query language. The focus on facts over time gives you the ability to look at the state of the database at any point in time and traverse your transactional data in many ways.
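The idea of storing facts and retractions over time can be illustrated with a small append-only log in Python. This is a conceptual sketch, not Datomic's API or query language: the `Fact` and `FactLog` names, the integer transaction counter and the sample e-mail addresses are assumptions made for the example.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    entity: str
    attribute: str
    value: object
    tx: int      # transaction number, increasing over time
    added: bool  # True = assertion, False = retraction


class FactLog:
    """Append-only log of facts (hypothetical); nothing is updated or deleted in place."""

    def __init__(self):
        self._facts = []
        self._next_tx = 1

    def transact(self, changes):
        """Record a batch of (entity, attribute, value, added) changes as one transaction."""
        tx = self._next_tx
        self._next_tx += 1
        self._facts.extend(Fact(e, a, v, tx, added) for e, a, v, added in changes)
        return tx

    def as_of(self, tx):
        """Return {(entity, attribute): {values}} as it stood after transaction `tx`."""
        state = {}
        for fact in self._facts:
            if fact.tx > tx:
                break  # the log is ordered by transaction number
            values = state.setdefault((fact.entity, fact.attribute), set())
            if fact.added:
                values.add(fact.value)
            else:
                values.discard(fact.value)
        return {key: vals for key, vals in state.items() if vals}


log = FactLog()
t1 = log.transact([("alice", "email", "alice@old.example", True)])
t2 = log.transact([
    ("alice", "email", "alice@old.example", False),  # retract the old value
    ("alice", "email", "alice@new.example", True),   # assert the new one
])

print(log.as_of(t1))  # {('alice', 'email'): {'alice@old.example'}}
print(log.as_of(t2))  # {('alice', 'email'): {'alice@new.example'}}
```

Because a change is recorded as a retraction plus a new assertion rather than an overwrite, the earlier state stays queryable: asking for the database as of `t1` still returns the old address.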
We’ll take a tour of the Datomic data model, transactions, query language, and architecture to highlight some of the unique attributes of Datomic and why it is an ideal modern database.
Beyond a Big Data Pilot: Building a Production Data Infrastructure
Creating a data architecture involves many moving parts. By examining the data value chain, from ingestion through to analytics, we will explain how the various parts of the Hadoop and big data ecosystem fit together to support batch, interactive and real-time analytical workloads.
By tracing the flow of data from source to output, we’ll explore the options and considerations for components, including data acquisition, ingestion, storage, data services, analytics and data management. Most importantly, we’ll leave you with a framework for understanding these options and making choices.
Big Data Past, Present and Future – Where are we Headed?
Rob Peglar was one of the speakers at the very first StampedeCon. Following up on that talk from two years ago, Rob will present an overview of, and insight into, the technologies and system approaches to computing, transport and storage of big data: where we’ve been, where we are now and where we are headed. There is a major ‘fork in the road’ upcoming in the treatment and business application of big data and the technology that surrounds it, one that is important enough to change the course of the methodologies and approaches used by large and small businesses alike, especially for the infrastructure required either on premises or in the cloud.
Big Data Ethics
Neil M. Richards
Big data, broadly defined, is producing increased powers of institutional awareness and power that require the development of a Big Data Ethics. We are building a new digital society, and the values we build or fail to build into our new digital structures will define us. Critically, if we fail to balance the human values that we care about, like privacy, confidentiality, transparency, identity and free choice with the compelling uses of big data, our Big Data Society risks abandoning these values for the sake of innovation and expediency. First, we’ll trace the origins and rapid growth of the Information Revolution. Second, we’ll call for the development of a “Big Data Ethics,” a set of four related principles that should govern data flows in our information society, and inform the establishment of big data norms such as confidentiality, transparency, and the protection of identity. Finally, we’ll suggest how we might integrate big data ethics into our society. Law will be an important part of Big Data Ethics, but so too must the establishment of ethical principles and best practices that guide government, corporations, and users. We must all be part of the conversation, and part of the solution. Big Data Ethics are for everyone.
GPUs in Big Data
Modern graphics processing units (GPUs) are massively parallel general-purpose processors that are taking Big Data by storm. In terms of power efficiency, compute density, and scalability, it is clear now that commodity GPUs are the future of parallel computing. In this talk, we will cover diverse examples of how GPUs are revolutionizing Big Data in fields such as machine learning, databases, genomics, and other computational sciences.
Making Machine Learning work in Practice
Kilian Q. Weinberger
Here I will go over common pitfalls and tricks on how to make machine learning work.
The Evolution of Data Analysis with Hadoop
This session will lead the audience through the evolution of data analysis in Hadoop to illustrate its progression from the original low-level, batch-oriented MapReduce approach to today’s higher-level interactive tools that require very little technical knowledge. We’ll discuss Apache Crunch, Hive, Impala and Solr.
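For readers who have not seen the low-level MapReduce model that this evolution starts from, here is a single-process Python sketch of its three phases applied to a word count. It is a conceptual illustration only, not Hadoop code; the function names and sample lines are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big ideas", "Data at scale"])
print(counts)  # {'big': 2, 'data': 2, 'ideas': 1, 'at': 1, 'scale': 1}
```

In a higher-level tool such as Hive, the same computation shrinks to a single declarative query of the form `SELECT word, COUNT(*) FROM words GROUP BY word` (with `words` as a hypothetical table), which is the kind of progression the session traces.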
While the nature of this talk is somewhat technical, no prior knowledge of Hadoop or any specific programming language is required. Frequent live demonstrations of the tools discussed will emphasize that analyzing data in Hadoop can be as easy as using a relational database or Internet search engine.