The following is a guest post from John Mallory, CTO of Analytics in EMC’s Emerging Technology Division.  John will be in St. Louis at the StampedeCon 2016 Big Data Conference presenting Best Practices For Building & Operating A Managed Data Lake.

The data lake architecture is a hot topic these days because of the scalable analytics platform it enables, and as a result there are many definitions floating around. Gartner’s Nick Heudecker defines a data lake as “an enterprise wide data storage and management platform for analyzing disparate sources of data in its native format. The idea is simple: instead of placing data in a purpose built data store, you move it into a data lake in its original format.”

This is a fairly neutral definition that captures the essence, but let’s break it down. Building on Nick’s definition, I’d like to propose that a true data lake has the following attributes:

  • Scale-out architecture that can be expanded non-disruptively and on-demand
  • Capacity and performance scale predictably as new nodes/resources are added
  • Multi-protocol, so that data can be ingested via native protocols and in its native form
  • Apache HDFS/Hadoop capable, to leverage the broad Apache ecosystem of analytics tools
  • Enterprise data protection, security, data governance and availability capabilities
  • Integrates with existing relational databases (RDBMSs) and enterprise data warehouses (EDWs)
  • Allows for integration of multiple software vendors and best-of-breed hardware components

Distinctions from Traditional Storage Platforms

Given the above, how is a data lake different from most existing storage platforms? Many modern storage platforms already provide enterprise data protection, security, and data governance, and can integrate with many data sources and with existing RDBMSs and EDWs via standard NAS protocols, but that isn’t enough for a data lake. The two primary differences are a native scale-out architecture and native in-place HDFS capabilities.

Many existing NAS or object platforms can scale. (Block platforms can scale too, but since the majority of “new” data is unstructured, it makes more sense to have a platform with an integrated filesystem.) A key data lake capability, however, is scaling out by symmetrically and intelligently sharding data across multiple spindles and nodes, so that performance grows linearly as storage nodes are added and capacity increases.
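To make the scale-out idea concrete, here is a minimal Python sketch of hash-based sharding. It is illustrative only: the node names and keys are made up, and real platforms typically use consistent hashing or similar schemes so that adding a node moves as little data as possible. The point is simply that data is spread evenly across whatever nodes exist, so adding a node adds both capacity and aggregate throughput.

```python
import hashlib
from collections import Counter

def shard_for(key: str, nodes: list[str]) -> str:
    """Pick a node for a key by hashing the key across the current node list."""
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

nodes = [f"node-{i}" for i in range(4)]
keys = [f"file-{i}" for i in range(10_000)]

# With 4 nodes, each node holds roughly 1/4 of the data and serves 1/4 of the I/O.
print(Counter(shard_for(k, nodes) for k in keys))

# Adding a node spreads the same keys across 5 nodes: capacity and aggregate
# throughput both grow with the node count.
nodes.append("node-4")
print(Counter(shard_for(k, nodes) for k in keys))
```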

A second key capability is that the data lake platform must not just support HDFS, but be able to run Hadoop analytics on data in place, and do so performantly. Many newer NAS and object storage platforms support a Hadoop Compatible File System (HCFS) interface, which allows them to be integrated with existing Hadoop clusters, but they don’t integrate native HDFS NameNode and DataNode capabilities, which compromises performance, scalability, and in-place analytics.
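The practical consequence is data locality. The toy sketch below (illustrative only, not Hadoop’s actual scheduler; the block and node names are hypothetical) shows why a filesystem that exposes block locations, as the HDFS NameNode does, lets the platform run compute on the nodes that already hold the data, while an opaque HCFS layer forces remote reads.

```python
# Illustrative toy task placement, not Hadoop's actual scheduler.
# block_locations maps each block to the nodes holding a replica; an opaque
# HCFS gateway that hides placement effectively reports nothing.

def place_task(block, block_locations, free_nodes):
    """Return (chosen_node, is_data_local) for one task reading `block`."""
    for node in block_locations.get(block, []):
        if node in free_nodes:
            return node, True                 # compute runs where the data lives
    return sorted(free_nodes)[0], False       # fall back to a remote read

block_locations = {
    "blk_001": ["node-1", "node-3"],
    "blk_002": ["node-2", "node-4"],
    "blk_003": [],                            # placement unknown (opaque layer)
}
free_nodes = {"node-1", "node-2", "node-3", "node-4"}

for blk in block_locations:
    node, local = place_task(blk, block_locations, free_nodes)
    print(blk, "->", node, "(data-local)" if local else "(remote read)")
```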

Is Hadoop a Data Lake?

On the flip side, many are calling a traditional DAS Hadoop implementation a data lake. DAS Hadoop (Hadoop/HDFS servers with internal or directly attached storage) certainly meets the scaling criteria, supports the Apache Hadoop ecosystem, and can easily integrate with existing RDBMSs and EDWs; however, there are two key areas where DAS Hadoop falls short of meeting all of the data lake criteria: true multi-protocol capabilities and true “best of breed” enterprise capabilities.

A DAS Hadoop cluster uses Apache HDFS as its underlying filesystem. Hadoop has many varied data connectors (like Sqoop, Flume, and NiFi), but it currently cannot natively support NAS protocols (NFS and SMB). There are NFS gateway approaches, but these are not native protocol implementations and have shortcomings in performance, scalability, and availability. This gap arises because HDFS is by design not a true POSIX filesystem, so it cannot natively support POSIX-style protocols like NFS and SMB.
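To illustrate the ingestion gap, here is a rough sketch of the two flows. The file paths and mount points are hypothetical; the `hdfs dfs` commands are the standard Hadoop CLI. With DAS Hadoop, a file landed on a file server over NFS or SMB has to be copied into HDFS before any job can read it; on a multi-protocol data lake the same file is already visible at an HDFS path.

```python
import subprocess

# Hypothetical path: a file written by an application over NFS or SMB.
nfs_file = "/mnt/landing/events.csv"

# DAS Hadoop: HDFS is its own non-POSIX namespace, so the file must be copied
# in through a client or gateway before any Hadoop job can touch it.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/landing"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", nfs_file, "/landing/events.csv"], check=True)

# Multi-protocol data lake: the NFS/SMB export and the HDFS path are two views
# of the same directory, so the file is already visible to Hadoop with no copy.
subprocess.run(["hdfs", "dfs", "-ls", "/landing"], check=True)
```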

Apache has also made many strides in providing enterprise security and availability features for Hadoop, but key gaps remain compared with the state of the art in existing enterprise storage solutions. The biggest gaps are efficiency (Hadoop uses three full copies for data protection, versus RAID or erasure coding, which is only coming in Hadoop 3.0), the ability to provide four or five nines of data availability, and the ability to disaggregate Hadoop compute from the HDFS storage layer to support virtualization and true multi-tenancy.
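The efficiency and availability gaps are easy to quantify with some back-of-the-envelope arithmetic. The sketch below assumes a Reed-Solomon 10+4 erasure-coding layout, one common choice; other layouts change the exact numbers but not the shape of the comparison.

```python
# Storage overhead: raw bytes stored per byte of user data.
replication_factor = 3                 # default HDFS triple replication
rs_data, rs_parity = 10, 4             # assumed Reed-Solomon 10+4 layout

replication_overhead = replication_factor             # 3.0x raw per usable
erasure_overhead = (rs_data + rs_parity) / rs_data     # 1.4x raw per usable
print(f"3x replication: {replication_overhead:.1f}x raw capacity per usable TB")
print(f"RS(10,4) erasure coding: {erasure_overhead:.1f}x raw capacity per usable TB")

# Availability: allowed downtime per year for "four nines" vs "five nines".
minutes_per_year = 365 * 24 * 60
for label, availability in [("four nines", 0.9999), ("five nines", 0.99999)]:
    downtime = minutes_per_year * (1 - availability)
    print(f"{label} allows ~{downtime:.0f} minutes of downtime per year")
```

In other words, triple replication needs roughly 3 TB of raw disk per usable TB while RS(10,4) needs about 1.4 TB, and five nines of availability allows only about 5 minutes of downtime per year versus roughly 53 minutes for four nines.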

Why Not Implement with an Existing Data Warehouse?

Finally, why not use the existing Enterprise Data Warehouse to implement a data lake? EDWs certainly meet most of these criteria, particularly since many leading EDW vendors are now integrating Hadoop or Big Data appliances with their EDWs to cover both structured and unstructured data types and to leverage Apache Hadoop analytics. The two biggest drawbacks of this approach are the cost of enterprise data storage for the EDW, and the fact that the Hadoop implementation in these solutions still relies on traditional DAS Hadoop, with the drawbacks covered above.

Virtually all technology solutions—like most things in life—require some tradeoffs and compromises; the data lake is no exception to this rule. We’ve covered what a data lake should be, and some of the challenges and tradeoffs of implementing it with three major categories of existing solutions.

Are there better ways of building a data lake today? We believe so, and I’ll be in St. Louis sharing more details at the StampedeCon 2016 Big Data Conference  during my session Best Practices For Building & Operating A Managed Data Lake.  Hope to see you there!