Apache Hive is a good tool for performing ETL and basic analytics but is limited in statistical analysis and data exploration capabilities. R, on the other hand, has become a preferred language for analytics, as it offers a wide variety of statistical and graphical packages. The downside is that R is single-threaded and memory-intensive, making it difficult to work with data at scale.
At StampedeCon 2014 on May 29-30 in St. Louis, RichRelevance’s Senior Software Engineer Sukhendu Chakraborty will discuss the R-to-Hive connector package they’ve developed to meet their need to use R feasibly on terabytes of data while keeping their R code independent of the data source.
A number of existing open-source packages, such as RHive, RHadoop and RHIPE, address R’s scalability to different extents, and, with some work, there are ways to link R to optimized multi-threaded libraries. In a detailed overview of RichRelevance’s approach, Chakraborty cites these advantages and differentiators of an R-to-Hive connector:
- R proxy object extensions: The connector would provide much-needed transparency between R and both the data source (Hive, Postgres, etc.) and the computation platform (Hadoop).
- Pluggable query generation: The connector could separate the physical query generation specific to a data source from the analytical logic written in R, making the data source pluggable.
- R as a scalable analytical tool: Data analysts are likely to perform the following steps once they have the raw data, and all of them can be performed using the proposed R connector:
  - Data cleanup: Filter out NAs, outliers, etc.
  - Ad-hoc analytics: fivenum, sd, summary, etc.
  - Data preparation: Joining tables, projecting out columns, filtering values, etc.
  - Distributed analytics using Hadoop: Submit MapReduce jobs on the Hadoop cluster from R and apply rich R analytical functions to the data sets in a distributed fashion.
  - Result summarization and publishing: Accumulated results are readily available in the client for inference, reporting, etc.
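The cleanup, ad-hoc analytics, and preparation steps above correspond to ordinary R idioms. Since the connector's API is not public, the sketch below uses only base R on a small local data frame to illustrate those steps; the idea is that the connector's proxy objects would let the same kind of code run transparently against Hive tables instead. The table contents and column names here are invented for illustration.

```r
# Raw data with a missing value and an obvious outlier (invented sample data).
events <- data.frame(user  = c("a", "b", "c", "d", "e"),
                     spend = c(10.5, NA, 12.0, 9.8, 5000),
                     stringsAsFactors = FALSE)

# Data cleanup: filter out NAs and outliers.
clean <- events[!is.na(events$spend) & events$spend < 1000, ]

# Ad-hoc analytics: five-number summary, standard deviation, summary.
fivenum(clean$spend)
sd(clean$spend)
summary(clean$spend)

# Data preparation: join against a second table, then project out columns.
regions  <- data.frame(user   = c("a", "c", "d"),
                       region = c("US", "EU", "US"),
                       stringsAsFactors = FALSE)
prepared <- merge(clean, regions, by = "user")[, c("user", "region", "spend")]
prepared
```

With a proxy-object connector, `events` would stand in for a Hive table and each of these operations would generate the corresponding physical query rather than run on local memory.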
Sukhendu Chakraborty will be speaking at StampedeCon 2014, walking through a series of use cases that demonstrate how this approach allows his team to bridge the gap between R and Hive, making big data analysis using R on terabytes of data feasible while keeping the data source agnostic to the tools used to access it.