Apache Hive is a good tool for performing ETL and basic analytics but is limited in statistical analysis and data exploration capabilities. R, on the other hand, has become a preferred language for analytics, as it offers a wide variety of statistical and graphical packages. The downside is that R is single-threaded and memory-intensive, making it difficult to work with data at scale.

At StampedeCon 2014 on May 29-30 in St. Louis, RichRelevance’s Senior Software Engineer Sukhendu Chakraborty will discuss the R to Hive connector package they’ve developed, which makes it feasible to use R on terabytes of data while keeping the R code independent of the data source.

A number of existing open-source packages, such as RHive, RHadoop and RHIPE, address R’s scalability to different extents. And, with some work, there are also ways to link R to optimized multi-threaded libraries. In a detailed overview of RichRelevance’s approach, Chakraborty cites these advantages and differentiators of having an R to Hive connector:

  • R proxy object extensions: The connector would provide the much-needed transparency between R and the data source (Hive, Postgres, etc.) and computation platform (Hadoop).
  • Pluggable query generation: The connector could separate physical query generation pertaining to a data source from the analytical logic written in R, thus making the data source pluggable.
  • R as a scalable analytical tool: Data analysts are likely to perform the following steps once they have the raw data. All of these steps can be performed using the proposed R connector:
    1. Data cleanup: Filter out NAs, outliers, etc.
    2. Ad-hoc analytics: fivenum, sd, summary, etc.
    3. Data preparation: Joining tables, projecting columns, filtering values, etc.
    4. Distributed analytics using Hadoop: Submit MapReduce jobs on the Hadoop cluster from R and apply rich R analytical functions on the data sets in a distributed fashion
    5. Result summarization and publishing: Accumulated results available readily in the client for inferences, reporting, etc.
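The analyst workflow above can be sketched in base R. The talk describes Hive tables surfacing in R as proxy objects; since that connector is not public, this is only an illustration in which local data frames (the made-up `clicks` and `users` tables) stand in for Hive-backed proxies, so the same cleanup, ad-hoc statistics, join, and summarization steps can be shown:

```r
# Hypothetical stand-ins for Hive tables exposed as R proxy objects;
# in the connector described in the talk these would be backed by Hive.
clicks <- data.frame(user_id = c(1, 2, 2, 3, NA),
                     dwell   = c(10.5, NA, 7.2, 120.0, 3.1))
users  <- data.frame(user_id = c(1, 2, 3),
                     region  = c("us", "eu", "us"))

# 1. Data cleanup: filter out NAs and an obvious outlier
clean <- subset(clicks, !is.na(user_id) & !is.na(dwell) & dwell < 100)

# 2. Ad-hoc analytics: five-number summary and standard deviation
fivenum(clean$dwell)
sd(clean$dwell)

# 3. Data preparation: join to the users table and project out columns
joined <- merge(clean, users, by = "user_id")[, c("region", "dwell")]

# 5. Result summarization: mean dwell time per region, ready for reporting
aggregate(dwell ~ region, data = joined, FUN = mean)
```

Step 4 (submitting MapReduce jobs from R) is omitted here since it requires a live Hadoop cluster; the point of the proxy-object design is that the steps shown would look the same whether the tables live in memory or in Hive.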

Sukhendu Chakraborty will be speaking at StampedeCon 2014 and will go through a series of use cases demonstrating how this approach allows his team to bridge the gap between R and Hive, making big data analysis using R on terabytes of data feasible while keeping the tools used to access the data agnostic to the data source.