Chapter 3 R & Microsoft R Server - todo

3.1 Functional Programming and Lazy Evaluation in R

As we mentioned in Section 2.1, Spark takes advantage of the functional programming paradigm and lazy evaluation to optimize it’s operations and improve upon algorithmic complexity. R is also at it’s heart a functional programming langauge. Moreover, the arguments in a function are evaluated lazily by R: only evaluated if they’re actually used, and only when they’re needed. This allows R to be highly expressive, capable of doing many intricate things with few lines of code, but also causes R to have a rather heavy memory footprint.

Many packages for R have been written to take advantage of it’s lazy, and non-standard evaluation procedures. Most famously, the dplyr package utlizes R’s NSE mechanism to have it’s functions connect to backends in different databases, translating R into syntax that can be understood and evaluated by those backends.

Thee RevoScaleR similarly reimagines R’s algorithms as distributable C++ code, taht can be optimized and compiled in various compute contexts.

3.2 PEMA Algorithms and the RevoScaleR Package

Parallel External Memory Algorithms (PEMA) are algorithms that do not require all the data to be in memory at one time. This is in contrast to all of the functions in base R, and most of the methods in additional R packages.

RevoScaleR implements many of the most popular statistical models as PEMAs, so that they can be estimated on data that is far larger than the memory capacity of your platform. Moreover, the PEMA algorithms exposed by RevoScaleR cn run on a variety of different platforms, including Spark, and therefore, can scale out very efficiently.

3.3 eXternal Data Frames (XDFs)

The most efficient way to use PEMA algorithms is to apply them to the novel data type provided by RevoScaleR, XDFs, short for external data frames. Unlike traditional R objects, and R data.frames, a xdf is a chunked data set that can be stored in a variety of data stores: disks, Hadoop distributed file systems, SQL Server databases, and many more. Rather than bringing your data into R, we move the analytics from R and into your data stores, and utilize the high performance PEMA algorithms to estimate models as efficiently as possible.

3.4 Compute Contexts