Appendix G. Working with large datasets

R holds all of its objects in virtual memory. For most of us, this design decision has led to a zippy interactive experience, but for analysts working with large datasets, it can lead to slow program execution and memory-related errors.

Memory limits will depend primarily on the R build (32-bit versus 64-bit) and, for 32-bit Windows, on the version of the operating system involved. Error messages beginning with "cannot allocate vector of size" typically indicate a failure to obtain sufficient contiguous memory, while error messages beginning with "cannot allocate vector of length" indicate that an address limit has been exceeded. When working with large datasets, use a 64-bit build if at all possible. For all builds, the number of elements in a vector is limited to 2,147,483,647 (see ?Memory for more information).

There are three issues to consider when working with large datasets: (a) efficient programming to speed execution, (b) storing data externally to limit memory issues, and (c) using specialized statistical routines designed to efficiently analyze massive amounts of data. We will briefly consider each.

G.1. Efficient programming

There are a number of programming tips that improve performance when working with large datasets.

  • Vectorize calculations when possible. Use R’s built-in functions for manipulating vectors, matrices, and lists (for example, sapply, lapply, and mapply) and avoid loops (for and while) when feasible. (A brief vectorization sketch follows this list.)
  • Use matrices rather than data frames (they have less overhead).
  • When using the read.table() family of functions to read external data into data frames, specify the colClasses and nrows options explicitly, set comment.char = "", and specify "NULL" in colClasses for columns that aren’t needed. This will decrease memory usage and speed up processing considerably. When reading external data into a matrix, use the scan() function instead. (A hedged example appears after this list.)
  • Test programs on subsets of the data, in order to optimize code and remove bugs, before attempting a run on the full dataset.
  • Delete temporary objects and objects that are no longer needed. The call rm(list=ls()) will remove all objects from memory, providing a clean slate. Specific objects can be removed with rm(object).
  • Use the function .ls.objects(), described in Jeromy Anglim’s blog entry “Memory Management in R: A Few Tips and Tricks” (jeromyanglim.blogspot.com), to list all workspace objects sorted by size (MB). This function will help you find and deal with memory hogs.
  • Profile your programs to see how much time is being spent in each function. You can accomplish this with the Rprof() and summaryRprof() functions; the system.time() function can also help. The profr and proftools packages provide functions for analyzing profiling output. (A short profiling sketch follows this list.)
  • The Rcpp package can be used to transfer R objects to C++ functions and back when more optimized subroutines are needed.
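The sketch below illustrates the first bullet: the same sum of squares computed with an explicit loop and with vectorized built-in functions. The object names and the simulated data are made up for illustration.

set.seed(1234)
x <- rnorm(1e6)                       # one million random values

# Loop version: accumulates element by element
loop_sum <- 0
for (i in seq_along(x)) {
  loop_sum <- loop_sum + x[i]^2
}

# Vectorized version: a single call to built-in, compiled functions
vec_sum <- sum(x^2)

# Timing both approaches shows the vectorized form is dramatically faster
system.time({s <- 0; for (i in seq_along(x)) s <- s + x[i]^2})
system.time(sum(x^2))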
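Next, a hedged sketch of the read.table() advice; the file names, column count, and column classes are assumptions chosen for illustration.

# Declaring classes avoids type guessing; "NULL" drops the third column entirely
myclasses <- c("numeric", "factor", "NULL", "character", "numeric")
bigdata <- read.table("bigdata.csv", header = TRUE, sep = ",",
                      colClasses = myclasses,
                      nrows = 1000000,        # a (generous) estimate of the row count
                      comment.char = "")      # turn off comment scanning

# For purely numeric data destined for a matrix, scan() is faster still
m <- matrix(scan("bigmatrix.txt", what = numeric(), quiet = TRUE),
            ncol = 5, byrow = TRUE)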
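Finally, a short profiling sketch. The slow_colmeans() function is a deliberately inefficient stand-in for whatever code you want to examine.

slow_colmeans <- function(df) {            # column means via an explicit loop
  out <- numeric(ncol(df))
  for (j in seq_len(ncol(df))) out[j] <- mean(df[[j]])
  out
}
df <- as.data.frame(matrix(rnorm(2e6), ncol = 20))

Rprof("profile.out")                       # start collecting profiling data
for (i in 1:50) slow_colmeans(df)
Rprof(NULL)                                # stop profiling
summaryRprof("profile.out")$by.self        # where the time went, function by function

system.time(for (i in 1:50) slow_colmeans(df))   # quick overall timing of one expression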

With large datasets, increasing code efficiency will only get you so far. When bumping up against memory limits, you can also store your data externally and use specialized analysis routines.

G.2. Storing data outside of RAM

There are several packages available for storing data outside of R’s main memory. The strategy involves storing data in external databases or in binary flat files on disk, and then accessing portions as they are needed. Several useful packages are described in table G.1.

Table G.1. R packages for accessing large datasets

Package                                        Description
ff                                             Provides data structures that are stored on disk but behave as if they were in RAM.
bigmemory                                      Supports the creation, storage, access, and manipulation of massive matrices. Matrices are allocated to shared memory and memory-mapped files.
filehash                                       Implements a simple key-value database where character string keys are associated with data values stored on disk.
ncdf, ncdf4                                    Provides an interface to Unidata netCDF data files.
RODBC, RMySQL, ROracle, RPostgreSQL, RSQLite   Each provides access to external relational database management systems.
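As a brief illustration of the file-backed approach, here is a hedged sketch using the bigmemory package; the file names and matrix dimensions are assumptions chosen for illustration, not a prescription.

library(bigmemory)

# Create a file-backed big.matrix; the data live on disk, not in RAM
x <- filebacked.big.matrix(nrow = 1e6, ncol = 10, type = "double",
                           backingfile = "analysis.bin",
                           descriptorfile = "analysis.desc")
x[1:5, ] <- rnorm(50)            # ordinary subscripting reads and writes portions on demand

# In a later session (or another R process), reattach the same data
# from its descriptor file without reloading it into RAM
y <- attach.big.matrix("analysis.desc")
mean(y[1:5, 1])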

The packages above help overcome R’s memory limits on data storage. However, specialized methods are also needed when attempting to analyze large datasets in a reasonable length of time. Some of the most useful are described below.

G.3. Analytic packages for large datasets

R provides several packages for the analysis of large datasets:

  • The biglm and speedglm packages fit linear and generalized linear models to large datasets in a memory-efficient manner, offering lm()- and glm()-type functionality when dealing with massive datasets. (A chunked-fitting sketch follows this list.)
  • Several packages offer analytic functions for working with the massive matrices produced by the bigmemory package. The biganalytics package offers k-means clustering, column statistics, and a wrapper to biglm. The bigtabulate package provides table(), split(), and tapply() functionality, and the bigalgebra package provides advanced linear algebra functions.
  • The biglars package offers least-angle regression, lasso, and stepwise regression for datasets that are too large to be held in memory, when used in conjunction with the ff package.
  • The Brobdingnag package can be used to manipulate large numbers (numbers larger than 2^1024).
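The chunked-fitting idea behind biglm can be sketched as follows. The simulated data, chunk size, and the make_chunk() helper are all hypothetical, but the pattern of fitting on one chunk and then updating with later chunks is what lets the model be estimated without holding the full dataset in memory.

library(biglm)

# Hypothetical helper that returns one chunk of data at a time
make_chunk <- function(n) {
  x1 <- rnorm(n); x2 <- rnorm(n)
  data.frame(x1 = x1, x2 = x2, y = 1 + 2 * x1 - x2 + rnorm(n))
}

# Fit on the first chunk, then update with subsequent chunks;
# only the current chunk needs to be in memory at any one time
fit <- biglm(y ~ x1 + x2, data = make_chunk(100000))
for (i in 1:9) fit <- update(fit, make_chunk(100000))

summary(fit)
coef(fit)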
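And a very small Brobdingnag illustration (the particular values are arbitrary):

library(Brobdingnag)
2^1500               # overflows double precision: Inf
as.brob(2)^1500      # represented on a log scale, so arithmetic still works
as.brob(2)^1500 * as.brob(10)^100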

Working with datasets in the gigabyte to terabyte range can be challenging in any language. For more information on the methods available within R, see the CRAN Task View: High-Performance and Parallel Computing with R (cran.r-project.org/web/views/).
