Chapter 24. Building R Packages

As of late July 2013, there were 4,714 packages on CRAN and another 671 on Bioconductor, with more being added daily. In the past, building a package could be mystifying and complicated, but that is no longer the case, especially when using Hadley Wickham’s devtools package.

All packages submitted to CRAN (or Bioconductor) must follow specific guidelines, including the folder structure of the package, inclusion of DESCRIPTION and NAMESPACE files and proper help files.

24.1. Folder Structure

An R package is essentially a folder of folders, each containing specific files. At the very minimum there must be two folders, one called R where the included functions go, and the other called man where the documentation files are placed. It used to be that the documentation had to be written manually, but thanks to roxygen2 that is no longer necessary, as shown in Section 24.3. Starting with R 3.0.0, CRAN is very strict in requiring that all files must end with a blank line and that lines of code in examples must be shorter than 105 characters.

In addition to the R and man folders, other common folders are src for compiled code such as C++ and FORTRAN, data for data that is included in the package and inst for files that should be available to the end user. No files from the other folders are available in a human-readable form (except the INDEX, LICENSE and NEWS files in the root folder) when a package is installed. Table 24.1 lists the most common folders used in an R package.

Table 24.1 Folders Used in R Packages (While there are other possible folders, these are the most common)
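
As a rough sketch of the layout described above, the following base R code creates an empty skeleton in a temporary directory. The package name examplePkg and the particular optional folders are assumptions chosen only for illustration.

```r
# Hypothetical package name and location, used only for illustration
pkg_root <- file.path(tempdir(), "examplePkg")

# Minimum folders: R for functions, man for documentation
required_dirs <- c("R", "man")

# Common optional folders described in the text
optional_dirs <- c("src", "data", "inst")

for (d in c(required_dirs, optional_dirs)) {
    dir.create(file.path(pkg_root, d), recursive = TRUE, showWarnings = FALSE)
}

# Root-level files that every package needs (left empty here)
file.create(file.path(pkg_root, c("DESCRIPTION", "NAMESPACE")))

# Show what was created
list.files(pkg_root)
```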

24.2. Package Files

The root folder of the package must contain at least a DESCRIPTION file and a NAMESPACE file, which are described in Sections 24.2.1 and 24.2.2. Other files like NEWS, LICENSE and README are recommended but not necessary. Table 24.2 lists commonly used files.

Table 24.2 Files Used in R Packages (While there are other possible files, these are the most common)

24.2.1. DESCRIPTION File

The DESCRIPTION file contains information about the package, such as its name, version, author and other packages it depends on. The information is entered, each on one line, as Item1: Value1. Table 24.3 lists a number of fields that are used in DESCRIPTION files.

Table 24.3 Fields in the DESCRIPTION File

The Package field specifies the name of the package. This is the name that appears on CRAN and how users access the package.

Type is a bit archaic; it can be either Package or one other type, Frontend, which is used for building a graphical front end to R and will not be helpful for building an R package of functions.

Title is a short description of the package. It should be relatively brief and cannot end in a period. Description is a complete description of the package, which can be several sentences long but no longer than a paragraph.

Version is the package version and usually consists of three period-separated integers; for example, 1.15.2. Date is the release date of the current version.
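
One reason to keep to period-separated integers is that R can then compare versions component by component. The short sketch below uses base R's package_version; the second version string is made up for the comparison.

```r
# Parse version strings into comparable objects
old_version <- package_version("1.15.2")
new_version <- package_version("1.15.10")   # hypothetical later release

# Components are compared as integers, so 10 counts as greater than 2
new_version > old_version

# A plain string comparison looks at characters instead and disagrees
"1.15.10" > "1.15.2"

# The three integer components of a version
unclass(old_version)[[1]]
```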

The Author and Maintainer fields are similar but both are necessary. Author can be multiple people, separated by commas, and Maintainer is the person in charge, or rather the person who gets complained to, and should be a name followed by an email address inside angle brackets (<>). An example is Maintainer: Jared P. Lander <[email protected]>. CRAN is actually very strict about the Maintainer field and can reject a package for not having the proper format.
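
The expected shape of the Maintainer field can be sketched with a regular expression. The pattern, the function name is_valid_maintainer and the example.com address are all assumptions for illustration; this is not the check CRAN itself runs.

```r
# Illustrative pattern: a name, a space, then an address in angle brackets.
# This is an assumption for demonstration, not CRAN's actual check.
maintainer_pattern <- "^[^<>]+ <[^<>@ ]+@[^<>@ ]+>$"

is_valid_maintainer <- function(x) {
    grepl(maintainer_pattern, x)
}

# The example.com addresses are placeholders
is_valid_maintainer("Jane Doe <jane.doe@example.com>")   # well formed
is_valid_maintainer("Jane Doe jane.doe@example.com")     # no angle brackets
```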

License information goes in the appropriately named License field. It should be either an abbreviation of one of the standard specifications such as GPL-2 or BSD or the string 'file LICENSE' referring to the LICENSE file in the package’s root folder.

Things get tricky with the Depends, Imports and Suggests fields. Often a package requires functions from other packages. In that case the other package, for example, ggplot2, should be listed in either the Depends or Imports field as a comma-separated list. If ggplot2 is listed in Depends, then when the package is loaded so will ggplot2, and its functions will be available to functions in the package and to the end user. If ggplot2 is listed in Imports, then when the package is loaded ggplot2 will not be loaded, and its functions will be available to functions in the package but not the end user. Packages should be listed in one or the other, not both. Packages listed in either of these fields will be automatically installed from CRAN when the package is installed. If the package depends on a specific version of another package, then that package name should be followed by the version number in parentheses; for example, Depends: ggplot2 (>= 0.9.1). Packages that are needed for the examples in the documentation, vignettes or testing but are not necessary for the package’s functionality should be listed in Suggests.
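
To make the three dependency fields concrete, the sketch below writes a small DESCRIPTION file with base R's write.dcf and reads it back with read.dcf. The package name examplePkg, its version and the choice of testthat in Suggests are hypothetical.

```r
# Hypothetical DESCRIPTION fields for a package called examplePkg
desc <- data.frame(
    Package  = "examplePkg",
    Version  = "0.1.0",
    Depends  = "ggplot2 (>= 0.9.1)",   # attached for the package and the user
    Imports  = "plyr, stringr",        # available to the package only
    Suggests = "testthat",             # needed only for examples or tests
    stringsAsFactors = FALSE
)

desc_file <- file.path(tempdir(), "DESCRIPTION")
write.dcf(desc, file = desc_file)

# read.dcf returns a one-row character matrix, one column per field
fields <- read.dcf(desc_file)
fields[1, "Depends"]

# Split the comma-separated Imports field into package names
imports <- strsplit(fields[1, "Imports"], ",\\s*")[[1]]
imports
```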

The Collate field specifies the R code files contained in the R folder. This will be populated automatically if the package is documented using roxygen2 and devtools.

A relatively new feature is byte-compilation, which can significantly speed up R code. Setting ByteCompile to TRUE will ensure the package is byte-compiled when installed by the end user.

The DESCRIPTION file from coefplot is shown next.


Package: coefplot
Type: Package
Title: Plots Coefficients from Fitted Models
Version: 1.1.9
Date: 2013-01-23
Author: Jared P. Lander
Maintainer: Jared P. Lander <[email protected]>
Description: Plots the coefficients from a model object
License: BSD
LazyLoad: yes
Depends:
    ggplot2
Imports:
    plyr,
    stringr,
    reshape2,
    useful,
    scales,
    proto
Collate:
    'coefplot.r'
    'coefplot-package.r'
    'multiplot.r'
    'extractCoef.r'
    'buildPlottingFrame.r'
    'buildPlot.r'
    'dodging.r'
ByteCompile: TRUE


24.2.2. NAMESPACE File

The NAMESPACE file specifies which functions are exposed to the end user (not all functions in a package should be) and which other packages are imported into the NAMESPACE. Functions that are exported are listed as export(multiplot) and imported packages are listed as import(plyr). Building this file by hand can be quite tedious, so fortunately roxygen2 and devtools can, and should, build this file automatically.

R has three object-oriented systems: S3, S4 and Reference Classes. S3 is the oldest and simplest of the systems and is what we will focus on in this book. It consists of a number of generic functions such as print, summary, coef and coefplot. The generic functions exist only to dispatch object-specific functions. Typing print into the console shows this.

> print
function (x, ...)
UseMethod("print")
<bytecode: 0x00000000081e2bb0>
<environment: namespace:base>

It is a single-line function containing the command UseMethod("print"), which tells R to call another function depending on the class of the object passed. These can be seen with methods(print). To save space we show only 20 of the results. Functions not exposed to the end user are marked with an asterisk (*). All of the names are print and the object class separated by a period.

> methods(print)

 [1] print.aareg*                  print.abbrev*
 [3] print.acf*                    print.AES*
 [5] print.agnes*                  print.anova
 [7] print.Anova*                  print.anova.gam
 [9] print.anova.lme*              print.anova.loglm*
[11] print.aov*                    print.aovlist*
[13] print.ar*                     print.Arima*
[15] print.arima0*                 print.arma*
[17] print.AsIs                    print.aspell*
[19] print.aspell_inspect_context* print.balance*
 [ reached getOption("max.print") -- omitted 385 entries ]

   Non-visible functions are asterisked

When print is called on an object, it then calls one of these functions depending on the type of object. For instance, a data.frame is sent to print.data.frame and an lm object is sent to print.lm.
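
The same mechanism can be reproduced in a few lines. In this sketch the generic describe and its methods are hypothetical names invented for illustration; the dispatch itself works exactly as it does for print.

```r
# A hypothetical generic; its only job is to dispatch on the class of x
describe <- function(x, ...) {
    UseMethod("describe")
}

# Fallback used when no class-specific method exists
describe.default <- function(x, ...) {
    "an object with no special method"
}

# Method chosen automatically for objects of class data.frame
describe.data.frame <- function(x, ...) {
    paste("a data.frame with", nrow(x), "rows")
}

describe(1:3)                      # falls through to describe.default
describe(data.frame(a = 1:4))      # dispatches to describe.data.frame
```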

These different object-specific functions that get called by generic S3 functions must be declared in the NAMESPACE in addition to the functions that are exported. This is indicated as S3method(coefplot, lm) to say that coefplot.lm is registered with the coefplot generic function.

The NAMESPACE file from coefplot is shown next.


S3method(coefplot,default)
S3method(coefplot,glm)
S3method(coefplot,lm)
S3method(coefplot,rxGlm)
S3method(coefplot,rxLinMod)
S3method(coefplot,rxLogit)
S3method(extract.coef,default)
S3method(extract.coef,glm)
S3method(extract.coef,lm)
S3method(extract.coef,rxGlm)
S3method(extract.coef,rxLinMod)
S3method(extract.coef,rxLogit)
export(buildModelCI)
export(coefplot)
export(coefplot.default)
export(coefplot.glm)
export(coefplot.lm)
export(coefplot.rxGlm)
export(coefplot.rxLinMod)
export(coefplot.rxLogit)
export(collidev)
export(extract.coef)
export(multiplot)
export(plotcoef)
export(pos_dodgev)
import(ggplot2)
import(plyr)
import(proto)
import(reshape2)
import(scales)
import(stringr)
import(useful)


Even with a small package like coefplot, building the NAMESPACE file by hand can be tedious and error prone, so it is best to let devtools and roxygen2 build it.
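
Because NAMESPACE directives are written in R call syntax, a quick way to inspect one is to parse it like R code. The sketch below does this for a four-line excerpt written to a temporary file; it is only an illustration of the file's structure, not a substitute for the tools that generate it.

```r
# A short excerpt in the same style as the coefplot NAMESPACE
ns_lines <- c(
    "S3method(coefplot,lm)",
    "export(coefplot)",
    "export(multiplot)",
    "import(plyr)"
)
ns_file <- tempfile()
writeLines(ns_lines, ns_file)

# Each directive parses as an R call such as export(coefplot)
directives <- parse(file = ns_file)

# The function part of each call is the kind of directive
kinds <- vapply(directives, function(e) as.character(e[[1]]), character(1))
table(kinds)

# The first argument of each export() call is an exported function name
exported <- vapply(directives[kinds == "export"],
                   function(e) as.character(e[[2]]), character(1))
exported
```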

24.2.3. Other Package Files

The NEWS file is for detailing what is new or changed in each version. The three most recent entries in the coefplot NEWS file are shown next. Notice how it is good practice to thank people who helped with or inspired the update. This file will be available to the end user’s installation.

The LICENSE file is for specifying more detailed information about the package’s license and will be available to the end user’s installation. The LICENSE file from coefplot is shown here.

The README file is purely informational and is not included in the end user’s installation. Its biggest benefit may be for packages hosted on GitHub, where the README will be the information displayed on the project’s home page.

24.3. Package Documentation

A very strict requirement for R packages to be accepted by CRAN is proper documentation. Each exported function in a package needs its own .Rd file that is written in a LaTeX-like syntax. This can be difficult to write for even simple functions like the following one.

> simple.ex <- function(x, y)
+ {
+     return(x * y)
+ }

Even though it has only two arguments and simply returns the product of the two, it has a lot of necessary documentation, shown here.


\name{simple.ex}
\alias{simple.ex}
\title{within.distance}
\usage{simple.ex(x, y)}
\arguments{
    \item{x}{A numeric}
    \item{y}{A second numeric}
}
\value{x times y}
\description{Compute distance threshold}
\details{This is a simple example of a function}
\author{Jared P. Lander}
\examples{
    simple.ex(3, 5)
}


Rather than taking this two-step approach, it is better to write function documentation along with the function. That is, the documentation is written in a specially commented out block right above the function, as shown here.

> #' @title simple.ex
> #' @description Simple Example
> #' @details This is a simple example of a function
> #' @aliases simple.ex
> #' @author Jared P. Lander
> #' @export simple.ex
> #' @param x A numeric
> #' @param y A second numeric
> #' @return x times y
> #' @examples
> #' simple.ex(5, 3)
> simple.ex <- function(x, y)
+ {
+     return(x * y)
+ }

Running document from devtools will automatically generate the appropriate .Rd file based on the comment block above the function. Each line of the block begins with #'. Table 24.4 lists a number of commonly used roxygen2 tags.

Table 24.4 Tags Used in roxygen2 Documentation of Functions

Every argument must be documented with a @param tag, including the dots (...), which are written as \dots. There must be an exact correspondence between @param tags and arguments; one more or less will cause an error.

It is considered good form to show examples of a function’s usage. This is done on the lines following the @examples tag. In order to be accepted by CRAN all of the examples must work without error. In order to show, but not run, the examples, wrap them in \dontrun{...}.

Knowing the type of object is important when using a function, so @return should be used to describe the returned object. If the object is a list, the @return tag should be an itemized list of the form \item{name a}{description a}\item{name b}{description b}.
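
The tags discussed so far can be seen together in one small block. The function range_summary below is hypothetical and exists only to show a dots argument, a list-valued @return and an example that is displayed but not run; some_large_object is a placeholder that is never evaluated.

```r
#' @title range_summary
#' @description Summarize the range of a numeric vector
#' @details A hypothetical function used only to illustrate the tags
#' @export range_summary
#' @param x A numeric vector
#' @param ... Further arguments passed on to range
#' @return A list with two elements:
#' \item{low}{The smallest value in x}
#' \item{high}{The largest value in x}
#' @examples
#' range_summary(c(3, 1, 2))
#' \dontrun{
#' # shown in the help file but not executed
#' range_summary(some_large_object)
#' }
range_summary <- function(x, ...)
{
    limits <- range(x, ...)
    list(low = limits[1], high = limits[2])
}

# The roxygen2 lines are ordinary comments, so the function runs as usual
range_summary(c(3, 1, 2))
```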

Help pages are typically arrived at by typing ?FunctionName into the console. The @aliases tag uses a space-separated list to specify the names that will lead to a particular help file. For instance, using @aliases coefplot plotcoef will result in both ?coefplot and ?plotcoef leading to the same help file.

In order for a function to be exposed to the end user, it must be listed as an export in the NAMESPACE file. Using @export FunctionName automatically adds export(FunctionName) to the NAMESPACE file. Similarly, to use a function from another package, that package must be imported and @import PackageName adds import(PackageName) to the NAMESPACE file.

When building functions that get called by generic functions, such as coefplot.lm or print.anova, the @S3method tag should be used. @S3method GenericFunction Class adds S3method(GenericFunction,Class) to the NAMESPACE file. When using @S3method it is a good idea to also use @method with the same arguments. This is shown in the following function.

> #' @title print.myClass
> #' @aliases print.myClass
> #' @method print myClass
> #' @S3method print myClass
> #' @export print.myClass
> #' @param x Simple object
> #' @param ... Further arguments to be passed on
> #' @return The top 5 rows of x
> print.myClass <- function(x, ...)
+ {
+     class(x) <- "list"
+     x <- as.data.frame(x)
+     print.data.frame(head(x, 5))
+ }
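
A brief usage sketch shows the method being reached through the generic. The object obj and its contents are made up for illustration, and the method is repeated so the example is self-contained.

```r
# Same method as above, repeated so the example runs on its own
print.myClass <- function(x, ...)
{
    class(x) <- "list"
    x <- as.data.frame(x)
    print.data.frame(head(x, 5))
}

# A hypothetical object: a list of equal-length columns tagged with the class
obj <- structure(list(id = 1:10, letter = letters[1:10]), class = "myClass")

# print is the generic; R finds print.myClass from the class attribute,
# so only the first five rows are displayed
print(obj)
```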

24.4. Checking, Building and Installing

Building a package used to require going to the command prompt and using commands like R CMD check, R CMD build and R CMD INSTALL (in Windows it is Rcmd instead of R CMD), which required being in the proper directory, knowing the correct options and other bothersome time wasters. Thanks to Hadley Wickham, this has all been made much easier and can be done from within the R console.

The first step is to make sure a package is properly documented by calling document. The first argument is the path to the root folder of the package as a string. (If the current working directory is the same as the root folder, then no arguments are even needed. This is true of all the devtools functions.) This builds all the necessary .Rd files, the NAMESPACE file and the Collate field of the DESCRIPTION file.

> require(devtools)
> document()

After the package is properly documented, it is time to check it. This is done using check with the path to the package as the first argument. This will make note of any errors or warnings that would prevent CRAN from accepting the package. CRAN can be very strict, so it is essential to address all the issues.

> check()

Building the package is equally simple using the build function, which also takes the path to the package as the first argument. By default it builds a .tar.gz—a collection of all the files in the package—that still needs to be built into a binary that can be installed in R. It is portable in that it can be built on any operating system. The binary argument, if set to TRUE, will build a binary that is operating system specific. This can be problematic if compiled source code is involved.

> build()
> build(binary = TRUE)

Other functions to help with the development process are install, which rebuilds and loads the package, and load_all, which simulates the loading of the package and NAMESPACE.

Another great function, not necessarily for the development process so much as for getting other people’s latest work, is install_github, which can install an R package directly from a GitHub repository. There are analogous functions for installing from BitBucket (install_bitbucket) and Git (install_git) in general.

For instance, to get the latest version of coefplot the following code should be run. By the time of publication this might no longer be the latest version.

> install_github(repo = "coefplot", username = "jaredlander",
+                ref = "survival")

Sometimes an older version of a package on CRAN is needed, which under normal circumstances is hard to do without downloading source packages manually and building them. However, install_version was recently added to devtools, allowing a specific version of a package to be downloaded from CRAN, built and installed.

24.5. Submitting to CRAN

The best way to get a package out to the R masses is to have it on CRAN. Assuming the package passed the check using check from devtools, it is ready to be uploaded to CRAN using the new Web uploader (as opposed to using FTP) at http://xmpalantir.wu.ac.at/cransubmit/. The .tar.gz file is the one to upload. After submission, CRAN will send an email requiring confirmation that the package was indeed uploaded by the maintainer. Alternatively, the package can be uploaded by anonymous FTP to ftp://CRAN.R-project.org/incoming/ with an email sent to Uwe Ligges at [email protected] and to [email protected]. The subject line must be of the format CRAN Upload: PackageName PackageVersion. The name of the package is case sensitive and must match the name of the package in the DESCRIPTION file. The body of the message does not have to follow any guidelines, but should be polite and include the words “thank you” somewhere, because the CRAN team puts in an incredible amount of effort despite not getting paid.
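
Since the subject line must match the DESCRIPTION file exactly, one option is to build it from that file rather than typing it. The sketch below assumes a DESCRIPTION containing the coefplot name and version shown earlier, written to a temporary file for illustration.

```r
# Hypothetical DESCRIPTION holding the coefplot name and version used earlier
desc_file <- tempfile()
writeLines(c("Package: coefplot", "Version: 1.1.9"), desc_file)

# Take the name and version straight from DESCRIPTION so the case matches
desc <- read.dcf(desc_file, fields = c("Package", "Version"))

# Subject line in the required format
subject <- sprintf("CRAN Upload: %s %s",
                   desc[1, "Package"], desc[1, "Version"])
subject
```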

24.6. C++ Code

Sometimes R code is just not fast enough (even when byte-compiled) for a given problem and a compiled language must be used. R’s foundation in C and links to FORTRAN libraries (digging deep enough into certain functions, such as lm, reveals that the underpinnings are written in FORTRAN) make incorporating those languages fairly natural. .Fortran is used for calling a function written in FORTRAN and .Call is used for calling C and C++ functions.1 Even with those convenient functions, knowledge of either FORTRAN or C/C++ is still necessary, as is knowledge of how R objects are represented in the underlying language.

1. There is also a .C function, although despite much debate it is generally frowned upon.

Thanks to Dirk Eddelbuettel and Romain François, integrating C++ code has become much easier using the Rcpp package. It handles a lot of the scaffolding necessary to make C++ functions callable from R. Not only did they make developing R packages with C++ easier, but they also made running ad hoc C++ possible.

A number of tools are necessary for working with C++ code. First, a proper C++ compiler must be available. To maintain compatibility it is best to use gcc.

Linux users should already have gcc installed and should not have a problem, but they might need to install g++.

Mac users need to install Xcode and might have to manually select g++. The compiler offered on Mac generally lags behind the most recent version available, which has been known to cause some issues.

Windows users should actually have an easy time getting started, thanks to Rtools developed by Brian Ripley and Duncan Murdoch. It provides all necessary development tools, including gcc and make. The proper version, depending on the installed version of R, can be downloaded from http://cran.r-project.org/bin/windows/Rtools/ and installed like any other program. It installs gcc and makes the Windows command prompt act more like a Bash terminal. If building packages from within R using devtools and RStudio (which is the best way now), then the location of gcc will be determined from the operating system’s registry. If building packages from the command prompt, then the location of gcc must be put at the very beginning of the system PATH like c:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin;C:\Users\Jared\Documents\R\R-3.0.0\bin\x64.

A LaTeX distribution is needed for building package help documents and vignettes. Table 23.1 lists the primary distributions for the different operating systems.

24.6.1. sourceCpp

To start, we build a simple C++ function for adding two vectors. Doing so does not make sense from a practical point of view because R already does this natively and quickly, but it will be good for illustrative purposes. The function will have arguments for two vectors and return the element-wise sum. The // [[Rcpp::export]] tag tells Rcpp that the function should be exported for use in R.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector vector_add(NumericVector x, NumericVector y)
{
    // declare the result vector
    NumericVector result(x.size());

    // loop through the vectors and add them element by element
    for(int i=0; i<x.size(); ++i)
    {
        result[i] = x[i] + y[i];
    }
    return(result);
}

This function should be saved in a .cpp file (for example, vector_add.cpp) or as a character variable so it can be sourced using sourceCpp, which will automatically compile the code and create a new R function with the same name that, when called, executes the C++ function.

> require(Rcpp)
> sourceCpp("vector_add.cpp")

Printing the function shows that it points to a temporary location where the compiled function is currently stored.

> vector_add
function (x, y)
.Primitive(".Call")(<pointer: 0x0000000066e81710>, x, y)

The function can now be called just like any other R function.

> vector_add(x = 1:10, y = 21:30)

 [1] 22 24 26 28 30 32 34 36 38 40

> vector_add(1, 2)

[1] 3

> vector_add(c(1, 5, 3, 1), 2:5)

[1] 3 8 7 6
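
For readers more at home in R, the same loop can be sketched in plain R. The name vector_add_r and the length check are additions for illustration: the C++ version above sizes the result from x and does not compare the lengths of x and y, so it is safest to call it with vectors of equal length.

```r
vector_add_r <- function(x, y)
{
    # the C++ loop indexes y by the positions of x, so require equal lengths
    stopifnot(length(x) == length(y))

    # declare the result vector
    result <- numeric(length(x))

    # loop through the vectors and add them element by element
    for(i in seq_along(x))
    {
        result[i] <- x[i] + y[i]
    }
    return(result)
}

# the same calls that were made on the compiled function
vector_add_r(1:10, 21:30)
vector_add_r(c(1, 5, 3, 1), 2:5)
```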

JJ Allaire (the founder of RStudio) is responsible for sourceCpp, the // [[Rcpp::export]] shortcut and a lot of the magic that simplifies using C++ with R in general. Rcpp maintainer Dirk Eddelbuettel cannot stress enough how helpful Allaire’s contributions have been.

Another nice feature of Rcpp is the syntactic sugar that allows C++ code to be written like R. Using sugar we can rewrite vector_add with just one line of code.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector vector_add(NumericVector x, NumericVector y)
{
    return(x + y);
}

The syntactic sugar allowed two vectors to be added just as if they were being added in R.

Because C++ is a strongly typed language, it is important that function arguments and return types be explicitly declared using the correct type. Typical types are NumericVector, IntegerVector, LogicalVector, CharacterVector, DataFrame and List.
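
Each of these Rcpp types corresponds to one of R's internal storage types, which can be inspected from R with typeof. The sketch below pairs each Rcpp type name with a sample R object; the sample values are arbitrary.

```r
# Storage type of a sample R object for each Rcpp type named above
r_types <- c(
    NumericVector   = typeof(c(1.5, 2)),
    IntegerVector   = typeof(1:3),
    LogicalVector   = typeof(c(TRUE, FALSE)),
    CharacterVector = typeof(c("a", "b")),
    List            = typeof(list(1, "a")),
    DataFrame       = typeof(data.frame(x = 1:3))   # a list with a class
)
r_types
```

This also explains the earlier call vector_add(x = 1:10, y = 21:30): the inputs are integer vectors in R, but the function is declared with NumericVector, so the values are handled as doubles.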

24.6.2. Compiling Packages

While sourceCpp makes ad hoc C++ compilation easy, a different tactic is needed for building R packages using C++ code. The C++ code is put in a .cpp file inside the src folder. Any functions preceded by // [[Rcpp::export]] will be converted into end user facing R functions when the package is built using build from devtools. Any roxygen2 documentation written above an exported C++ function will be used to document the resulting R function.

The vector_add function should be rewritten using roxygen2 and saved in the appropriate file.

#include <Rcpp.h>
using namespace Rcpp;

//' @title vector_add
//' @description Add two vectors
//' @details Adding two vectors with a for loop
//' @author Jared P. Lander
//' @export vector_add
//' @aliases vector_add
//' @param x Numeric Vector
//' @param y Numeric Vector
//' @return a numeric vector resulting from adding x and y
//' @useDynLib ThisPackage
// [[Rcpp::export]]
NumericVector vector_add(NumericVector x, NumericVector y)
{
    NumericVector result(x.size());

    for(int i=0; i<x.size(); ++i)
    {
        result[i] = x[i] + y[i];
    }

    return(result);
}

The magic is that Rcpp compiles the code, and then creates a new .R file in the R folder with the corresponding R code. In this case it builds the following.

> # This file was generated by Rcpp::compileAttributes
> # Generator token: 10BE3573-1514-4C36-9D1C-5A225CD40393
>
> #' @title vector_add
> #' @description Add two vectors
> #' @details Adding two vectors with a for loop
> #' @author Jared P. Lander
> #' @export vector_add
> #' @aliases vector_add
> #' @param x Numeric Vector
> #' @param y Numeric Vector
> #' @useDynLib RcppTest
> #' @return a numeric vector resulting from adding x and y
> vector_add <- function(x, y)
+ {
+     .Call("RcppTest_vector_add", PACKAGE = "RcppTest", x, y)
+ }

It is simply a wrapper function that uses .Call to call the compiled C++ function.

Any functions that are not preceded by // [[Rcpp::export]] can be called from within other C++ functions, but cannot be called from R using .Call. Specifying a name attribute in the export statement, like // [[Rcpp::export(name="NewName")]], causes the resulting R function to be given that name. Functions that do not need an R wrapper function automatically built, but need to be callable using .Call, should be placed in a separate .cpp file where // [[Rcpp::interfaces(cpp)]] is declared and each function that is to be user accessible is preceded by // [[Rcpp::export]].

In order to expose its C++ functions, a package’s NAMESPACE must contain useDynLib(PackageName). This can be accomplished by putting the @useDynLib PackageName tag in any of the roxygen2 blocks. Further, if a package uses Rcpp the DESCRIPTION file must list Rcpp in both the LinkingTo and Depends fields. The LinkingTo field also allows easy linking to other C++ libraries such as RcppArmadillo, bigmemory and BH (Boost).
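
A quick way to confirm that both fields are present is to read the DESCRIPTION file back into R. In the sketch below the file contents, the version number and the helper field_packages are hypothetical; only read.dcf and the field names come from the text.

```r
# Hypothetical DESCRIPTION for a package named RcppTest
desc_file <- tempfile()
writeLines(c(
    "Package: RcppTest",
    "Version: 0.1.0",
    "Depends: Rcpp",
    "LinkingTo: Rcpp, RcppArmadillo"
), desc_file)

# Split a comma-separated field into package names, dropping version specs
field_packages <- function(desc, field)
{
    if (!field %in% colnames(desc) || is.na(desc[1, field])) {
        return(character(0))
    }
    entries <- strsplit(desc[1, field], ",")[[1]]
    trimws(sub("\\(.*\\)", "", entries))
}

desc <- read.dcf(desc_file)

# Both checks should be TRUE for a package that uses Rcpp as described
"Rcpp" %in% field_packages(desc, "Depends")
"Rcpp" %in% field_packages(desc, "LinkingTo")
```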

The src folder of the package must also contain Makevars and Makevars.win files to help with compilation. The following examples were automatically generated using Rcpp.package.skeleton and should be sufficient for most packages.

First the Makevars file:


## Use the R_HOME indirection to support installations of multiple
## R version

PKG_LIBS = `$(R_HOME)/bin/Rscript -e "Rcpp:::LdFlags()"`


## As an alternative, one can also add this code in a file 'configure'
##
##    PKG_LIBS=`${R_HOME}/bin/Rscript -e "Rcpp:::LdFlags()"`
##
##    sed -e "s|@PKG_LIBS@|${PKG_LIBS}|" \
##        src/Makevars.in > src/Makevars
##
## which together with the following file 'src/Makevars.in'
##
##    PKG_LIBS = @PKG_LIBS@
##
## can be used to create src/Makevars dynamically. This scheme is more
## powerful and can be expanded to also check for and link with other
## libraries. It should be complemented by a file 'cleanup'
##
##    rm src/Makevars
##
## which removes the autogenerated file src/Makevars.
##
## Of course, autoconf can also be used to write configure files. This is
## done by a number of packages, but recommended only for more advanced
## users comfortable with autoconf and its related tools.


Now the Makevars.win file:


## Use the R_HOME indirection to support installations of multiple
## R version
PKG_LIBS = $(shell "${R_HOME}/bin${R_ARCH_BIN}/Rscript.exe" -e "Rcpp:::LdFlags()")


This just barely scratches the surface of Rcpp, but should be enough to start a basic package that relies on C++ code. Packages containing C++ code are built the same as any other package, preferably using build in devtools.

24.7. Conclusion

Package building is a great way to make code portable between projects and to share it with other people. A package purely built with R code only requires working functions that can pass the CRAN check using check and proper help files that can be easily built by including roxygen2 documentation above functions and calling document. Building the package is as simple as using build. Packages with C++ should use Rcpp.
