As of late July 2013, there were 4,714 packages on CRAN and another 671 on Bioconductor, with more being added daily. In the past, building a package could be mystifying and complicated, but that is no longer the case, especially when using Hadley Wickham’s devtools package.
All packages submitted to CRAN (or Bioconductor) must follow specific guidelines, including the folder structure of the package, inclusion of DESCRIPTION and NAMESPACE files and proper help files.
An R package is essentially a folder of folders, each containing specific files. At the very minimum there must be two folders, one called R where the included functions go, and the other called man where the documentation files are placed. It used to be that the documentation had to be written manually, but thanks to roxygen2 that is no longer necessary, as is seen in Section 24.3. Starting with R 3.0.0, CRAN is very strict in requiring that all files must end with a blank line and that code examples must be shorter than 105 characters.
In addition to the R and man folders, other common folders are src for compiled code such as C++ and FORTRAN, data for data that is included in the package and inst for files that should be available to the end user. No files from the other folders are available in a human-readable form (except the INDEX, LICENSE and NEWS files in the root folder) when a package is installed. Table 24.1 lists the most common folders used in an R package.
The root folder of the package must contain at least a DESCRIPTION file and a NAMESPACE file, which are described in Sections 24.2.1 and 24.2.2. Other files like NEWS, LICENSE and README are recommended but not necessary. Table 24.2 lists commonly used files.
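As a rough illustration of the minimal layout just described, the following sketch creates an empty skeleton in a temporary directory using only base R. The package name examplepkg is made up for this example, and the two root files are left empty; a real package needs actual content in them.

```r
# Hypothetical package name, created under a temporary directory
pkg_root <- file.path(tempdir(), "examplepkg")

# The two required folders: R for code, man for documentation
dir.create(file.path(pkg_root, "R"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(pkg_root, "man"), showWarnings = FALSE)

# The two required root files, left empty in this sketch
file.create(file.path(pkg_root, c("DESCRIPTION", "NAMESPACE")))

# Inspect the resulting layout
list.files(pkg_root)
```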
The DESCRIPTION file contains information about the package, such as its name, version, author and other packages it depends on. The information is entered, each on one line, as Item1: Value1. Table 24.3 lists a number of fields that are used in DESCRIPTION files.
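The Item: Value layout can be read back into R with the base function read.dcf, which is one way to see how the fields are parsed. This is only a sketch: the package name examplepkg and the field values are invented for illustration.

```r
# A few illustrative fields; the names and values are hypothetical
desc_lines <- c(
    "Package: examplepkg",
    "Type: Package",
    "Title: An Illustrative Package",
    "Version: 0.1.0",
    "Depends: ggplot2 (>= 0.9.1)"
)

# Write the fields to a temporary file and read them back
desc_path <- tempfile("DESCRIPTION")
writeLines(desc_lines, desc_path)
desc <- read.dcf(desc_path)

# read.dcf returns a character matrix with one column per field
colnames(desc)
desc[1, "Version"]
```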
The Package field specifies the name of the package. This is the name that appears on CRAN and how users access the package.
Type is a bit archaic; it can be either Package or one other type, Frontend, which is used for building a graphical front end to R and will not be helpful for building an R package of functions.
Title is a short description of the package. It should be relatively brief and cannot end in a period. Description is a complete description of the package, which can be several sentences long but no longer than a paragraph.
Version is the package version and usually consists of three period-separated integers; for example, 1.15.2. Date is the release date of the current version.
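Because a version is a sequence of integers rather than a decimal number or a plain string, it helps to compare versions with the base function package_version. The version numbers in this small sketch are arbitrary examples.

```r
# Two arbitrary version numbers for illustration
older <- package_version("1.9.10")
newer <- package_version("1.15.2")

# Component-wise comparison: 15 is greater than 9 in the second position
newer > older

# Comparing the raw strings goes character by character and gets this wrong
"1.15.2" > "1.9.10"

# compareVersion returns 1, 0 or -1 for the same component-wise comparison
utils::compareVersion("1.15.2", "1.9.10")
```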
The Author and Maintainer fields are similar but both are necessary. Author can be multiple people, separated by commas, and Maintainer is the person in charge, or rather the person who gets complained to, and should be a name followed by an email address inside angle brackets (<>). An example is Maintainer: Jared P. Lander <[email protected]>. CRAN is actually very strict about the Maintainer field and can reject a package for not having the proper format.
License information goes in the appropriately named License field. It should be either an abbreviation of one of the standard specifications such as GPL-2 or BSD or the string 'file LICENSE' referring to the LICENSE file in the package’s root folder.
Things get tricky with the Depends, Imports and Suggests fields. Often a package requires functions from other packages. In that case the other package, for example, ggplot2, should be listed in either the Depends or Imports field as a comma-separated list. If ggplot2 is listed in Depends, then when the package is loaded ggplot2 is attached as well, and its functions will be available to functions in the package and to the end user. If ggplot2 is listed in Imports, then when the package is loaded ggplot2 will not be attached to the search path, and its functions will be available to functions in the package but not directly to the end user. Packages should be listed in one or the other, not both. Packages listed in either of these fields will be automatically installed from CRAN when the package is installed. If the package depends on a specific version of another package, then that package name should be followed by the version number in parentheses; for example, Depends: ggplot2 (>= 0.9.1). Packages that are needed for the examples in the documentation, vignettes or testing but are not necessary for the package’s functionality should be listed in Suggests.
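The practical difference between the two fields is whether the other package ends up on the end user’s search path. The sketch below imitates this at the console, using the tools package that ships with R as a stand-in for ggplot2; that substitution is an assumption made only so the example runs without extra installs.

```r
# Roughly what Imports amounts to: the namespace is loaded,
# so its functions are reachable with ::
loadNamespace("tools")
isNamespaceLoaded("tools")
tools::file_ext("coefplot.r")

# Whether the package is also on the search path can be checked like this
"package:tools" %in% search()

# Roughly what Depends amounts to: the package is attached,
# so the end user can call its functions directly
library(tools)
"package:tools" %in% search()
file_ext("coefplot.r")
```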
The Collate field specifies the R code files contained in the R folder. This will be populated automatically if the package is documented using roxygen2 and devtools.
A relatively new feature is byte-compilation, which can significantly speed up R code. Setting ByteCompile to TRUE will ensure the package is byte-compiled when installed by the end user.
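To get a feel for what byte-compilation does, a single function can be compiled by hand with cmpfun from the compiler package that ships with R. This is only a sketch of the idea on one made-up function, sum_squares; the ByteCompile field applies the same kind of compilation to the whole package at install time.

```r
library(compiler)

# A deliberately loop-heavy function, invented for this example
sum_squares <- function(n)
{
    total <- 0
    for(i in seq_len(n))
    {
        total <- total + i^2
    }
    return(total)
}

# cmpfun returns a byte-compiled copy of the function
sum_squares_bc <- cmpfun(sum_squares)

# Both versions compute the same result
identical(sum_squares(100), sum_squares_bc(100))
```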
The DESCRIPTION file from coefplot is shown next.
Package: coefplot
Type: Package
Title: Plots Coefficients from Fitted Models
Version: 1.1.9
Date: 2013-01-23
Author: Jared P. Lander
Maintainer: Jared P. Lander <[email protected]>
Description: Plots the coefficients from a model object
License: BSD
LazyLoad: yes
Depends:
ggplot2
Imports:
plyr,
stringr,
reshape2,
useful,
scales,
proto
Collate:
'coefplot.r'
'coefplot-package.r'
'multiplot.r'
'extractCoef.r'
'buildPlottingFrame.r'
'buildPlot.r'
'dodging.r'
ByteCompile: TRUE
The NAMESPACE file specifies which functions are exposed to the end user (not all functions in a package should be) and which other packages are imported into the NAMESPACE. Functions that are exported are listed as export(multiplot) and imported packages are listed as import(plyr). Building this file by hand can be quite tedious, so fortunately roxygen2 and devtools can, and should, build this file automatically.
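The effect of exporting can be inspected for any installed package. As a small sketch, the stats package that ships with R exports lm but keeps print.lm internal, even though print.lm exists inside the namespace.

```r
# All names the stats package exposes to the end user
stats_exports <- getNamespaceExports("stats")

# lm is exported
"lm" %in% stats_exports

# print.lm is not exported ...
"print.lm" %in% stats_exports

# ... but it does exist inside the namespace
exists("print.lm", envir = asNamespace("stats"), inherits = FALSE)
```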
R has three object-oriented systems: S3, S4 and Reference Classes. S3 is the oldest and simplest of the systems and is what we will focus on in this book. It consists of a number of generic functions such as print, summary, coef and coefplot. The generic functions exist only to dispatch object-specific functions. Typing print into the console shows this.
> print
function (x, ...)
UseMethod("print")
<bytecode: 0x00000000081e2bb0>
<environment: namespace:base>
It is a single-line function containing the command UseMethod("print"), which tells R to call another function depending on the class of the object passed. These can be seen with methods(print). To save space we show only 20 of the results. Functions not exposed to the end user are marked with an asterisk (*). All of the names are print and the object class separated by a period.
> methods(print)
[1] print.aareg* print.abbrev*
[3] print.acf* print.AES*
[5] print.agnes* print.anova
[7] print.Anova* print.anova.gam
[9] print.anova.lme* print.anova.loglm*
[11] print.aov* print.aovlist*
[13] print.ar* print.Arima*
[15] print.arima0* print.arma*
[17] print.AsIs print.aspell*
[19] print.aspell_inspect_context* print.balance*
[ reached getOption("max.print") -- omitted 385 entries ]
Non-visible functions are asterisked
When print is called on an object, it then calls one of these functions depending on the type of object. For instance, a data.frame is sent to print.data.frame and an lm object is sent to print.lm.
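The same mechanism is available for new generics. The following sketch defines a made-up generic called describe, with methods for data.frame and lm objects and a default fallback; the function names and the small data set are invented for illustration.

```r
# The generic only dispatches on the class of x
describe <- function(x, ...)
{
    UseMethod("describe")
}

# Method used for data.frames
describe.data.frame <- function(x, ...)
{
    sprintf("a data.frame with %d rows", nrow(x))
}

# Method used for lm objects
describe.lm <- function(x, ...)
{
    sprintf("a linear model with %d coefficients", length(coef(x)))
}

# Fallback used when no class-specific method exists
describe.default <- function(x, ...)
{
    "some other object"
}

# A small made-up data set
df <- data.frame(x = 1:5, y = c(2.1, 3.9, 6.2, 7.8, 10.1))

describe(df)
describe(lm(y ~ x, data = df))
describe(1:10)
```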
These different object-specific functions that get called by generic S3 functions must be declared in the NAMESPACE in addition to the functions that are exported. This is indicated as S3method(coefplot, lm) to say that coefplot.lm is registered with the coefplot generic function.
The NAMESPACE file from coefplot is shown next.
S3method(coefplot,default)
S3method(coefplot,glm)
S3method(coefplot,lm)
S3method(coefplot,rxGlm)
S3method(coefplot,rxLinMod)
S3method(coefplot,rxLogit)
S3method(extract.coef,default)
S3method(extract.coef,glm)
S3method(extract.coef,lm)
S3method(extract.coef,rxGlm)
S3method(extract.coef,rxLinMod)
S3method(extract.coef,rxLogit)
export(buildModelCI)
export(coefplot)
export(coefplot.default)
export(coefplot.glm)
export(coefplot.lm)
export(coefplot.rxGlm)
export(coefplot.rxLinMod)
export(coefplot.rxLogit)
export(collidev)
export(extract.coef)
export(multiplot)
export(plotcoef)
export(pos_dodgev)
import(ggplot2)
import(plyr)
import(proto)
import(reshape2)
import(scales)
import(stringr)
import(useful)
Even with a small package like coefplot, building the NAMESPACE file by hand can be tedious and error prone, so it is best to let devtools and roxygen2 build it.
The NEWS file is for detailing what is new or changed in each version. The three most recent entries in the coefplot NEWS file are shown next. Notice how it is good practice to thank people who helped with or inspired the update. This file will be available to the end user’s installation.
The LICENSE file is for specifying more detailed information about the package’s license and will be available to the end user’s installation. The LICENSE file from coefplot is shown here.
The README file is purely informational and is not included in the end user’s installation. Its biggest benefit may be for packages hosted on GitHub, where the README will be the information displayed on the project’s home page.
A very strict requirement for R packages to be accepted by CRAN is proper documentation. Each exported function in a package needs its own .Rd file that is written in a LaTeX-like syntax. This can be difficult to write for even simple functions like the following one.
> simple.ex <- function(x, y)
+ {
+ return(x * y)
+ }
Even though it has only two arguments and simply returns the product of the two, it has a lot of necessary documentation, shown here.
\name{simple.ex}
\alias{simple.ex}
\title{within.distance}
\usage{simple.ex(x, y)}
\arguments{
\item{x}{A numeric}
\item{y}{A second numeric}
}
\value{x times y}
\description{Compute distance threshold}
\details{This is a simple example of a function}
\author{Jared P. Lander}
\examples{
simple.ex(3, 5)
}
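One way to check that a hand-written help file is at least syntactically valid is to parse it with parse_Rd from the tools package that ships with R. The sketch below writes a shortened, illustrative version of such a file to a temporary location and lists the sections that were recognized; the title and description text are placeholders chosen for this example.

```r
# A shortened help file; backslashes are doubled inside R strings
rd_lines <- c(
    "\\name{simple.ex}",
    "\\alias{simple.ex}",
    "\\title{Simple Example}",
    "\\usage{simple.ex(x, y)}",
    "\\arguments{",
    "\\item{x}{A numeric}",
    "\\item{y}{A second numeric}",
    "}",
    "\\value{x times y}",
    "\\description{This is a simple example of a function}"
)

# Write the lines to a temporary .Rd file and parse it
rd_path <- tempfile(fileext = ".Rd")
writeLines(rd_lines, rd_path)
rd <- tools::parse_Rd(rd_path)

# Each top-level piece carries its section name in the Rd_tag attribute;
# the TEXT pieces are just the line breaks between sections
tags <- vapply(rd, function(piece) attr(piece, "Rd_tag"), character(1))
setdiff(tags, "TEXT")
```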
Rather than taking this two-step approach, it is better to write function documentation along with the function. That is, the documentation is written in a specially commented out block right above the function, as shown here.
> #' @title simple.ex
> #' @description Simple Example
> #' @details This is a simple example of a function
> #' @aliases simple.ex
> #' @author Jared P. Lander
> #' @export simple.ex
> #' @param x A numeric
> #' @param y A second numeric
> #' @return x times y
> #' @examples
> #' simple.ex(5, 3)
> simple.ex <- function(x, y)
+ {
+ return(x * y)
+ }
Running document from devtools will automatically generate the appropriate .Rd file based on the block of code above the function. The code is indicated by #' at the beginning of the line. Table 24.4 lists a number of commonly used roxygen2 tags.
Every argument must be documented with a @param tag, including the dots (...), which are written as \dots. There must be an exact correspondence between @param tags and arguments; one more or less will cause an error.
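A quick way to spot a mismatch before running the check is to compare the argument names of a function with the names that have been documented. In this sketch the vector documented is typed by hand and stands in for the @param tags of a hypothetical roxygen2 block; nothing here reads an actual documentation file.

```r
# A function with three arguments, counting the dots
simple.ex <- function(x, y, ...)
{
    return(x * y)
}

# Names taken from a hypothetical set of @param tags
documented <- c("x", "y")

# Names of the arguments the function actually has
arguments <- names(formals(simple.ex))

# Arguments that still lack a @param tag
setdiff(arguments, documented)

# @param tags that do not match any argument
setdiff(documented, arguments)
```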
It is considered good form to show examples of a function’s usage. This is done on the lines following the @examples tag. In order to be accepted by CRAN all of the examples must work without error. In order to show, but not run, the examples, wrap them in \dontrun{...}.
Knowing the type of object is important when using a function, so @return should be used to describe the returned object. If the object is a list, the @return tag should be an itemized list of the form \item{name a}{description a}\item{name b}{description b}.
Help pages are typically arrived at by typing ?FunctionName into the console. The @aliases tag uses a space-separated list to specify the names that will lead to a particular help file. For instance, using @aliases coefplot plotcoef will result in both ?coefplot and ?plotcoef leading to the same help file.
In order for a function to be exposed to the end user, it must be listed as an export in the NAMESPACE file. Using @export FunctionName automatically adds export(FunctionName) to the NAMESPACE file. Similarly, to use a function from another package, that package must be imported and @import PackageName adds import(PackageName) to the NAMESPACE file.
When building functions that get called by generic functions, such as coefplot.lm or print.anova, the @S3method tag should be used. @S3method GenericFunction Class adds S3method(GenericFunction,class) to the NAMESPACE file. When using @S3method it is a good idea to also use @method with the same arguments. This is shown in the following function.
> #' @title print.myClass
> #' @aliases print.myClass
> #' @method print myClass
> #' @S3method print myClass
> #' @export print.myClass
> #' @param x Simple object
> #' @param ... Further arguments to be passed on
> #' @return The top 5 rows of x
> print.myClass <- function(x, ...)
+ {
+ class(x) <- "list"
+ x <- as.data.frame(x)
+ print.data.frame(head(x, 5))
+ }
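To see the method in action, the sketch below repeats the function without its documentation block and calls print on a small made-up object of class myClass. The object obj and its contents are invented for illustration.

```r
# The print method from above, without the roxygen2 block
print.myClass <- function(x, ...)
{
    class(x) <- "list"
    x <- as.data.frame(x)
    print.data.frame(head(x, 5))
}

# A list with ten elements per component, tagged with the new class
obj <- structure(list(id = 1:10, letter = letters[1:10]), class = "myClass")

# print dispatches to print.myClass, which shows only the first five rows
print(obj)
```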
Building a package used to require going to the command prompt and using commands like R CMD check, R CMD build and R CMD INSTALL (in Windows it is Rcmd instead of R CMD), which required being in the proper directory, knowing the correct options and other bothersome time wasters. Thanks to Hadley Wickham, this has all been made much easier and can be done from within the R console.
The first step is to make sure a package is properly documented by calling document. The first argument is the path to the root folder of the package as a string. (If the current working directory is the same as the root folder, then no arguments are even needed. This is true of all the devtools functions.) This builds all the necessary .Rd files, the NAMESPACE file and the Collate field of the DESCRIPTION file.
> require(devtools)
> document()
After the package is properly documented, it is time to check it. This is done using check with the path to the package as the first argument. This will make note of any errors or warnings that would prevent CRAN from accepting the package. CRAN can be very strict, so it is essential to address all the issues.
> check()
Building the package is equally simple using the build function, which also takes the path to the package as the first argument. By default it builds a .tar.gz, a collection of all the files in the package, that still needs to be built into a binary that can be installed in R. It is portable in that it can be built on any operating system. The binary argument, if set to TRUE, will build a binary that is operating system specific. This can be problematic if compiled source code is involved.
> build()
> build(binary = TRUE)
Other functions to help with the development process are install, which rebuilds and loads the package, and load_all, which simulates the loading of the package and NAMESPACE.
Another great function, not necessarily for the development process so much as for getting other people’s latest work, is install_github, which can install an R package directly from a GitHub repository. There are analogous functions for installing from BitBucket (install_bitbucket) and from Git repositories in general (install_git).
For instance, to get the latest version of coefplot the following code should be run. By the time of publication this might no longer be the latest version.
> install_github(repo = "coefplot", username = "jaredlander",
+ ref = "survival")
Sometimes an older version of a package on CRAN is needed, which under normal circumstances is hard to get without downloading source packages manually and building them. However, install_version was recently added to devtools, allowing a specific version of a package to be downloaded from CRAN, built and installed.
The best way to get a package out to the R masses is to have it on CRAN. Assuming the package passed the check using check from devtools, it is ready to be uploaded to CRAN using the new Web uploader (as opposed to using FTP) at http://xmpalantir.wu.ac.at/cransubmit/. The .tar.gz file is the one to upload. After submission, CRAN will send an email requiring confirmation that the package was indeed uploaded by the maintainer. Alternatively, the package can be uploaded by anonymous FTP to ftp://CRAN.R-project.org/incoming/ with an email sent to Uwe Ligges at [email protected] and to [email protected]. The subject line must be of the format CRAN Upload: PackageName PackageVersion. The name of the package is case sensitive and must match the name of the package in the DESCRIPTION file. The body of the message does not have to follow any guidelines, but should be polite and include the words “thank you” somewhere, because the CRAN team puts in an incredible amount of effort despite not getting paid.
Sometimes R code is just not fast enough (even when byte-compiled) for a given problem and a compiled language must be used. R’s foundation in C and links to FORTRAN libraries (digging deep enough into certain functions, such as lm, reveals that the underpinnings are written in FORTRAN) make incorporating those languages fairly natural. .Fortran is used for calling a function written in FORTRAN and .Call is used for calling C and C++ functions.¹ Even with those convenient functions, knowledge of either FORTRAN or C/C++ is still necessary, as is knowledge of how R objects are represented in the underlying language.
1. There is also a .C function, although despite much debate it is generally frowned upon.
Thanks to Dirk Eddelbuettel and Romain François, integrating C++ code has become much easier using the Rcpp package. It handles a lot of the scaffolding necessary to make C++ functions callable from R. Not only did they make developing R packages with C++ easier, but they also made running ad hoc C++ possible.
A number of tools are necessary for working with C++ code. First, a proper C++ compiler must be available. To maintain compatibility it is best to use gcc.
Linux users should already have gcc installed and should not have a problem, but they might need to install g++.
Mac users need to install Xcode and might have to manually select g++. The compiler offered on Mac generally lags behind the most recent version available, which has been known to cause some issues.
Windows users should actually have an easy time getting started, thanks to RTools developed by Brian Ripley and Duncan Murdoch. It provides all necessary development tools, including gcc and make. The proper version, depending on the installed version of R, can be downloaded from http://cran.r-project.org/bin/windows/Rtools/ and installed like any other program. It installs gcc and makes the Windows command prompt act more like a BASH terminal. If building packages from within R using devtools and RStudio (which is the best way now), then the location of gcc will be determined from the operating system’s registry. If building packages from the command prompt, then the location of gcc must be put at the very beginning of the system PATH like c:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin;C:\Users\Jared\Documents\R\R-3.0.0\bin\x64.
A LaTeX distribution is needed for building package help documents and vignettes. Table 23.1 lists the primary distributions for the different operating systems.
To start, we build a simple C++ function for adding two vectors. Doing so does not make sense from a practical point of view because R already does this natively and quickly, but it will be good for illustrative purposes. The function will have arguments for two vectors and return the element-wise sum. The // [[Rcpp::export]] tag tells Rcpp that the function should be exported for use in R.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector vector_add(NumericVector x, NumericVector y)
{
// declare the result vector
NumericVector result(x.size());
// loop through the vectors and add them element by element
for(int i=0; i<x.size(); ++i)
{
result[i] = x[i] + y[i];
}
return(result);
}
This function should be saved in a .cpp file (for example, vector_add.cpp) or as a character variable so it can be sourced using sourceCpp, which will automatically compile the code and create a new R function with the same name that, when called, executes the C++ function.
> require(Rcpp)
> sourceCpp("vector_add.cpp")
Printing the function shows that it points to a temporary location where the compiled function is currently stored.
> vector_add
function (x, y)
.Primitive(".Call")(<pointer: 0x0000000066e81710>, x, y)
The function can now be called just like any other R function.
> vector_add(x = 1:10, y = 21:30)
[1] 22 24 26 28 30 32 34 36 38 40
> vector_add(1, 2)
[1] 3
> vector_add(c(1, 5, 3, 1), 2:5)
[1] 3 8 7 6
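Note that the loop in vector_add walks over the length of x and reads the same positions of y, so it assumes the two vectors are the same length. R’s native addition instead recycles the shorter vector. The sketch below shows the native behavior and a small, hypothetical helper, same_length_add, that makes the equal-length assumption explicit; it uses + as a stand-in for the compiled function so that it runs without a compiler.

```r
# Native addition matches the last call above
c(1, 5, 3, 1) + 2:5

# Native addition recycles the shorter vector
1:4 + 10

# Hypothetical wrapper that refuses vectors of different lengths;
# + stands in for the compiled vector_add
same_length_add <- function(x, y)
{
    stopifnot(length(x) == length(y))
    return(x + y)
}

same_length_add(c(1, 5, 3, 1), 2:5)

# A length mismatch is reported as an error instead of being recycled
result <- try(same_length_add(1:4, 10), silent = TRUE)
inherits(result, "try-error")
```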
JJ Allaire (the founder of RStudio) is responsible for sourceCpp, the // [[Rcpp::export]] shortcut and a lot of the magic that simplifies using C++ with R in general. Rcpp maintainer Dirk Eddelbuettel cannot stress enough how helpful Allaire’s contributions have been.
Another nice feature of Rcpp is the syntactic sugar that allows C++ code to be written like R. Using sugar we can rewrite vector_add with just one line of code.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector vector_add(NumericVector x, NumericVector y)
{
return(x + y);
}
The syntactic sugar allowed two vectors to be added just as if they were being added in R.
Because C++ is a strongly typed language, it is important that function arguments and return types be explicitly declared using the correct type. Typical types are NumericVector, IntegerVector, LogicalVector, CharacterVector, DataFrame and List.
While sourceCpp makes ad hoc C++ compilation easy, a different tactic is needed for building R packages using C++ code. The C++ code is put in a .cpp file inside the src folder. Any functions preceded by // [[Rcpp::export]] will be converted into end user facing R functions when the package is built using build from devtools. Any roxygen2 documentation written above an exported C++ function will be used to document the resulting R function.
The vector_add function should be rewritten using roxygen2 and saved in the appropriate file.
#include <Rcpp.h>
using namespace Rcpp;
//' @title vector_add
//' @description Add two vectors
//' @details Adding two vectors with a for loop
//' @author Jared P. Lander
//' @export vector_add
//' @aliases vector_add
//' @param x Numeric Vector
//' @param y Numeric Vector
//' @return a numeric vector resulting from adding x and y
//' @useDynLib ThisPackage
// [[Rcpp::export]]
NumericVector vector_add(NumericVector x, NumericVector y)
{
NumericVector result(x.size());
for(int i=0; i<x.size(); ++i)
{
result[i] = x[i] + y[i];
}
return(result);
}
The magic is that Rcpp compiles the code, and then creates a new .R file in the R folder with the corresponding R code. In this case it builds the following.
> # This file was generated
> # by Rcpp::compileAttributes Generator token:
> # 10BE3573-1514-4C36-9D1C-5A225CD40393
>
> #' @title vector_add
> #' @description Add two vectors
> #' @details Adding two vectors with a for loop
> #' @author Jared P. Lander
> #' @export vector_add
> #' @aliases vector_add
> #' @param x Numeric Vector
> #' @param y Numeric Vector
> #' @useDynLib RcppTest
> #' @return a numeric vector resulting from adding x and y
> vector_add <- function(x, y)
+ {
+ .Call("RcppTest_vector_add", PACKAGE = "RcppTest", x, y)
+ }
It is simply a wrapper function that uses .Call to call the compiled C++ function.
Any functions that are not preceded by // [[Rcpp::export]] are available to be called from within other C++ functions, but not from R, using .Call. Specifying a name attribute in the export statement, like // [[Rcpp::export(name="NewName")]], causes the resulting R function to be called that name. Functions that do not need an R wrapper function automatically built, but need to be callable using .Call, should be placed in a separate .cpp file where // [[Rcpp::interfaces(cpp)]] is declared and each function that is to be user accessible is preceded by // [[Rcpp::export]].
In order to expose its C++ functions, a package’s NAMESPACE must contain useDynLib(PackageName). This can be accomplished by putting the @useDynLib PackageName tag in any of the roxygen2 blocks. Further, if a package uses Rcpp the DESCRIPTION file must list Rcpp in both the LinkingTo and Depends fields. The LinkingTo field also allows easy linking to other C++ libraries such as RcppArmadillo, bigmemory and BH (Boost).
The src folder of the package must also contain Makevars and Makevars.win files to help with compilation. The following examples were automatically generated using Rcpp.package.skeleton and should be sufficient for most packages.
## Use the R_HOME indirection to support installations of multiple
## R version
PKG_LIBS = `$(R_HOME)/bin/Rscript -e "Rcpp:::LdFlags()"`
## As an alternative, one can also add this code in a file 'configure'
##
## PKG_LIBS=`${R_HOME}/bin/Rscript -e "Rcpp:::LdFlags()"`
##
## sed -e "s|@PKG_LIBS@|${PKG_LIBS}|"
## src/Makevars.in > src/Makevars
##
## which together with the following file 'src/Makevars.in'
##
## PKG_LIBS = @PKG_LIBS@
##
## can be used to create src/Makevars dynamically. This scheme is more
## powerful and can be expanded to also check for and link with other
## libraries. It should be complemented by a file 'cleanup'
##
## rm src/Makevars
##
## which removes the autogenerated file src/Makevars.
##
## Of course, autoconf can also be used to write configure files. This is
## done by a number of packages, but recommended only for more advanced
## users comfortable with autoconf and its related tools.
Now the Makevars.win file:
## Use the R_HOME indirection to support installations of multiple
## R version
PKG_LIBS = $(shell "${R_HOME}/bin${R_ARCH_BIN}/Rscript.exe" -e "Rcpp:::LdFlags()")
This just barely scratches the surface of Rcpp, but should be enough to start a basic package that relies on C++ code. Packages containing C++ code are built the same as any other package, preferably using build in devtools.
Package building is a great way to make code portable between projects and to share it with other people. A package purely built with R code only requires working functions that can pass the CRAN check using check and proper help files that can be easily built by including roxygen2 documentation above functions and calling document. Building the package is as simple as using build. Packages with C++ should use Rcpp.