Writing data analysis code is hard. This chapter is about what happens when things go wrong, and how to avoid that happening in the first place. We start with the different types of feedback that you can give to indicate a problem, working up to errors. Then we look at how to handle those errors when they are thrown, and how to debug code to eliminate the bad errors. A look at unit testing frameworks gives you the skills to avoid writing buggy code.
Next, we see some magic tricks: converting strings into code and code into strings (“Just like that!” as Tommy Cooper used to say). The chapter concludes with an introduction to some of the object-oriented programming systems in R.
After reading this chapter, you should, among other things, know how to use the RUnit and testthat unit testing frameworks.
We’ve seen the print function on many occasions for displaying variables to the console. For displaying diagnostic information about the state of the program, R has three functions. In increasing order of severity, they are message, warning, and stop.
message concatenates its inputs without spaces and writes them to the console. Some common uses are providing status updates for long-running functions, notifying users of new behavior when you’ve changed a function, and providing information on default arguments:
f <- function(x)
{
  message("'x' contains ", toString(x))
  x
}
f(letters[1:5])
## 'x' contains a, b, c, d, e
## [1] "a" "b" "c" "d" "e"
The main advantage of using message over print (or the lower-level cat) is that the user can turn off their display. It may seem trivial, but when you are repeatedly running the same code, not seeing the same message 100 times can have a wonderful effect on morale:
suppressMessages(f(letters[1:5]))
## [1] "a" "b" "c" "d" "e"
Warnings behave very similarly to messages, but have a few extra features to reflect their status as indicators of bad news. Warnings should be used when something has gone wrong, but not so wrong that your code should just give up. Common use cases are bad user inputs, poor numerical accuracy, or unexpected side effects:
g <- function(x)
{
  if(any(x < 0))
  {
    warning("'x' contains negative values: ", toString(x[x < 0]))
  }
  x
}
g(c(3, -7, 2, -9))
## Warning: 'x' contains negative values: -7, -9
## [1] 3 -7 2 -9
As with messages, warnings can be suppressed:
suppressWarnings(g(c(3, -7, 2, -9)))
## [1] 3 -7 2 -9
There is a global option, warn, that determines how warnings are handled. By default warn takes the value 0, which means that warnings are displayed when your code has finished running.
You can see the current level of the warn option using getOption:
getOption("warn")
## [1] 0
If you change this value to be less than zero, all warnings are ignored:
old_ops <- options(warn = -1)
g(c(3, -7, 2, -9))
## [1] 3 -7 2 -9
It is usually dangerous to completely turn off warnings, though, so you should reset the options to their previous state using:
options(old_ops)
Setting warn to 1 means that warnings are displayed as they occur, and a warn value of 2 or more means that all warnings are turned into errors.
You can access the last warning by typing last.warning.
I mentioned earlier that if the warn option is set to 0, then warnings are shown when your code finishes running. Actually, it’s a little more complicated than that. If 10 or fewer warnings were generated, then this is what happens. But if there were more than 10 warnings, you get a message stating how many warnings were generated, and you have to type warnings() to see them. This is demonstrated in Figure 16-1.
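The effect of a warn value of 2 can be checked without reading the console. The following is a minimal sketch rather than part of the original examples: it redefines g so that the block runs on its own, and it uses tryCatch (covered later in this chapter) to capture the error that the warning is turned into. The variable name converted is purely illustrative.

```r
# g is redefined here so that the block is self-contained
g <- function(x)
{
  if(any(x < 0))
  {
    warning("'x' contains negative values: ", toString(x[x < 0]))
  }
  x
}

old_ops <- options(warn = 2)   # warnings are now turned into errors
converted <- tryCatch(
  g(c(3, -7, 2, -9)),
  error = function(e) conditionMessage(e)
)
options(old_ops)               # always restore the previous setting

# converted holds the error message, which includes the warning text
converted
```

Restoring the options immediately after the risky call keeps the stricter setting from leaking into the rest of the session.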
Errors are the most serious condition, and throwing them halts further execution. Errors should be used when a mistake has occurred or you know a mistake will occur. Common reasons include bad user input that can’t be corrected (by using an as.* function, for example), the inability to read from or write to a file, or a severe numerical error:
h <- function(x, na.rm = FALSE)
{
  if(!na.rm && any(is.na(x)))
  {
    stop("'x' has missing values.")
  }
  x
}
h(c(1, NA))
## Error: 'x' has missing values.
stopifnot throws an error if any of the expressions passed to it evaluate to something that isn’t true. It provides a simple way of checking that the state of your program is as expected:
h <- function(x, na.rm = FALSE)
{
  if(!na.rm)
  {
    stopifnot(!any(is.na(x)))
  }
  x
}
h(c(1, NA))
## Error: !any(is.na(x)) is not TRUE
For a more extensive set of human-friendly tests, use the assertive package:
library(assertive)
h <- function(x, na.rm = FALSE)
{
  if(!na.rm)
  {
    assert_all_are_not_na(x)
  }
  x
}
h(c(1, NA))
## Error: x contains NAs.
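The two base R approaches can be combined in one function. The sketch below is an illustration only, and the name h_checked is hypothetical: stopifnot guards against badly typed arguments, while stop produces a more descriptive message for the missing-value case.

```r
h_checked <- function(x, na.rm = FALSE)
{
  # terse checks on the argument types
  stopifnot(is.numeric(x), is.logical(na.rm), length(na.rm) == 1)
  if(!na.rm && anyNA(x))
  {
    # call. = FALSE leaves the function call out of the error message
    stop("'x' has ", sum(is.na(x)), " missing value(s).", call. = FALSE)
  }
  x
}

h_checked(c(1, 2))   # valid input is returned unchanged
```

Calling h_checked(c(1, NA)) would halt with the custom message, and h_checked("a") would be stopped by the stopifnot check.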
Some tasks are inherently risky. Reading from and writing to files or databases is notoriously error prone, since you don’t have complete control over the filesystem, or the network or database. In fact, any time that R interacts with other software (Java code via rJava, WinBUGS via R2WinBUGS, or any of the hundreds of other pieces of software that R can connect to), there is an inherent risk that something will go wrong.
For these dangerous tasks,[61] you need to decide what to do when problems occur. Sometimes it isn’t useful to stop execution when an error is thrown. For example, if you are looping over files importing them, then if one import fails you don’t want to just stop executing and lose all the data that you’ve successfully imported already.
In fact, this point generalizes: any time you are doing something risky in a loop, you don’t want to discard your progress if one of the iterations fails. In this next example, we try to convert each element of a list into a data frame:
to_convert <- list(
  first = sapply(letters[1:5], charToRaw),
  second = polyroot(c(1, 0, 0, 0, 1)),
  third = list(x = 1:2, y = 3:5)
)
If we run the code nakedly, it fails:
lapply(to_convert, as.data.frame)
## Error: arguments imply differing number of rows: 2, 3
Oops! The third element fails to convert because of differing element lengths, and we lose everything.
The simplest way of protecting against total failure is to wrap the failure-prone code inside a call to the try function:
result <- try(lapply(to_convert, as.data.frame))
Now, although the error will be printed to the console (you can suppress this by passing silent = TRUE), execution of code won’t stop.
If the code passed to a try function executes successfully (without throwing an error), then result will just be the result of the calculation, as usual. If the code fails, then result will be an object of class try-error. This means that after you’ve written a line of code that includes try, the next line should always look something like this:
if(inherits(result, "try-error"))
{
  #special error handling code
} else
{
  #code for normal execution
}
## NULL
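As a concrete instance of this pattern, here is a small sketch that substitutes a fallback value when a calculation fails. The variable names risky_result and problem are made up for illustration; the "condition" attribute is where try stores the original error object.

```r
# log() of a string throws an error; silent = TRUE stops it being printed
risky_result <- try(log("not a number"), silent = TRUE)

if(inherits(risky_result, "try-error"))
{
  # the original condition object is stored as an attribute
  problem <- conditionMessage(attr(risky_result, "condition"))
  message("Falling back to NA because: ", problem)
  risky_result <- NA_real_
}

risky_result
```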
Since you have to include this extra line every time, code using the try function is a bit ugly. A prettier alternative[62] is to use tryCatch. tryCatch takes an expression to safely run, just as try does, but also has error handling built into it.
To handle an error, you pass a function to an argument named error. This function accepts an error (technically, an object of class simpleError) and lets you manipulate, print, or ignore it as you see fit. If this sounds complicated, don’t worry: it’s easier in practice. In this next example, when an error is thrown, we print the error message and return an empty data frame:
tryCatch(
  lapply(to_convert, as.data.frame),
  error = function(e)
  {
    message("An error was thrown: ", e$message)
    data.frame()
  }
)
## An error was thrown: arguments imply differing number of rows: 2, 3
## data frame with 0 columns and 0 rows
tryCatch has one more trick: you can pass an expression to an argument named finally, which runs whether an error was thrown or not (just like the on.exit function we saw when we were connecting to databases).
Despite having played with try and tryCatch, we still haven’t solved our problem: when looping over things, if an error is thrown, we want to keep the results of the iterations that worked.
To achieve this, we need to put try or tryCatch inside the loop:
lapply(
  to_convert,
  function(x)
  {
    tryCatch(
      as.data.frame(x),
      error = function(e) NULL
    )
  }
)
## $first
##    x
## a 61
## b 62
## c 63
## d 64
## e 65
##
## $second
##                 x
## 1  0.7071+0.7071i
## 2 -0.7071+0.7071i
## 3 -0.7071-0.7071i
## 4  0.7071-0.7071i
##
## $third
## NULL
Since this is a common piece of code, the plyr package contains a function, tryapply, that deals with exactly this case in a cleaner fashion:
tryapply(to_convert, as.data.frame)
## $first
##    x
## a 61
## b 62
## c 63
## d 64
## e 65
##
## $second
##                 x
## 1  0.7071+0.7071i
## 2 -0.7071+0.7071i
## 3 -0.7071-0.7071i
## 4  0.7071-0.7071i
Eagle-eyed observers may notice that the failures are simply removed in this case.
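If plyr is not available, similar behavior can be approximated in base R by returning NULL on failure and then filtering the NULLs out. This is a sketch of that idea, not plyr's actual implementation; the names inputs, attempts, and succeeded are illustrative, and a smaller input list is used so that the block stands alone.

```r
inputs <- list(
  first  = 1:3,                      # converts without problems
  second = list(x = 1:2, y = 3:5)    # fails: differing numbers of rows
)

# run the conversion on each element, returning NULL on failure
attempts <- lapply(
  inputs,
  function(x) tryCatch(as.data.frame(x), error = function(e) NULL)
)

# keep only the elements that worked
succeeded <- Filter(Negate(is.null), attempts)
names(succeeded)
```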
All nontrivial software contains errors.[63] When problems happen, you need to be able to find where they occur, and hopefully find a way to fix them. This is especially true if it’s your own code. If the problem occurs in the middle of a simple script, you usually have access to all the variables, so it is trivial to locate the problem.
More often than not, problems occur somewhere deep inside a function inside another function inside another function. In this case, you need a strategy to inspect the state of the program at each level of the call stack. (“Call stack” is just jargon for the list of functions that have been called to get you to this point in the code.)
When an error is thrown, the traceback function tells you where the last error occurred. First, let’s define some functions in which the error can occur:
outer_fn <- function(x) inner_fn(x)
inner_fn <- function(x) exp(x)
Now let’s call outer_fn (which then calls inner_fn) with a bad input:
outer_fn(list(1))
## Error: non-numeric argument to mathematical function
traceback now tells us the functions that we called before tragedy struck (see Figure 16-2).
In general, if it isn’t an obvious bug, we don’t know where in the call stack the problem occurred. One reasonable strategy is to start in the function where the error was thrown, and work our way up the stack if we need to. To do this, we need a way to stop execution of the code close to the point where the error was thrown. One way to do this is to add a call to the browser function just before the error point (we know where the error occurred because we used traceback):
inner_fn <- function(x)
{
  browser()   #execution pauses here
  exp(x)
}
browser halts execution when it is reached, giving us time to inspect the program. A really good idea in most cases is to call ls.str to see the values of all the variables that are in play at the time. In this case we see that x is a list, not a numeric vector, causing exp to fail.
An alternative strategy for spotting errors is to set the global error option. This strategy is preferable when the error lies inside someone else’s package, where it is harder to stick a call to browser. (You can alter functions inside installed packages using the fixInNamespace function. The changes persist until you close R.)
The error option accepts a function with no arguments, which is called whenever an error is thrown. As a simple example, we can set it to print a message after the error has occurred, as shown in Figure 16-3.
While a sympathetic message may provide a sliver of consolation for the error, it isn’t very helpful in terms of fixing the problem. A much more useful alternative is provided in the recover function that ships with R. recover lets you step into any function in the call stack after an error has been thrown (see Figure 16-4).
You can also step through a function line by line using the debug function. This is a bit boring with trivial single-line functions like inner_fn and outer_fn, so we’ll test it on a more substantial offering. buggy_count, included in the learningr package, is a buggy version of the count function from the plyr package that fails in an obscure way when you pass it a factor. Pressing Enter at the command line without typing anything lets us step through it until we find the problem:
debug(buggy_count)
x <- factor(sample(c("male", "female"), 20, replace = TRUE))
buggy_count(x)
count (and by extension, our buggy_count) accepts a data frame or a vector as its first argument. If the df argument is a vector, then the function inserts it into a data frame.
Figure 16-5 shows what happens when we reach this part of the code. When df is a factor, we want it to be placed inside a data frame. Unfortunately, is.vector returns FALSE for factors, and the step is ignored. Factors aren’t considered to be vectors, because they have attributes other than names. What the code really should contain (and does in the proper version of plyr) is a call to is.atomic, which is TRUE for factors as well as other vector types, like numeric.
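The difference between the two checks is easy to confirm at the command line. This short sketch uses an invented factor, gender, and a plain numeric vector for comparison:

```r
gender <- factor(c("male", "female", "male"))

is.vector(gender)       # FALSE: factors carry levels and class attributes
is.atomic(gender)       # TRUE
attributes(gender)      # shows the attributes responsible

is.vector(c(1.5, 2.5))  # TRUE: a plain numeric vector has no attributes
is.atomic(c(1.5, 2.5))  # TRUE
```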
To exit the debugger, type Q at the command line. With the debug function, the debugger will be started every time that function is called. To turn off debugging, call undebug:
undebug(buggy_count)
As an alternative, use debugonce, which only calls the debugger the first time a function is called.[64]
To make sure that your code isn’t buggy and awful, it is important to test it. Unit testing is the concept of testing small chunks of code; in R this means testing at the functional level. (System or integration testing is the larger-scale testing of whole pieces of software, but that is more useful for application development than data analysis.)
Each time you change a function, you can break code in other functions that rely on it. This means that each time you change a function, you need to test everything that it could affect. Attempted manually, this is impossible, or at least time-consuming and boring enough that you won’t do it. Consequently, you need to automate the task. In R, you have two choices for this:
RUnit has “xUnit” syntax, meaning that it’s very similar to Java’s JUnit, .NET’s NUnit, Python’s PyUnit, and a whole other family of unit testing suites. This makes it easiest to learn if you’ve done unit testing in any other language.
testthat has its own syntax, and a few extra features. In particular, the caching of tests makes it much faster for large projects.
Let’s test the hypotenuse function we wrote when we first learned about functions in Functions. It uses the obvious algorithm that you might use for pen and paper calculations.[65] The function is included in the learningr package:
hypotenuse <- function(x, y)
{
  sqrt(x ^ 2 + y ^ 2)
}
In RUnit, each test is a function that takes no inputs. Each test compares the actual result of running some code (in this case, calling hypotenuse) to an expected value, using one of the check* functions contained in the package. In this next example we use checkEqualsNumeric, since we are comparing two numbers:
library(RUnit)
test.hypotenuse.3_4.returns_5 <- function()
{
  expected <- 5
  actual <- hypotenuse(3, 4)
  checkEqualsNumeric(expected, actual)
}
There is no universal naming convention for tests, but RUnit looks for functions with names beginning with test by default. The convention used here is designed to maximize clarity. Tests take the name form of test.name_of_function.description_of_inputs.returns_a_value.
Sometimes we want to make sure that a function fails in the correct way. For example, we can test that hypotenuse fails if no inputs are provided:
test.hypotenuse.no_inputs.fails <- function()
{
  checkException(hypotenuse())
}
Many algorithms suffer loss of precision when given very small or very large inputs, so it is good practice to test those conditions. The smallest and largest positive numeric values that R can represent are given by the double.xmin and double.xmax components of the built-in .Machine constant:
.Machine$double.xmin
## [1] 2.225e-308
.Machine$double.xmax
## [1] 1.798e+308
For the small and large tests, we pick values close to these limits. In the case of small numbers, we need to manually tighten the tolerance of the test. By default, checkEqualsNumeric considers its test passed when the actual result is within about 1e-8 of the expected result (it uses absolute, not relative differences). We set this value to be a few orders of magnitude smaller than the inputs to make sure that the test fails appropriately:
test.hypotenuse.very_small_inputs.returns_small_positive <- function()
{
  expected <- sqrt(2) * 1e-300
  actual <- hypotenuse(1e-300, 1e-300)
  checkEqualsNumeric(expected, actual, tolerance = 1e-305)
}
test.hypotenuse.very_large_inputs.returns_large_finite <- function()
{
  expected <- sqrt(2) * 1e300
  actual <- hypotenuse(1e300, 1e300)
  checkEqualsNumeric(expected, actual)
}
There are countless more possible tests; for example, what happens if we pass missing values or NULL or infinite values or character values or vectors or matrices or data frames, or we expect an answer in non-Euclidean space? Thorough testing works your imagination hard. Unleash your inner two-year-old and contemplate breaking stuff. On this occasion, we’ll stop here. Save all your tests into a file; RUnit defaults to looking for files with names that begin with “runit” and have a .R file extension. These tests can be found in the tests directory of the learningr package.
Now that we have some tests, we need to run them. This is a two-step process.
First, we define a test suite with defineTestSuite. This function takes a string for a name (used in its output), and a path to the directory where your tests are contained. If you’ve named your test functions or files in a nonstandard way, you can provide a pattern to identify them:
test_dir <- system.file("tests", package = "learningr")
suite <- defineTestSuite("hypotenuse suite", test_dir)
The second step is to run them with runTestSuite (additional line breaks have been added here as needed, to fit the formatting of the book):
runTestSuite(suite)
##
## Executing test function test.hypotenuse.3_4.returns_5 ...
## done successfully.
##
## Executing test function test.hypotenuse.no_inputs.fails ...
## done successfully.
##
## Executing test function
## test.hypotenuse.very_large_inputs.returns_large_finite ...
## Timing stopped at: 0 0 0 done successfully.
##
## Executing test function
## test.hypotenuse.very_small_inputs.returns_small_positive ...
## Timing stopped at: 0 0 0 done successfully.
## Number of test functions: 4
## Number of errors: 0
## Number of failures: 2
This runs each test that it finds and displays whether it passed, failed, or threw an error. In this case, you can see that the small and large input tests failed. So what went wrong?
The problem with our algorithm is that we have to square each input. Squaring big numbers makes them larger than the largest (double-precision) number that R can represent, so the result comes back as infinity. Squaring very small numbers makes them even smaller, so that R thinks they are zero. (There are better algorithms that avoid this problem; see the ?hypotenuse help page for links to a discussion of better algorithms for real-world use.)
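One widely used way around the overflow and underflow is to factor out the larger input before squaring. The sketch below illustrates that rescaling idea under the name stable_hypotenuse, which is hypothetical; it is not necessarily the algorithm that the help page links to.

```r
# The naive formula breaks down at the extremes
sqrt((1e300) ^ 2 + (1e300) ^ 2)     # overflows to Inf
sqrt((1e-300) ^ 2 + (1e-300) ^ 2)   # underflows to 0

stable_hypotenuse <- function(x, y)
{
  x <- abs(x)
  y <- abs(y)
  big <- pmax(x, y)
  small <- pmin(x, y)
  # small / big is at most 1, so squaring it cannot overflow;
  # the big == 0 case is handled separately to avoid returning NaN
  ifelse(big == 0, 0, big * sqrt(1 + (small / big) ^ 2))
}

stable_hypotenuse(3, 4)
stable_hypotenuse(1e300, 1e300)     # finite
stable_hypotenuse(1e-300, 1e-300)   # small but positive
```

The cost of the rescaling is an extra division, which is usually a fair trade for results that stay finite and nonzero near the limits of double precision.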
RUnit has no built-in checkWarning function to test for warnings. To test that a warning has been thrown, we need a trick: we set the warn option to 2 so that warnings become errors, and then restore it to its original value when the test function exits using on.exit. Recall that code inside on.exit is run when a function exits, regardless of whether it completed successfully or an error was thrown:
test.log.minus1.throws_warning <- function()
{
  old_ops <- options(warn = 2)   #warnings become errors
  on.exit(options(old_ops))      #restore old behavior
  checkException(log(-1))
}
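The same kind of check can be sketched in base R without changing any global options, by giving tryCatch a warning handler. The helper name threw_warning is made up for this illustration and is not part of RUnit.

```r
# Returns TRUE if evaluating expr signals a warning, FALSE otherwise
threw_warning <- function(expr)
{
  tryCatch(
    {
      expr    # forces evaluation of the argument
      FALSE
    },
    warning = function(w) TRUE
  )
}

threw_warning(log(-1))   # log of a negative number warns
threw_warning(log(1))    # no warning here
```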
Though testthat has a different syntax, the principles are almost the same. The main difference is that rather than each test being a function, it is a call to one of the expect_* functions in the package. For example, expect_equal is the equivalent of RUnit’s checkEqualsNumeric. The translated tests (also available in the tests directory of the learningr package) look like this:
library(testthat)
expect_equal(hypotenuse(3, 4), 5)
expect_error(hypotenuse())
expect_equal(hypotenuse(1e-300, 1e-300), sqrt(2) * 1e-300, tolerance = 1e-305)
expect_equal(hypotenuse(1e300, 1e300), sqrt(2) * 1e300)
To run this, we call test_file with the name of the file containing tests, or test_dir with the name of the directory containing the files containing the tests. Since we have only one file, we’ll use test_file:
filename <- system.file(
  "tests",
  "testthat_hypotenuse_tests.R",
  package = "learningr"
)
test_file(filename)
## ..12
##
## 1. Failure: (unknown) -----------------------------------------------------
## learningr::hypotenuse(1e-300, 1e-300) not equal to sqrt(2) * 1e-300
## Mean relative difference: 1
##
## 2. Failure: (unknown) -----------------------------------------------------
## learningr::hypotenuse(1e+300, 1e+300) not equal to sqrt(2) * 1e+300
## Mean relative difference: Inf
There are two variations for running the tests: test_that tests code that you type at the command line (or, more likely, copy and paste), and test_package runs all tests from a package, making it easier to test nonexported functions.
Unlike with RUnit, warnings can be tested for directly via expect_warning:
expect_warning(log(-1))
The source code that we write, as it exists in a text editor, is just a bunch of strings. When we run that code, R needs to interpret what those strings contain and perform the appropriate action. It does that by first turning the strings into one of several language variable types. And sometimes we want to do the opposite thing, converting language variables into strings.
Both these tasks are rather advanced, dark magic. As is the case with magic in every movie ever, if you use it without understanding what you are doing, you’ll inevitably suffer nasty, unexpected consequences. On the other hand, used knowledgeably and sparingly, there are some useful tricks that you can put up your sleeve.
Whenever you type a line of code at the command line, R has to turn that string into something it understands. Here’s a simple call to the arctangent function:
atan(c(-Inf, -1, 0, 1, Inf))
## [1] -1.5708 -0.7854 0.0000 0.7854 1.5708
We can see what happens to this line of code in slow motion by using the quote function. quote takes a function call like the one in the preceding line, and returns an object of class call, which represents an “unevaluated function call”:
(quoted_r_code <- quote(atan(c(-Inf, -1, 0, 1, Inf))))
## atan(c(-Inf, -1, 0, 1, Inf))
class(quoted_r_code)
## [1] "call"
The next step that R takes is to evaluate that call. We can mimic this step using the eval function:
eval(quoted_r_code)
## [1] -1.5708 -0.7854 0.0000 0.7854 1.5708
The general case, then, is that to execute code that you type, R does something like eval(quote(the stuff you typed at the command line)).
To understand the call type a little better, let’s convert it to a list:
as.list(quoted_r_code)
## [[1]]
## atan
##
## [[2]]
## c(-Inf, -1, 0, 1, Inf)
The first element is the function that was called, and any additional elements contain the arguments that were passed to it.
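Because a call behaves like a list, its pieces can be inspected and even replaced before evaluation. The following sketch swaps the function in the quoted call for sign; the choice of sign is arbitrary and only for illustration.

```r
quoted_r_code <- quote(atan(c(-Inf, -1, 0, 1, Inf)))

quoted_r_code[[1]]                      # the function being called, atan
quoted_r_code[[1]] <- as.name("sign")   # swap in a different function

deparse(quoted_r_code)                  # the modified call, as a string
eval(quoted_r_code)                     # now evaluates sign, not atan
```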
One important thing to remember is that in R, more or less everything is a function. That’s a slight exaggeration, but operators like +; language constructs like switch, if, and for; and assignment and indexing are functions:
vapply(
  list(`+`, `if`, `for`, `<-`, `[`, `[[`),
  is.function,
  logical(1)
)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE
The upshot of this is that anything that you type at the command line is really a function call, which is why this input is turned into call objects.
All of this was a long-winded way of saying that sometimes we want to take text that is R code and get R to execute it. In fact, we’ve already seen two functions that do exactly that for special cases: assign takes a string and assigns a value to a variable with that name, and its reverse, get, retrieves a variable based upon a string input.
Rather than just limiting ourselves to assigning and retrieving variables, we might occasionally decide that we want to take an arbitrary string of R code and execute it. You may have noticed that when we use the quote function, we just type the R code directly into it, without wrapping it in (ahem) quotes. If our input is a string (as in a character vector of length one), then we have a slightly different problem: we must “parse” the string. Naturally, this is done with the parse function.
parse returns an expression object rather than a call. Before you get frightened, note that an expression is basically just a list of calls.
The exact nature of calls and expressions is deep, dark magic, and I don’t want to be responsible for the ensuing zombie apocalypse when you try to raise the dead using R. If you are interested in arcana, read Chapter 6 of the R Language Definition manual that ships with R.
When we call parse in this way, we must explicitly name the text argument:
parsed_r_code <- parse(text = "atan(c(-Inf, -1, 0, 1, Inf))")
class(parsed_r_code)
## [1] "expression"
Just as with the quoted R code, we use eval to evaluate it:
eval(parsed_r_code)
## [1] -1.5708 -0.7854 0.0000 0.7854 1.5708
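The list-like nature of an expression object becomes clearer when the string holds more than one statement. In this sketch (the variable names parsed and a are illustrative), the string contains two statements, so the parsed expression has two elements, and eval runs them in order, returning the value of the last one.

```r
parsed <- parse(text = "a <- 2; a * 3")

length(parsed)         # one element per statement
deparse(parsed[[2]])   # the second statement, turned back into a string
eval(parsed)           # evaluates both; the result is that of the last
```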
This sort of mucking about with evaluating strings is a handy trick, but the resulting code is usually fragile and fiendish to debug, making your code unmaintainable. This is the zombie (code) apocalypse mentioned above.
There are a few occasions when we want to solve the opposite problem: turning code into a string. The most common reason for this is to use the name of a variable that was passed into a function. The base histogram-drawing function, hist, includes a default title that tells you the name of the data variable:
random_numbers <- rt(1000, 2)
hist(random_numbers)
To replicate this technique ourselves, we need two functions: substitute and deparse. substitute takes some code and returns a language object. That usually means a call, like we would have created using quote, but occasionally it’s a name object, which is a special type that holds variable names. (Don’t worry about the details, this section is called “Magic” for a reason.)
The next step is to turn this language object into a string. This is called deparsing. The technique can be very useful for providing helpful error messages when you check user inputs to functions. Let’s see the deparse-substitute combination in action:
divider <- function(numerator, denominator)
{
  if(denominator == 0)
  {
    denominator_name <- deparse(substitute(denominator))
    warning("The denominator, ", sQuote(denominator_name), ", is zero.")
  }
  numerator / denominator
}
top <- 3
bottom <- 0
divider(top, bottom)
## Warning: The denominator, 'bottom', is zero.
## [1] Inf
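Stripped of the surrounding logic, the technique reduces to a single line. In this sketch, arg_name is a hypothetical helper; note that its argument is never evaluated, so the variables named in the calls do not need to exist.

```r
# Returns the code that was passed as x, as a string
arg_name <- function(x) deparse(substitute(x))

arg_name(some_undefined_variable)   # a name object is deparsed
arg_name(a + b)                     # so is a call
```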
substitute has one more trick up its sleeve when used in conjunction with eval. eval lets you pass it an environment or a data frame, so you can tell R where to look to evaluate the expression.
As a simple example, we can use this trick to retrieve the levels of the Gender column of the hafu dataset:
eval(substitute(levels(Gender)), hafu)
## [1] "F" "M"
This is exactly how the with function works:
with(hafu, levels(Gender))
## [1] "F" "M"
In fact, many functions use the technique: subset uses it in several places, and lattice plots use the trick to parse their formulae. There are a few variations on the trick described in Thomas Lumley’s “Standard nonstandard evaluation rules.”
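To make the mechanism concrete, here is a sketch of a stripped-down imitation of with. The function my_with and the small data frame people are invented for this illustration, and the real with does more than this.

```r
# Evaluate expr using the columns of data as variables, falling back to
# the caller's environment for anything that isn't a column
my_with <- function(data, expr)
{
  eval(substitute(expr), data, enclos = parent.frame())
}

people <- data.frame(Gender = factor(c("F", "M", "F")))

my_with(people, levels(Gender))
with(people, levels(Gender))      # the built-in gives the same answer
```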
Most R code that we’ve seen so far is functional-programming-inspired imperative programming. That is, functions are first-class objects, but we usually end up with a data-analysis script that executes one line at a time.
In a few circumstances, it is useful to use an object-oriented programming (OOP) style. This means that data is stored inside a class along with the functions that are allowed to act on it. It is an excellent tool for managing complexity in larger programs, and is particularly suited to GUI development (in R or elsewhere). See Michael Lawrence’s Programming Graphical User Interfaces in R for more on that topic.
R has six (count ‘em) different OOP systems, but don’t let that worry you—there are only two of them that you’ll need for new projects.
Three systems are built into R: S3, S4, and reference classes.
Three other systems are available in add-on packages (but for new code, you will usually want to use reference classes instead):
proto is a lightweight wrapper around environments for prototype-based programming.
R.oo extends S3 into a fully fledged OOP system.
OOP is a precursor to reference classes, now defunct.
In many object-oriented programming languages, functions are called methods. In R, the two words are interchangeable, but “method” is often used in an OOP context.
Sometimes we want a function to behave differently depending upon the type of input. A classic example is the print function, which gives a different style of output for different variables. S3 lets us call a different function for printing different kinds of variables, without having to remember the names of each one.
The print function is very simple (just one line, in fact):
print
## function (x, ...)
## UseMethod("print")
## <bytecode: 0x0000000018fad228>
## <environment: namespace:base>
It takes an input, x (and ...; the ellipsis is necessary), and calls UseMethod("print"). UseMethod checks the class of x and looks for another function named print.class_of_x, calling it if it is found. If it can’t find a function of that name, it tries to call print.default.
For example, if we want to print a Date variable, then we can just type:
today <- Sys.Date()
print(today)
## [1] "2013-07-17"
print calls the Date-specific function print.Date:
print.Date
## function (x, max = NULL, ...)
## {
##     if (is.null(max))
##         max <- getOption("max.print", 9999L)
##     if (max < length(x)) {
##         print(format(x[seq_len(max)]), max = max, ...)
##         cat(" [ reached getOption(\"max.print\") -- omitted",
##             length(x) - max, "entries ]\n")
##     }
##     else print(format(x), max = max, ...)
##     invisible(x)
## }
## <bytecode: 0x0000000006dc19f0>
## <environment: namespace:base>
Inside print.Date, our date is converted to a character vector (via format), and then print is called again. There is no print.character function, so this time UseMethod delegates to print.default, at which point our date string appears in the console.
If a class-specific method can’t be found, and there is no default method, then an error is thrown.
You can see all the available methods for a function with the methods function. The print function has over 100 methods, so here we just show the first few:
head(methods(print))
## [1] "print.abbrev"      "print.acf"         "print.AES"
## [4] "print.anova"       "print.Anova"       "print.anova.loglm"
methods(mean)
## [1] mean.Date     mean.default  mean.difftime mean.POSIXct  mean.POSIXlt
## [6] mean.times*   mean.yearmon* mean.yearqtr* mean.zoo*
##
## Non-visible functions are asterisked
If you use dots in your function names, like data.frame, then it can get confusing as to which S3 method gets called. For example, print.data.frame could mean a print.data method for a frame input, as well as the correct sense of a print method for a data.frame object. Consequently, using lower_under_case or lowerCamelCase is preferred for new function names.
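Writing your own S3 generic follows the same pattern as print. The sketch below defines a hypothetical generic named describe, with one class-specific method and a default; none of these functions exist in base R.

```r
# The generic does nothing except dispatch on the class of x
describe <- function(x, ...) UseMethod("describe")

# Called for numeric inputs
describe.numeric <- function(x, ...)
{
  paste("a numeric vector of length", length(x))
}

# Called when no class-specific method is found
describe.default <- function(x, ...)
{
  "something without a specific method"
}

describe(c(1.5, 2))   # dispatches to describe.numeric
describe("a")         # falls through to describe.default
```

Removing describe.default would make the second call throw an error, as described above.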
Reference classes are closer to a classical OOP system than S3 and S4, and should be moderately intuitive to anyone who has used classes in C++ or its derivatives.
A class is the general template for how the variables should be structured. An object is a particular instance of the class. For example, 1:10
is an object of class numeric
.
The setRefClass function creates the template for a class. In R terminology, it’s called a class generator. In some other languages, it would be called a class factory.
Let’s try to build a class for a 2D point as an example. A call to setRefClass looks like this:
my_class_generator <- setRefClass(
  "MyClass",
  fields = list(
    #data variables are defined here
  ),
  methods = list(
    #functions to operate on that data go here
    initialize = function(...)
    {
      #initialize is a special function called
      #when an object is created.
    }
  )
)
Our class needs x and y coordinates to store its location, and we want these to be numeric. If we didn’t care about the class of x and y, we could declare them with the special value ANY.
In the following example, we declare x and y to be numeric:
point_generator <- setRefClass(
  "point",
  fields = list(
    x = "numeric",
    y = "numeric"
  ),
  methods = list(
    #TODO
  )
)
This means that if we try to assign them values of another type, an error will be thrown. Purposely restricting user input may sound counterintuitive, but it can save you from having more obscure bugs further down the line.
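Here is a minimal sketch of that restriction in action. The typed_point class is a hypothetical stand-in for our point class (it has no methods yet, so fields are set by name in the call to new), and try is used so that the failed assignment doesn’t stop the script.

```r
library(methods)

# A hypothetical class with two numeric fields and no methods.
typed_point_generator <- setRefClass(
  "typed_point",
  fields = list(
    x = "numeric",
    y = "numeric"
  )
)

p <- typed_point_generator$new(x = 1, y = 2)

# Assigning another number is fine.
p$x <- 10

# Assigning a string violates the declared field class, so R throws
# an error; try() captures it and the field keeps its old value.
bad_assignment <- try(p$x <- "ten", silent = TRUE)
assignment_failed <- inherits(bad_assignment, "try-error")
```

The error arrives at the moment of the bad assignment, rather than later on when some calculation tries to do arithmetic on a string.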
Next we need to add an initialize method. This is called every time we create a point object. This method takes x and y input numbers and assigns them to our x and y fields. There are three interesting things to note about it:
If the first line of a method is a string, it is treated as the help text for that method.
The global assignment operator, <<-, is used to assign to a field. Local assignment (using <-) just creates a local variable inside the method.
It is best to let initialize work without being passed any arguments, since it makes inheritance easier, as we’ll see in a moment. This is why the x and y arguments have default values.[66]
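The difference between the two kinds of assignment is easy to get wrong, so here is a small sketch. The counter class and its incrementLocally and increment methods are hypothetical names made up for this illustration. R may warn about the local assignment when the class is defined, so the definition is wrapped in suppressWarnings to keep the demonstration quiet.

```r
library(methods)

# A hypothetical class with one numeric field and two methods that
# differ only in the assignment operator they use.
counter_generator <- suppressWarnings(setRefClass(
  "counter",
  fields = list(
    count = "numeric"
  ),
  methods = list(
    initialize = function(count = 0)
    {
      count <<- count
    },
    incrementLocally = function()
    {
      "Local assignment: creates a local variable, field unchanged."
      count <- count + 1
      invisible(.self)
    },
    increment = function()
    {
      "Assignment with <<-: updates the field."
      count <<- count + 1
      invisible(.self)
    }
  )
))

a_counter <- counter_generator$new()   #works without arguments
a_counter$incrementLocally()
after_local <- a_counter$count         #still 0
a_counter$increment()
after_global <- a_counter$count        #now 1
```

Because initialize has a default value for its argument, the call to new succeeds with no inputs at all.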
With the initialize method, our class generator now looks like this:
point_generator <- setRefClass(
  "point",
  fields = list(
    x = "numeric",
    y = "numeric"
  ),
  methods = list(
    initialize = function(x = NA_real_, y = NA_real_)
    {
      "Assign x and y upon object creation."
      x <<- x
      y <<- y
    }
  )
)
Our point class generator is finished, so we can now create a point object. Every generator has a new method for this purpose. The new method calls initialize (if it exists) as part of the object creation process:
(a_point <- point_generator$new(5, 3))
## Reference class object of class "point"
## Field "x":
## [1] 5
## Field "y":
## [1] 3
Generators also have a help method that returns the help string for a method that you specify:
point_generator$help("initialize")
## Call:
## $initialize(x = , y = )
##
##
## Assign x and y upon object creation.
You can provide a more traditional interface to object-oriented code by wrapping class methods inside other functions. This can be useful if you want to distribute your code to other people without having to teach them about OOP:
create_point <- function(x, y)
{
  point_generator$new(x, y)
}
At the moment, the class isn’t very interesting because it doesn’t do anything. Let’s redefine it with some more methods:
point_generator <- setRefClass(
  "point",
  fields = list(
    x = "numeric",
    y = "numeric"
  ),
  methods = list(
    initialize = function(x = NA_real_, y = NA_real_)
    {
      "Assign x and y upon object creation."
      x <<- x
      y <<- y
    },
    distanceFromOrigin = function()
    {
      "Euclidean distance from the origin"
      sqrt(x ^ 2 + y ^ 2)
    },
    add = function(point)
    {
      "Add another point to this point"
      x <<- x + point$x
      y <<- y + point$y
      .self
    }
  )
)
These additional methods belong to point objects, unlike new and help, which belong to the class generator (in OOP terminology, new and help are static methods):
a_point <- create_point(3, 4)
a_point$distanceFromOrigin()
## [1] 5
another_point <- create_point(4, 2)
(a_point$add(another_point))
## Reference class object of class "point"
## Field "x":
## [1] 7
## Field "y":
## [1] 6
As well as new and help, generator classes have a few more methods. fields and methods respectively list the fields and methods of that class, and lock makes a field read-only:
point_generator$fields()
##         x         y
## "numeric" "numeric"
point_generator$methods()
##  [1] "add"                "callSuper"          "copy"
##  [4] "distanceFromOrigin" "export"             "field"
##  [7] "getClass"           "getRefClass"        "import"
## [10] "initFields"         "initialize"         "show"
## [13] "trace"              "untrace"            "usingMethods"
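The lock method isn’t demonstrated above, so here is a brief sketch of how it might be used. The locked_point class is a hypothetical example. Note that the field is locked before any objects are created, and that a locked field can still be given its initial value when the object is built.

```r
library(methods)

# A hypothetical class; x will be made read-only, y stays writable.
locked_point_generator <- setRefClass(
  "locked_point",
  fields = list(
    x = "numeric",
    y = "numeric"
  )
)

# Lock the field before creating any objects from the generator.
locked_point_generator$lock("x")

# The initial value can be set once, during object creation.
p <- locked_point_generator$new(x = 1, y = 2)

# The unlocked field can be changed as usual.
p$y <- 5

# Changing the locked field throws an error, captured here by try().
locked_result <- try(p$x <- 99, silent = TRUE)
lock_enforced <- inherits(locked_result, "try-error")
```

Locking is useful for fields that identify an object and should never change after creation.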
Some other methods can be called either from the generator object or from instance objects. show prints the object, trace and untrace let you use the trace function on a method, export converts the object to another class type, and copy makes a copy.
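The copy method matters more than it might first appear, because reference class objects are not duplicated when you assign them to a new variable. The sketch below uses a hypothetical account class to show the difference between a second name for the same object and a genuine copy.

```r
library(methods)

# A hypothetical class with a single numeric field.
account_generator <- setRefClass(
  "account",
  fields = list(
    balance = "numeric"
  )
)

original <- account_generator$new(balance = 100)

# Ordinary assignment gives a second name for the same object.
alias <- original

# copy() creates a separate object with the same field values.
duplicate <- original$copy()

# Changing the object through its alias also changes the original,
# but the duplicate is unaffected.
alias$balance <- 50
original_balance <- original$balance     #50
duplicate_balance <- duplicate$balance   #100
```

This is the opposite of how most R variables behave, where assignment followed by modification leaves the original value untouched.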
Reference classes support inheritance, where classes can have children to extend their functionality. For example, we can create a three-dimensional point class that contains our original point class, but includes an extra z coordinate.
A class inherits fields and methods from another class by using the contains argument:
three_d_point_generator <- setRefClass(
  "three_d_point",
  fields = list(
    z = "numeric"
  ),
  contains = "point",        #this line lets us inherit
  methods = list(
    initialize = function(x, y, z)
    {
      "Assign x, y, and z upon object creation."
      x <<- x
      y <<- y
      z <<- z
    }
  )
)
a_three_d_point <- three_d_point_generator$new(3, 4, 5)
At the moment, our distanceFromOrigin function is wrong, since it doesn’t take the z dimension into account:
a_three_d_point$distanceFromOrigin()     #wrong!
## [1] 5
We need to override it in order for it to make sense in the new class. This is done by adding a method with the same name to the class generator:
three_d_point_generator <- setRefClass(
  "three_d_point",
  fields = list(
    z = "numeric"
  ),
  contains = "point",
  methods = list(
    initialize = function(x, y, z)
    {
      "Assign x, y, and z upon object creation."
      x <<- x
      y <<- y
      z <<- z
    },
    distanceFromOrigin = function()
    {
      "Euclidean distance from the origin"
      sqrt(x ^ 2 + y ^ 2 + z ^ 2)
    }
  )
)
To use the updated definition, we need to recreate our point:
a_three_d_point <- three_d_point_generator$new(3, 4, 5)
a_three_d_point$distanceFromOrigin()
## [1] 7.071
Sometimes we want to use methods from the parent class (a.k.a. superclass). The callSuper method does exactly this, so we could have written our 3D distanceFromOrigin (inefficiently) like this:
distanceFromOrigin = function()
{
  "Euclidean distance from the origin"
  two_d_distance <- callSuper()
  sqrt(two_d_distance ^ 2 + z ^ 2)
}
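Since that method is only a fragment, here is a self-contained sketch of callSuper at work. The class names flat_point and solid_point are hypothetical stand-ins for our 2D and 3D point classes, and they have no initialize methods, so fields are set by name in the call to new.

```r
library(methods)

# Hypothetical parent class with a 2D distance method.
flat_point_generator <- setRefClass(
  "flat_point",
  fields = list(
    x = "numeric",
    y = "numeric"
  ),
  methods = list(
    distanceFromOrigin = function()
    {
      "Euclidean distance from the origin in 2D"
      sqrt(x ^ 2 + y ^ 2)
    }
  )
)

# Hypothetical child class that overrides the method, reusing the
# parent's calculation via callSuper.
solid_point_generator <- setRefClass(
  "solid_point",
  fields = list(
    z = "numeric"
  ),
  contains = "flat_point",
  methods = list(
    distanceFromOrigin = function()
    {
      "Euclidean distance from the origin in 3D"
      two_d_distance <- callSuper()
      sqrt(two_d_distance ^ 2 + z ^ 2)
    }
  )
)

a_solid_point <- solid_point_generator$new(x = 3, y = 4, z = 12)
distance <- a_solid_point$distanceFromOrigin()
distance
```

With these inputs the parent method contributes a 2D distance of 5, and the overriding method combines it with z to give 13.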
OOP is a big topic, and even limited to reference classes, it’s worth a book in itself. John Chambers (creator of the S language, R Core member, and author of the reference classes code) is currently writing a book on OOP in R. Until that materializes, the ?ReferenceClasses help page is currently the definitive reference-class reference.
try or tryCatch lets you control how you handle errors.
The debug function and its relatives help you debug functions.
The RUnit and testthat packages let you do unit testing.
try if an error was thrown?
checkException function. What is the testthat equivalent?
print function do different things for different types of input?
The harmonic mean is defined as 1 / mean(1 / x), where x contains positive numbers. Write a harmonic mean function that gives appropriate feedback when the input is not numeric or contains nonpositive values. [10]
Using either RUnit or testthat, write some tests for your harmonic mean function. You should check that the harmonic mean of 1, 2, and 4 equals 12 / 7; that passing no inputs throws an error; that passing missing values behaves correctly; and that it behaves as you intended for nonnumeric and nonpositive inputs. Keep testing until all the tests pass! [15]
Modify your harmonic mean function so that its output has class harmonic. Now write an S3 print method for this class that displays the message “The harmonic mean is y,” where y is the harmonic mean. [10]
[61] OK, connecting to a file isn’t wrestling an angry bear, but it’s high-risk in programming terms.
[62] Don’t underestimate the importance of pretty code. You’ll spend more time reading code than writing it.
[63] Space shuttle software was reputed to contain just one bug in 420,000 lines of code, but that level of formal development methodology, code peer-reviewing, and extensive testing doesn’t come cheap.
[64] As Tobias Verbeke of Open Analytics once quipped, “debugonce is a very optimistic function. I think debugtwice might have been better.”
[65] If you’ve just recoiled in horror at the phrase “pen and paper calculations,” congratulations! You are well on your way to becoming an R user.
[66] In case you were wondering, NA_real_ is a missing number. Usually for missing values we just use NA and let R figure out the type that it needs to be, but in this case, because we specified that the fields must be numeric, we need to explicitly state the type.