Chapter 16. Programming

Writing data analysis code is hard. This chapter is about what happens when things go wrong, and how to avoid that happening in the first place. We start with the different types of feedback that you can give to indicate a problem, working up to errors. Then we look at how to handle those errors when they are thrown, and how to debug code to eliminate the bad errors. A look at unit testing frameworks gives you the skills to avoid writing buggy code.

Next, we see some magic tricks: converting strings into code and code into strings (“Just like that!” as Tommy Cooper used to say). The chapter concludes with an introduction to some of the object-oriented programming systems in R.

Chapter Goals

After reading this chapter, you should:

  • Know how to provide feedback to the user through messages, warnings, and errors
  • Be able to gracefully handle errors
  • Understand a few techniques for debugging code
  • Be able to use the RUnit and testthat unit testing frameworks
  • Know how to convert strings to R expressions and back again
  • Understand the basics of the S3 and reference class object-oriented programming systems

Messages, Warnings, and Errors

We’ve seen the print function on many occasions for displaying variables to the console. For displaying diagnostic information about the state of the program, R has three functions. In increasing order of severity, they are message, warning, and stop.

message concatenates its inputs without spaces and writes them to the console. Some common uses are providing status updates for long-running functions, notifying users of new behavior when you’ve changed a function, and providing information on default arguments:

f <- function(x)
{
  message("'x' contains ", toString(x))
  x
}
f(letters[1:5])
## 'x' contains a, b, c, d, e
## [1] "a" "b" "c" "d" "e"

The main advantage of using message over print (or the lower-level cat) is that the user can turn off their display. It may seem trivial, but when you are repeatedly running the same code, not seeing the same message 100 times can have a wonderful effect on morale:

suppressMessages(f(letters[1:5]))
## [1] "a" "b" "c" "d" "e"

Warnings behave very similarly to messages, but have a few extra features to reflect their status as indicators of bad news. Warnings should be used when something has gone wrong, but not so wrong that your code should just give up. Common use cases are bad user inputs, poor numerical accuracy, or unexpected side effects:

g <- function(x)
{
  if(any(x < 0))
  {
    warning("'x' contains negative values: ", toString(x[x < 0]))
  }
  x
}
g(c(3, -7, 2, -9))
## Warning: 'x' contains negative values: -7, -9
## [1]  3 -7  2 -9

As with messages, warnings can be suppressed:

suppressWarnings(g(c(3, -7, 2, -9)))
## [1]  3 -7  2 -9

There is a global option, warn, that determines how warnings are handled. By default warn takes the value 0, which means that warnings are displayed when your code has finished running.

You can see the current level of the warn option using getOption:

getOption("warn")
## [1] 0

If you change this value to be less than zero, all warnings are ignored:

old_ops <- options(warn = -1)
g(c(3, -7, 2, -9))
## [1]  3 -7  2 -9

It is usually dangerous to completely turn off warnings, though, so you should reset the options to their previous state using:

options(old_ops)

Setting warn to 1 means that warnings are displayed as they occur, and a warn value of 2 or more means that all warnings are turned into errors.
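
As a rough sketch of what a warn value of 2 does, the following code redefines g from earlier, temporarily raises the warning level, and uses tryCatch (covered in the next section) to show that the warning now arrives as an error. The variable name outcome and the text returned by the handler are only for illustration:

```r
g <- function(x)
{
  if(any(x < 0))
  {
    warning("'x' contains negative values: ", toString(x[x < 0]))
  }
  x
}

old_ops <- options(warn = 2)   #warnings become errors
outcome <- tryCatch(
  g(c(3, -7, 2, -9)),
  error = function(e) "g threw an error"
)
options(old_ops)               #restore the previous warning level
outcome
## [1] "g threw an error"
```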

You can access the last warning by typing last.warning.

I mentioned earlier that if the warn option is set to 0, then warnings are shown when your code finishes running. Actually, it’s a little more complicated than that. If 10 or fewer warnings were generated, then this is what happens. But if there were more than 10 warnings, you get a message stating how many warnings were generated, and you have to type warnings() to see them. This is demonstrated in Figure 16-1.

Figure 16-1. Where there are more than 10 warnings, use warnings to see them

Errors are the most serious condition, and throwing them halts further execution. Errors should be used when a mistake has occurred or you know a mistake will occur. Common reasons include bad user input that can’t be corrected (by using an as.* function, for example), the inability to read from or write to a file, or a severe numerical error:

h <- function(x, na.rm = FALSE)
{
  if(!na.rm && any(is.na(x)))
  {
    stop("'x' has missing values.")
  }
  x
}
h(c(1, NA))
## Error: 'x' has missing values.

stopifnot throws an error if any of the expressions passed to it evaluate to something that isn’t true. It provides a simple way of checking that the state of your program is as expected:

h <- function(x, na.rm = FALSE)
{
  if(!na.rm)
  {
    stopifnot(!any(is.na(x)))
  }
  x
}
h(c(1, NA))
## Error: !any(is.na(x)) is not TRUE
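
stopifnot accepts several conditions in one call and stops at the first one that fails. Here is a small sketch using a hypothetical function, safe_sqrt, that is not part of any package. The failing call is wrapped in tryCatch (covered later in this chapter) so that we can keep the error message rather than halting:

```r
safe_sqrt <- function(x)
{
  #each condition must be TRUE, or an error is thrown
  stopifnot(is.numeric(x), !any(is.na(x)), all(x >= 0))
  sqrt(x)
}
safe_sqrt(c(1, 4, 9))
## [1] 1 2 3

#the third condition fails here; the exact wording of the
#message varies between versions of R
msg <- tryCatch(
  safe_sqrt(c(1, -4)),
  error = function(e) conditionMessage(e)
)
```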

For a more extensive set of human-friendly tests, use the assertive package:

library(assertive)
h <- function(x, na.rm = FALSE)
{
  if(!na.rm)
  {
    assert_all_are_not_na(x)
  }
  x
}
h(c(1, NA))
## Error: x contains NAs.

Error Handling

Some tasks are inherently risky. Reading from and writing to files or databases is notoriously error prone, since you don’t have complete control over the filesystem, or the network or database. In fact, any time that R interacts with other software (Java code via rJava, WinBUGS via R2WinBUGS, or any of the hundreds of other pieces of software that R can connect to), there is an inherent risk that something will go wrong.

For these dangerous tasks,[61] you need to decide what to do when problems occur. Sometimes it isn’t useful to stop execution when an error is thrown. For example, if you are importing files in a loop and one import fails, you don’t want to stop executing and lose all the data that you have already imported successfully.

In fact, this point generalizes: any time you are doing something risky in a loop, you don’t want to discard your progress if one of the iterations fails. In this next example, we try to convert each element of a list into a data frame:

to_convert <- list(
  first  = sapply(letters[1:5], charToRaw),
  second = polyroot(c(1, 0, 0, 0, 1)),
  third  = list(x = 1:2, y = 3:5)
)

If we run the code nakedly, it fails:

lapply(to_convert, as.data.frame)
## Error: arguments imply differing number of rows: 2, 3

Oops! The third element fails to convert because of differing element lengths, and we lose everything.

The simplest way of protecting against total failure is to wrap the failure-prone code inside a call to the try function:

result <- try(lapply(to_convert, as.data.frame))

Now, although the error will be printed to the console (you can suppress this by passing silent = TRUE), execution of code won’t stop.

If the code passed to a try function executes successfully (without throwing an error), then result will just be the result of the calculation, as usual. If the code fails, then result will be an object of class try-error. This means that after you’ve written a line of code that includes try, the next line should always look something like this:

if(inherits(result, "try-error"))
{
  #special error handling code
} else
{
  #code for normal execution
}
## NULL
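
To make the pattern concrete, here is a minimal sketch. It assumes a hypothetical helper, risky_log, that throws an error for non-numeric input; the fallback variable is likewise just for illustration:

```r
risky_log <- function(x)
{
  if(!is.numeric(x)) stop("'x' must be numeric")
  log(x)
}

result <- try(risky_log("a"), silent = TRUE)
if(inherits(result, "try-error"))
{
  #the original error is stored in the "condition" attribute
  message("Falling back to NA: ", conditionMessage(attr(result, "condition")))
  fallback <- NA_real_
} else
{
  fallback <- result
}
```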

Since you have to include this extra line every time, code using the try function is a bit ugly. A prettier alternative[62] is to use tryCatch. tryCatch takes an expression to safely run, just as try does, but also has error handling built into it.

To handle an error, you pass a function to an argument named error. This function accepts an error (technically, an object of class simpleError) and lets you manipulate, print, or ignore it as you see fit. If this sounds complicated, don’t worry: it’s easier in practice. In this next example, when an error is thrown, we print the error message and return an empty data frame:

tryCatch(
  lapply(to_convert, as.data.frame),
  error = function(e)
  {
    message("An error was thrown: ", e$message)
    data.frame()
  }
)
## An error was thrown: arguments imply differing number of rows: 2, 3
## data frame with 0 columns and 0 rows
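
If you are curious about what the handler receives, one way to look is to have the handler return the error object itself. This is only a sketch for exploration; the error text is made up:

```r
e <- tryCatch(
  stop("something broke"),
  error = function(e) e   #hand back the error object unchanged
)
class(e)
## [1] "simpleError" "error"       "condition"
conditionMessage(e)
## [1] "something broke"
```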

tryCatch has one more trick: you can pass an expression to an argument named finally, which runs whether an error was thrown or not (just like the on.exit function we saw when we were connecting to databases).
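
A small sketch of finally in action follows. The log_lines variable is a made-up stand-in for real cleanup work, such as closing a file or database connection:

```r
log_lines <- character()
result <- tryCatch(
  {
    log_lines <- c(log_lines, "started")
    stop("oh no")
  },
  error = function(e) NA,
  finally =
  {
    #runs after the error handler, and would also run if no error occurred
    log_lines <- c(log_lines, "cleaned up")
  }
)
result
## [1] NA
log_lines
## [1] "started"    "cleaned up"
```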

Despite having played with try and tryCatch, we still haven’t solved our problem: when looping over things, if an error is thrown, we want to keep the results of the iterations that worked.

To achieve this, we need to put try or tryCatch inside the loop:

lapply(
  to_convert,
  function(x)
  {
    tryCatch(
      as.data.frame(x),
      error = function(e) NULL
    )
  }
)
## $first
##    x
## a 61
## b 62
## c 63
## d 64
## e 65
##
## $second
##                 x
## 1  0.7071+0.7071i
## 2 -0.7071+0.7071i
## 3 -0.7071-0.7071i
## 4  0.7071-0.7071i
##
## $third
## NULL

Since this is a common piece of code, the plyr package contains a function, tryapply, that deals with exactly this case in a cleaner fashion:

library(plyr)
tryapply(to_convert, as.data.frame)
## $first
##    x
## a 61
## b 62
## c 63
## d 64
## e 65
##
## $second
##                 x
## 1  0.7071+0.7071i
## 2 -0.7071+0.7071i
## 3 -0.7071-0.7071i
## 4  0.7071-0.7071i

Eagle-eyed observers may notice that the failures are simply removed in this case.
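
If plyr isn’t available, a similar effect can be sketched in base R by returning NULL on failure and then filtering out the NULL elements. This approximates the behavior described above and is not plyr’s actual implementation; the names attempts and kept are illustrative:

```r
to_convert <- list(
  first  = sapply(letters[1:5], charToRaw),
  second = polyroot(c(1, 0, 0, 0, 1)),
  third  = list(x = 1:2, y = 3:5)
)
attempts <- lapply(
  to_convert,
  function(x) tryCatch(as.data.frame(x), error = function(e) NULL)
)
#keep only the elements that converted successfully
kept <- Filter(Negate(is.null), attempts)
names(kept)
## [1] "first"  "second"
```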

Debugging

All nontrivial software contains errors.[63] When problems happen, you need to be able to find where they occur, and hopefully find a way to fix them. This is especially true if it’s your own code. If the problem occurs in the middle of a simple script, you usually have access to all the variables, so it is trivial to locate the problem.

More often than not, problems occur somewhere deep inside a function inside another function inside another function. In this case, you need a strategy to inspect the state of the program at each level of the call stack. (“Call stack” is just jargon for the list of functions that have been called to get you to this point in the code.)

When an error is thrown, the traceback function tells you where the last error occurred. First, let’s define some functions in which the error can occur:

outer_fn <- function(x) inner_fn(x)
inner_fn <- function(x) exp(x)

Now let’s call outer_fn (which then calls inner_fn) with a bad input:

outer_fn(list(1))
## Error: non-numeric argument to mathematical function

traceback now tells us the functions that we called before tragedy struck (see Figure 16-2).

Figure 16-2. Call stack using traceback

In general, if it isn’t an obvious bug, we don’t know where in the call stack the problem occurred. One reasonable strategy is to start in the function where the error was thrown, and work our way up the stack if we need to. To do this, we need a way to stop execution of the code close to the point where the error was thrown. One way to do this is to add a call to the browser function just before the error point (we know where the error occurred because we used traceback):

inner_fn <- function(x)
{
  browser()     #execution pauses here
  exp(x)
}

browser halts execution when it is reached, giving us time to inspect the program. A really good idea in most cases is to call ls.str to see the values of all the variables that are in play at the time. In this case we see that x is a list, not a numeric vector, causing exp to fail.

An alternative strategy for spotting errors is to set the global error option. This strategy is preferable when the error lies inside someone else’s package, where it is harder to stick a call to browser. (You can alter functions inside installed packages using the fixInNamespace function. The changes persist until you close R.)

The error option accepts a function with no arguments, and is called whenever an error is thrown. As a simple example, we can set it to print a message after the error has occurred, as shown in Figure 16-3.

Figure 16-3. Overriding the global error option

While a sympathetic message may provide a sliver of consolation for the error, it isn’t very helpful in terms of fixing the problem. A much more useful alternative is provided in the recover function that ships with R. recover lets you step into any function in the call stack after an error has been thrown (see Figure 16-4).

Figure 16-4. Call stack using error = recover

You can also step through a function line by line using the debug function. This is a bit boring with trivial single-line functions like inner_fn and outer_fn, so we’ll test it on a more substantial offering. buggy_count, included in the learningr package, is a buggy version of the count function from the plyr package that fails in an obscure way when you pass it a factor. Pressing Enter at the command line without typing anything lets us step through it until we find the problem:

debug(buggy_count)
x <- factor(sample(c("male", "female"), 20, replace = TRUE))
buggy_count(x)

count (and by extension, our buggy_count) accepts a data frame or a vector as its first argument. If the df argument is a vector, then the function inserts it into a data frame.

Figure 16-5 shows what happens when we reach this part of the code. When df is a factor, we want it to be placed inside a data frame. Unfortunately, is.vector returns FALSE for factors, and the step is ignored. Factors aren’t considered to be vectors, because they have attributes other than names. What the code really should contain (and does in the proper version of plyr) is a call to is.atomic, which is TRUE for factors as well as other vector types, like numeric.

Figure 16-5. Debugging the buggy_count function

To exit the debugger, type Q at the command line. With the debug function, the debugger will be started every time that function is called. To turn off debugging, call undebug:

undebug(buggy_count)

As an alternative, use debugonce, which only calls the debugger the first time a function is called.[64]
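
If you lose track of which functions are flagged for debugging, the base function isdebugged will tell you. The sketch below only sets and clears the flag; it never calls the function, so the debugger itself doesn’t start. The variable names are illustrative:

```r
inner_fn <- function(x) exp(x)

debug(inner_fn)
flag_after_debug <- isdebugged(inner_fn)
undebug(inner_fn)
flag_after_undebug <- isdebugged(inner_fn)

c(flag_after_debug, flag_after_undebug)
## [1]  TRUE FALSE
```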

Testing

To make sure that your code isn’t buggy and awful, it is important to test it. Unit testing is the concept of testing small chunks of code; in R this means testing at the level of individual functions. (System or integration testing is the larger-scale testing of whole pieces of software, but that is more useful for application development than data analysis.)

Each time you change a function, you can break code in other functions that rely on it. This means that each time you change a function, you need to test everything that it could affect. Attempted manually, this is impossible, or at least time-consuming and boring enough that you won’t do it. Consequently, you need to automate the task. In R, you have two choices for this:

  1. RUnit has “xUnit” syntax, meaning that it’s very similar to Java’s JUnit, .NET’s NUnit, Python’s PyUnit, and a whole other family of unit testing suites. This makes it easiest to learn if you’ve done unit testing in any other language.
  2. testthat has its own syntax, and a few extra features. In particular, the caching of tests makes it much faster for large projects.

Let’s test the hypotenuse function we wrote when we first learned about functions in Functions. It uses the obvious algorithm that you might use for pen and paper calculations.[65] The function is included in the learningr package:

hypotenuse <- function(x, y)
{
  sqrt(x ^ 2 + y ^ 2)
}

RUnit

In RUnit, each test is a function that takes no inputs. Each test compares the actual result of running some code (in this case, calling hypotenuse) to an expected value, using one of the check* functions contained in the package. In this next example we use checkEqualsNumeric, since we are comparing two numbers:

library(RUnit)
test.hypotenuse.3_4.returns_5 <- function()
{
  expected <- 5
  actual <- hypotenuse(3, 4)
  checkEqualsNumeric(expected, actual)
}

Tip

There is no universal naming convention for tests, but RUnit looks for functions with names beginning with test by default. The convention used here is designed to maximize clarity. Tests take the name form of test.name_of_function.description_of_inputs.returns_a_value.

Sometimes we want to make sure that a function fails in the correct way. For example, we can test that hypotenuse fails if no inputs are provided:

test.hypotenuse.no_inputs.fails <- function()
{
  checkException(hypotenuse())
}

Many algorithms suffer loss of precision when given very small or very large inputs, so it is good practice to test those conditions. The smallest and largest positive numeric values that R can represent are given by the double.xmin and double.xmax components of the built-in .Machine constant:

.Machine$double.xmin
## [1] 2.225e-308
.Machine$double.xmax
## [1] 1.798e+308

For the small and large tests, we pick values close to these limits. In the case of small numbers, we need to manually tighten the tolerance of the test. By default, checkEqualsNumeric considers its test passed when the actual result is within about 1.5e-8 of the expected result. (The comparison is normally relative, but it falls back to absolute differences when the expected value is itself smaller than the tolerance, as it is here.) We set the tolerance to be a few orders of magnitude smaller than the inputs to make sure that the test fails appropriately:

test.hypotenuse.very_small_inputs.returns_small_positive <- function()
{
  expected <- sqrt(2) * 1e-300
  actual <- hypotenuse(1e-300, 1e-300)
  checkEqualsNumeric(expected, actual, tolerance = 1e-305)
}
test.hypotenuse.very_large_inputs.returns_large_finite <- function()
{
  expected <- sqrt(2) * 1e300
  actual <- hypotenuse(1e300, 1e300)
  checkEqualsNumeric(expected, actual)
}

There are countless more possible tests; for example, what happens if we pass missing values or NULL or infinite values or character values or vectors or matrices or data frames, or we expect an answer in non-Euclidean space? Thorough testing works your imagination hard. Unleash your inner two-year-old and contemplate breaking stuff. On this occasion, we’ll stop here. Save all your tests into a file; RUnit defaults to looking for files with names that begin with “runit” and have a .R file extension. These tests can be found in the tests directory of the learningr package.

Now that we have some tests, we need to run them. This is a two-step process.

First, we define a test suite with defineTestSuite. This function takes a string for a name (used in its output), and a path to the directory where your tests are contained. If you’ve named your test functions or files in a nonstandard way, you can provide a pattern to identify them:

test_dir <- system.file("tests", package = "learningr")
suite <- defineTestSuite("hypotenuse suite", test_dir)

The second step is to run them with runTestSuite (additional line breaks have been added here as needed, to fit the formatting of the book):

runTestSuite(suite)
##
##
## Executing test function test.hypotenuse.3_4.returns_5  ...
## done successfully.
##
##
##
## Executing test function test.hypotenuse.no_inputs.fails  ...
## done successfully.
##
##
##
## Executing test function
## test.hypotenuse.very_large_inputs.returns_large_finite  ...
## Timing stopped at: 0 0 0 done successfully.
##
##
##
## Executing test function
## test.hypotenuse.very_small_inputs.returns_small_positive  ...
## Timing stopped at: 0 0 0 done successfully.
## Number of test functions: 4
## Number of errors: 0
## Number of failures: 2

This runs each test that it finds and displays whether it passed, failed, or threw an error. In this case, you can see that the small and large input tests failed. So what went wrong?

The problem with our algorithm is that we have to square each input. Squaring big numbers makes them larger than the largest (double-precision) number that R can represent, so the result comes back as infinity. Squaring very small numbers makes them even smaller, so that R thinks they are zero. (There are better algorithms that avoid this problem; see the ?hypotenuse help page for links to a discussion of better algorithms for real-world use.)
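
To see the overflow and underflow directly, and one common way around them, consider the following sketch. The hypotenuse_scaled function is a hypothetical illustration of the rescaling idea (factor out the larger side before squaring); it is not necessarily the algorithm that the help page points to:

```r
(1e300) ^ 2      #too big to store as a double
## [1] Inf
(1e-300) ^ 2     #too small to store as a double
## [1] 0

hypotenuse_scaled <- function(x, y)
{
  big   <- pmax(abs(x), abs(y))
  small <- pmin(abs(x), abs(y))
  #only the ratio small / big is squared, and that lies between 0 and 1
  ifelse(big == 0, 0, big * sqrt(1 + (small / big) ^ 2))
}
hypotenuse_scaled(3, 4)
## [1] 5
is.finite(hypotenuse_scaled(1e300, 1e300))
## [1] TRUE
hypotenuse_scaled(1e-300, 1e-300) > 0
## [1] TRUE
```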

RUnit has no built-in checkWarning function to test for warnings. To test that a warning has been thrown, we need a trick: we set the warn option to 2 so that warnings become errors, and then restore it to its original value when the test function exits using on.exit. Recall that code inside on.exit is run when a function exits, regardless of whether it completed successfully or an error was thrown:

test.log.minus1.throws_warning <- function()
{
  old_ops <- options(warn = 2) #warnings become errors
  on.exit(options(old_ops))    #restore old behavior
  checkException(log(-1))
}

testthat

Though testthat has a different syntax, the principles are almost the same. The main difference is that rather than each test being a function, it is a call to one of the expect_* functions in the package. For example, expect_equal is the equivalent of RUnit’s checkEqualsNumeric. The translated tests (also available in the tests directory of the learningr package) look like this:

library(testthat)
expect_equal(hypotenuse(3, 4), 5)
expect_error(hypotenuse())
expect_equal(hypotenuse(1e-300, 1e-300), sqrt(2) * 1e-300, tolerance = 1e-305)
expect_equal(hypotenuse(1e300, 1e300), sqrt(2) * 1e300)

To run this, we call test_file with the name of the file containing tests, or test_dir with the name of the directory containing the files containing the tests. Since we have only one file, we’ll use test_file:

filename <- system.file(
  "tests",
  "testthat_hypotenuse_tests.R",
  package = "learningr"
)
test_file(filename)
## ..12
##
## 1. Failure: (unknown) -----------------------------------------------------
## learningr::hypotenuse(1e-300, 1e-300) not equal to sqrt(2) * 1e-300
## Mean relative difference: 1
##
## 2. Failure: (unknown) -----------------------------------------------------
## learningr::hypotenuse(1e+300, 1e+300) not equal to sqrt(2) * 1e+300
## Mean relative difference: Inf

There are two variations for running the tests: test_that tests code that you type at the command line (or, more likely, copy and paste), and test_package runs all tests from a package, making it easier to test nonexported functions.
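
As a sketch of the first variation, test_that takes a description and a block of expectations, so related checks can be grouped under one name. The description string here is made up, and hypotenuse is redefined so that the example stands alone:

```r
library(testthat)

hypotenuse <- function(x, y)
{
  sqrt(x ^ 2 + y ^ 2)
}

test_that(
  "hypotenuse handles a 3-4-5 triangle and rejects missing inputs",
  {
    expect_equal(hypotenuse(3, 4), 5)
    expect_error(hypotenuse())
  }
)
```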

Unlike with RUnit, warnings can be tested for directly via expect_warning:

expect_warning(log(-1))

Magic

The source code that we write, as it exists in a text editor, is just a bunch of strings. When we run that code, R needs to interpret what those strings contain and perform the appropriate action. It does that by first turning the strings into one of several language variable types. And sometimes we want to do the opposite thing, converting language variables into strings.

Both these tasks are rather advanced, dark magic. As is the case with magic in every movie ever, if you use it without understanding what you are doing, you’ll inevitably suffer nasty, unexpected consequences. On the other hand, used knowledgeably and sparingly, there are some useful tricks that you can put up your sleeve.

Turning Strings into Code

Whenever you type a line of code at the command line, R has to turn that string into something it understands. Here’s a simple call to the arctangent function:

atan(c(-Inf, -1, 0, 1, Inf))
## [1] -1.5708 -0.7854  0.0000  0.7854  1.5708

We can see what happens to this line of code in slow motion by using the quote function. quote takes a function call like the one in the preceding line, and returns an object of class call, which represents an “unevaluated function call”:

(quoted_r_code <- quote(atan(c(-Inf, -1, 0, 1, Inf))))
## atan(c(-Inf, -1, 0, 1, Inf))
class(quoted_r_code)
## [1] "call"

The next step that R takes is to evaluate that call. We can mimic this step using the eval function:

eval(quoted_r_code)
## [1] -1.5708 -0.7854  0.0000  0.7854  1.5708

The general case, then, is that to execute code that you type, R does something like eval(quote(the stuff you typed at the command line)).

To understand the call type a little better, let’s convert it to a list:

as.list(quoted_r_code)
## [[1]]
## atan
##
## [[2]]
## c(-Inf, -1, 0, 1, Inf)

The first element is the function that was called, and any additional elements contain the arguments that were passed to it.
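
Because a call behaves like a list, its pieces can be inspected and even replaced. The following sketch swaps the function being called for tanh (an arbitrary choice for illustration) before evaluating:

```r
quoted_r_code <- quote(atan(c(-Inf, -1, 0, 1, Inf)))
class(quoted_r_code[[1]])              #the function part is stored as a name
## [1] "name"

quoted_r_code[[1]] <- as.name("tanh")  #swap in a different function
swapped_result <- eval(quoted_r_code)  #same as tanh(c(-Inf, -1, 0, 1, Inf))
```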

One important thing to remember is that in R, more or less everything is a function. That’s a slight exaggeration, but operators like +; language constructs like switch, if, and for; and assignment and indexing are functions:

vapply(
  list(`+`, `if`, `for`, `<-`, `[`, `[[`),
  is.function,
  logical(1)
)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE

The upshot of this is that anything that you type at the command line is really a function call, which is why this input is turned into call objects.

All of this was a long-winded way of saying that sometimes we want to take text that is R code and get R to execute it. In fact, we’ve already seen two functions that do exactly that for special cases: assign takes a string and assigns a value to a variable with that name, and its reverse, get, retrieves a variable based upon a string input.
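
As a quick reminder of that pair, here is a minimal sketch; the variable name my_variable is arbitrary:

```r
assign("my_variable", 1:5)   #same effect as my_variable <- 1:5
get("my_variable")           #same effect as typing my_variable
## [1] 1 2 3 4 5
```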

Rather than just limiting ourselves to assigning and retrieving variables, we might occasionally decide that we want to take an arbitrary string of R code and execute it. You may have noticed that when we use the quote function, we just type the R code directly into it, without wrapping it in—ahem—quotes. If our input is a string (as in a character vector of length one), then we have a slightly different problem: we must “parse” the string. Naturally, this is done with the parse function.

parse returns an expression object rather than a call. Before you get frightened, note that an expression is basically just a list of calls.

Note

The exact nature of calls and expressions is deep, dark magic, and I don’t want to be responsible for the ensuing zombie apocalypse when you try to raise the dead using R. If you are interested in arcana, read Chapter 6 of the R Language Definition manual that ships with R.

When we call parse in this way, we must explicitly name the text argument:

parsed_r_code <- parse(text = "atan(c(-Inf, -1, 0, 1, Inf))")
class(parsed_r_code)
## [1] "expression"

Just as with the quoted R code, we use eval to evaluate it:

eval(parsed_r_code)
## [1] -1.5708 -0.7854  0.0000  0.7854  1.5708
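
To see the “list of calls” idea, we can parse a string containing more than one statement. In this sketch (the input string is arbitrary), each element of the expression is a call, and eval runs them in turn, returning the value of the last one:

```r
parsed <- parse(text = "atan(1); sqrt(16)")
length(parsed)
## [1] 2
class(parsed[[1]])
## [1] "call"
eval(parsed)
## [1] 4
```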

Warning

This sort of mucking about with evaluating strings is a handy trick, but the resulting code is usually fragile and fiendish to debug, making your code unmaintainable. This is the zombie (code) apocalypse mentioned above.

Turning Code into Strings

There are a few occasions when we want to solve the opposite problem: turning code into a string. The most common reason for this is to use the name of a variable that was passed into a function. The base histogram-drawing function, hist, includes a default title that tells you the name of the data variable:

random_numbers <- rt(1000, 2)
hist(random_numbers)

To replicate this technique ourselves, we need two functions: substitute and deparse. substitute takes some code and returns a language object. That usually means a call, like we would have created using quote, but occasionally it’s a name object, which is a special type that holds variable names. (Don’t worry about the details, this section is called “Magic” for a reason.)

The next step is to turn this language object into a string. This is called deparsing. The technique can be very useful for providing helpful error messages when you check user inputs to functions. Let’s see the deparse-substitute combination in action:

divider <- function(numerator, denominator)
{
  if(denominator == 0)
  {
    denominator_name <- deparse(substitute(denominator))
    warning("The denominator, ", sQuote(denominator_name), ", is zero.")
  }
  numerator / denominator
}
top <- 3
bottom <- 0
divider(top, bottom)
## Warning: The denominator, 'bottom', is zero.
## [1] Inf

substitute has one more trick up its sleeve when used in conjunction with eval. eval lets you pass it an environment or a data frame, so you can tell R where to look to evaluate the expression.

As a simple example, we can use this trick to retrieve the levels of the Gender column of the hafu dataset:

eval(substitute(levels(Gender)), hafu)
## [1] "F" "M"

This is exactly how the with function works:

with(hafu, levels(Gender))
## [1] "F" "M"
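
To show the idea without relying on any package, here is a sketch of a stripped-down with-like function. Both my_with and the heights data frame are invented for this example; the real with function has more features:

```r
my_with <- function(data, expr)
{
  #substitute captures the code that was typed; eval runs it inside 'data',
  #falling back to the caller's environment for anything not found there
  eval(substitute(expr), data, parent.frame())
}

heights <- data.frame(
  gender    = factor(c("F", "M", "F")),
  height_cm = c(162, 180, 171)
)
my_with(heights, levels(gender))
## [1] "F" "M"
my_with(heights, mean(height_cm))
## [1] 171
```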

In fact, many functions use the technique: subset uses it in several places, and lattice plots use the trick to parse their formulae. There are a few variations on the trick described in Thomas Lumley’s “Standard nonstandard evaluation rules.”

Object-Oriented Programming

Most R code that we’ve seen so far is functional-programming-inspired imperative programming. That is, functions are first-class objects, but we usually end up with a data-analysis script that executes one line at a time.

In a few circumstances, it is useful to use an object-oriented programming (OOP) style. This means that data is stored inside a class along with the functions that are allowed to act on it. It is an excellent tool for managing complexity in larger programs, and is particularly suited to GUI development (in R or elsewhere). See Michael Lawrence’s Programming Graphical User Interfaces in R for more on that topic.

R has six (count ’em) different OOP systems, but don’t let that worry you: there are only two of them that you’ll need for new projects.

Three systems are built into R:

  1. S3 is a lightweight system for overloading functions (i.e., calling a different version of the function depending upon the type of input).
  2. S4 is a fully featured OOP system, but it’s clunky and tricky to debug. Only use it for legacy code.
  3. Reference classes are the modern replacement for S4 classes.

Three other systems are available in add-on packages (but for new code, you will usually want to use reference classes instead):

  1. proto is a lightweight wrapper around environments for prototype-based programming.
  2. R.oo extends S3 into a fully fledged OOP system.
  3. OOP is a precursor to reference classes, now defunct.

Tip

In many object-oriented programming languages, functions are called methods. In R, the two words are interchangeable, but “method” is often used in an OOP context.

S3 Classes

Sometimes we want a function to behave differently depending upon the type of input. A classic example is the print function, which gives a different style of output for different variables. S3 lets us call a different function for printing different kinds of variables, without having to remember the names of each one.

The print function is very simple—just one line, in fact:

print
## function (x, ...)
## UseMethod("print")
## <bytecode: 0x0000000018fad228>
## <environment: namespace:base>

It takes an input, x (and ...; the ellipsis is necessary), and calls UseMethod("print"). UseMethod checks the class of x and looks for another function named print.class_of_x, calling it if it is found. If it can’t find a function of that name, it tries to call print.default.

For example, if we want to print a Date variable, then we can just type:

today <- Sys.Date()
print(today)
## [1] "2013-07-17"

print calls the Date-specific function print.Date:

print.Date
## function (x, max = NULL, ...)
## {
##     if (is.null(max))
##         max <- getOption("max.print", 9999L)
##     if (max < length(x)) {
##         print(format(x[seq_len(max)]), max = max, ...)
##         cat(" [ reached getOption(\"max.print\") -- omitted",
##             length(x) - max, "entries ]\n")
##     }
##     else print(format(x), max = max, ...)
##     invisible(x)
## }
## <bytecode: 0x0000000006dc19f0>
## <environment: namespace:base>

Inside print.Date, our date is converted to a character vector (via format), and then print is called again. There is no print.character function, so this time UseMethod delegates to print.default, at which point our date string appears in the console.

Warning

If a class-specific method can’t be found, and there is no default method, then an error is thrown.
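
As a small sketch of that failure, consider a generic that has one method and no default. The area generic and the square class here are hypothetical, made up for this example.

```r
# A hypothetical generic with a single method and no default
area <- function(shape, ...)
{
  UseMethod("area")
}

area.square <- function(shape, ...)
{
  shape$side ^ 2
}

sq <- structure(list(side = 3), class = "square")
area(sq)  # dispatches to area.square

# Neither area.character nor area.default exists, so dispatch fails.
# tryCatch captures the error object instead of halting the script.
result <- tryCatch(
  area("circle"),
  error = function(e) e
)
conditionMessage(result)
```

Adding an area.default method, even one that only calls stop with a friendlier message, lets you control what the user sees in this situation.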

You can see all the available methods for a function with the methods function. The print function has over 100 methods, so here we just show the first few:

head(methods(print))
## [1] "print.abbrev"      "print.acf"         "print.AES"
## [4] "print.anova"       "print.Anova"       "print.anova.loglm"
methods(mean)
## [1] mean.Date     mean.default  mean.difftime mean.POSIXct  mean.POSIXlt
## [6] mean.times*   mean.yearmon* mean.yearqtr* mean.zoo*
##
##    Non-visible functions are asterisked
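
Typing the name of an asterisked method at the prompt won't find it, because it isn't exported from its package. One way to retrieve a method regardless of visibility is getS3method, sketched here with the print method for dates so that the example doesn't depend on any extra packages being installed.

```r
# Retrieve the print method for Date objects by generic and class name
print_date <- getS3method("print", "Date")
is.function(print_date)

# With optional = TRUE, a missing method gives NULL rather than an error
getS3method("print", "no_such_class", optional = TRUE)

# methods can also search by class: which generics have a Date method?
head(methods(class = "Date"))
```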

Tip

If you use dots in your function names, like data.frame, then it can get confusing as to which S3 method gets called. For example, print.data.frame could mean a print.data method for a frame input, as well as the correct sense of a print method for a data.frame object. Consequently, using lower_under_case or lowerCamelCase is preferred for new function names.

Reference Classes

Reference classes are closer to a classical OOP system than S3 and S4, and should be moderately intuitive to anyone who has used classes in C++ or its derivatives.

Note

A class is the general template for how the variables should be structured. An object is a particular instance of the class. For example, 1:10 is an object of class integer.

The setRefClass function creates the template for a class. In R terminology, it’s called a class generator. In some other languages, it would be called a class factory.

Let’s try to build a class for a 2D point as an example. A call to setRefClass looks like this:

my_class_generator <- setRefClass(
  "MyClass",
  fields = list(
    #data variables are defined here
  ),
  methods = list(
    #functions to operate on that data go here
    initialize = function(...)
    {
    #initialize is a special function called
    #when an object is created.
    }
  )
)

Our class needs x and y coordinates to store its location, and we want these to be numeric.

Note

If we didn’t care about the class of x and y, we could declare them with the special value "ANY".

In the following example, we declare x and y to be numeric:

point_generator <- setRefClass(
  "point",
  fields = list(
    x = "numeric",
    y = "numeric"
  ),
  methods = list(
    #TODO
  )
)

This means that if we try to assign them values of another type, an error will be thrown. Purposely restricting user input may sound counterintuitive, but it can save you from having more obscure bugs further down the line.
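
Here is a short sketch of that type check in action. To keep the example self-contained it defines a separate, hypothetical class called typed_point with the same two numeric fields, and it relies on the default initialize method, which accepts named field values.

```r
library(methods)

# A hypothetical class with the same typed fields as the point class
typed_point_generator <- setRefClass(
  "typed_point",
  fields = list(
    x = "numeric",
    y = "numeric"
  )
)

p <- typed_point_generator$new(x = 1, y = 2)
p$x <- 10        # fine: 10 is numeric

# Assigning a character value to a numeric field throws an error
outcome <- tryCatch(
  {
    p$x <- "ten"
    "assignment succeeded"
  },
  error = function(e) "assignment failed"
)
outcome
p$x              # the field keeps its previous value
```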

Next we need to add an initialize method. This is called every time we create a point object. This method takes x and y input numbers and assigns them to our x and y fields. There are three interesting things to note about it:

  1. If the first line of a method is a string, then it is considered to be help text for that method.
  2. The global assignment operator, <<-, is used to assign to a field. Local assignment (using <-) just creates a local variable inside the method.
  3. It is best practice to let initialize work without being passed any arguments, since it makes inheritance easier, as we’ll see in a moment. This is why the x and y arguments have default values.[66]

With the initialize method, our class generator now looks like this:

point_generator <- setRefClass(
  "point",
  fields = list(
    x = "numeric",
    y = "numeric"
  ),
  methods = list(
    initialize = function(x = NA_real_, y = NA_real_)
    {
      "Assign x and y upon object creation."
      x <<- x
      y <<- y
    }
  )
)
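
The difference between the two assignment operators, the second point in the list above, is easy to get wrong, so here is a minimal sketch. The counter class and its method names are hypothetical. R may warn about the local assignment when the class is defined, which is why the definition is wrapped in suppressWarnings.

```r
library(methods)

# A hypothetical class contrasting the two assignment operators.
# suppressWarnings hides the warning that R may give about the
# local assignment to a field name in incrementLocally.
counter_generator <- suppressWarnings(setRefClass(
  "counter",
  fields = list(
    count = "numeric"
  ),
  methods = list(
    initialize = function(count = 0)
    {
      "Set the starting count."
      count <<- count
    },
    incrementLocally = function()
    {
      "Uses <-, so only a local variable changes."
      count <- count + 1
      invisible(.self)
    },
    increment = function()
    {
      "Uses <<-, so the field itself changes."
      count <<- count + 1
      invisible(.self)
    }
  )
))

a_counter <- counter_generator$new()
a_counter$incrementLocally()
a_counter$count   # still 0: the field was never touched
a_counter$increment()
a_counter$count   # now 1
```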

Our point class generator is finished, so we can now create a point object. Every generator has a new method for this purpose. The new method calls initialize (if it exists) as part of the object creation process:

(a_point <- point_generator$new(5, 3))
## Reference class object of class "point"
## Field "x":
## [1] 5
## Field "y":
## [1] 3

Generators also have a help method that returns the help string for a method that you specify:

point_generator$help("initialize")
## Call:
## $initialize(x = , y = )
##
##
## Assign x and y upon object creation.

You can provide a more traditional interface to object-oriented code by wrapping class methods inside other functions. This can be useful if you want to distribute your code to other people without having to teach them about OOP:

create_point <- function(x, y)
{
  point_generator$new(x, y)
}

At the moment, the class isn’t very interesting because it doesn’t do anything. Let’s redefine it with some more methods:

point_generator <- setRefClass(
  "point",
  fields  = list(
    x = "numeric",
    y = "numeric"
  ),
  methods = list(
    initialize         = function(x = NA_real_, y = NA_real_)
    {
      "Assign x and y upon object creation."
      x <<- x
      y <<- y
    },
    distanceFromOrigin = function()
    {
      "Euclidean distance from the origin"
      sqrt(x ^ 2 + y ^ 2)
    },
    add                = function(point)
    {
      "Add another point to this point"
      x <<- x + point$x
      y <<- y + point$y
      .self
    }
  )
)

These additional methods belong to point objects, unlike new and help, which belong to the class generator (in OOP terminology, new and help are static methods):

a_point <- create_point(3, 4)
a_point$distanceFromOrigin()
## [1] 5
another_point <- create_point(4, 2)
(a_point$add(another_point))
## Reference class object of class "point"
## Field "x":
## [1] 7
## Field "y":
## [1] 6

As well as new and help, class generators have a few more methods. fields and methods list the fields and methods of that class, respectively, and lock makes a field read-only:

point_generator$fields()
##         x         y
## "numeric" "numeric"
point_generator$methods()
##  [1] "add"                "callSuper"          "copy"
##  [4] "distanceFromOrigin" "export"             "field"
##  [7] "getClass"           "getRefClass"        "import"
## [10] "initFields"         "initialize"         "show"
## [13] "trace"              "untrace"            "usingMethods"
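
The lock method isn't shown above, so here is a sketch of how it might be used. The record class is hypothetical, and the example assumes that lock is called on the generator before any objects are created and that the locked field is given its one permitted value during object creation.

```r
library(methods)

# A hypothetical class whose id field should not change once set
record_generator <- setRefClass(
  "record",
  fields = list(
    id    = "numeric",
    label = "character"
  )
)

# Lock the field on the generator before creating any objects
record_generator$lock("id")

a_record <- record_generator$new(id = 1, label = "first")
a_record$label <- "renamed"   # unlocked fields can still be changed

# A further assignment to the locked field throws an error
outcome <- tryCatch(
  {
    a_record$id <- 2
    "assignment succeeded"
  },
  error = function(e) "assignment failed"
)
outcome
a_record$id   # unchanged
```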

Some other methods can be called either from the generator object or from instance objects. show prints the object, trace and untrace let you use the trace function on a method, export converts the object to another class type, and copy makes a copy.
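
The copy method matters more than it might first appear. Reference class objects are modified in place, as the add method showed, so ordinary assignment gives you a second name for the same object rather than a duplicate. The following sketch uses a hypothetical pair class to show the difference.

```r
library(methods)

# A hypothetical class that relies on the default initialize method
pair_generator <- setRefClass(
  "pair",
  fields = list(
    x = "numeric",
    y = "numeric"
  )
)

original  <- pair_generator$new(x = 1, y = 2)
alias     <- original          # a second name for the same object
duplicate <- original$copy()   # an independent object

original$x <- 100
alias$x       # 100: alias and original share their fields
duplicate$x   # 1: the copy is unaffected
```

Note that copy creates the new object without passing any arguments to initialize, which is another reason to give the arguments of a custom initialize method default values.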

Reference classes support inheritance, where classes can have children to extend their functionality. For example, we can create a three-dimensional point class that contains our original point class, but includes an extra z coordinate.

A class inherits fields and methods from another class by using the contains argument:

three_d_point_generator <- setRefClass(
  "three_d_point",
  fields   = list(
    z = "numeric"
  ),
  contains = "point",         #this line lets us inherit
  methods  = list(
    initialize = function(x = NA_real_, y = NA_real_, z = NA_real_)
    {
      "Assign x, y, and z upon object creation."
      x <<- x
      y <<- y
      z <<- z
    }
  )
)
a_three_d_point <- three_d_point_generator$new(3, 4, 5)

At the moment, our distanceFromOrigin function is wrong, since it doesn’t take the z dimension into account:

a_three_d_point$distanceFromOrigin() #wrong!
## [1] 5

We need to override it in order for it to make sense in the new class. This is done by adding a method with the same name to the class generator:

three_d_point_generator <- setRefClass(
  "three_d_point",
  fields   = list(
    z = "numeric"
  ),
  contains = "point",
  methods  = list(
    initialize = function(x = NA_real_, y = NA_real_, z = NA_real_)
    {
      "Assign x, y, and z upon object creation."
      x <<- x
      y <<- y
      z <<- z
    },
    distanceFromOrigin = function()
    {
      "Euclidean distance from the origin"
      sqrt(x ^ 2 + y ^ 2 + z ^ 2)
    }
  )
)

To use the updated definition, we need to recreate our point:

a_three_d_point <- three_d_point_generator$new(3, 4, 5)
a_three_d_point$distanceFromOrigin()
## [1] 7.071

Sometimes we want to use methods from the parent class (a.k.a. superclass). The callSuper method does exactly this, so we could have written our 3D distanceFromOrigin (inefficiently) like this:

distanceFromOrigin = function()
{
  "Euclidean distance from the origin"
  two_d_distance <- callSuper()
  sqrt(two_d_distance ^ 2 + z ^ 2)
}
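
The fragment above only makes sense inside a class definition, so here is a complete sketch that you can run. It uses two hypothetical classes, flat_point and deep_point, rather than redefining the classes from earlier in the chapter, and it also uses callSuper inside initialize so that the parent class takes care of its own fields.

```r
library(methods)

# Hypothetical parent class, equivalent to a 2D point
flat_point_generator <- setRefClass(
  "flat_point",
  fields = list(
    x = "numeric",
    y = "numeric"
  ),
  methods = list(
    initialize = function(x = NA_real_, y = NA_real_)
    {
      "Assign x and y upon object creation."
      x <<- x
      y <<- y
    },
    distanceFromOrigin = function()
    {
      "Euclidean distance from the origin"
      sqrt(x ^ 2 + y ^ 2)
    }
  )
)

# Hypothetical child class that adds a z coordinate
deep_point_generator <- setRefClass(
  "deep_point",
  fields   = list(
    z = "numeric"
  ),
  contains = "flat_point",
  methods  = list(
    initialize = function(x = NA_real_, y = NA_real_, z = NA_real_)
    {
      "Let the parent assign x and y, then assign z."
      callSuper(x, y)
      z <<- z
    },
    distanceFromOrigin = function()
    {
      "Euclidean distance from the origin"
      two_d_distance <- callSuper()
      sqrt(two_d_distance ^ 2 + z ^ 2)
    }
  )
)

a_deep_point <- deep_point_generator$new(3, 4, 12)
a_deep_point$distanceFromOrigin()
```

Inside each method, callSuper refers to the parent's method of the same name, so the first call runs the flat_point initialize and the second runs the flat_point distanceFromOrigin. For the point (3, 4, 12) the 2D distance is 5, so the result is 13.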

OOP is a big topic, and even limited to reference classes, it’s worth a book in itself. John Chambers (creator of the S language, R Core member, and author of the reference classes code) is currently writing a book on OOP in R. Until that materializes, the ?ReferenceClasses help page is the definitive reference-class reference.

Summary

  • R has three levels of feedback about problems: messages, warnings, and errors.
  • Wrapping code in a call to try or tryCatch lets you control how you handle errors.
  • The debug function and its relatives help you debug functions.
  • The RUnit and testthat packages let you do unit testing.
  • R code consists of language objects known as calls and expressions.
  • You can turn strings into language objects and vice versa.
  • There are six different object-oriented programming systems in R, though only S3 and reference classes are needed for new projects.

Test Your Knowledge: Quiz

Question 16-1
If your code generates more than 10 warnings, which function would you use to view them?
Question 16-2
What is the class of the return value from a call to try if an error was thrown?
Question 16-3
To test that an error is correctly thrown in an RUnit test, you call the checkException function. What is the testthat equivalent?
Question 16-4
Which two functions do you need to mimic executing code typed at the command line?
Question 16-5
How do you make the print function do different things for different types of input?

Test Your Knowledge: Exercises

Exercise 16-1
The harmonic mean is defined as the reciprocal of the arithmetic mean of the reciprocal of the data, or 1 / mean(1 / x), where x contains positive numbers. Write a harmonic mean function that gives appropriate feedback when the input is not numeric or contains nonpositive values. [10]
Exercise 16-2
Using either RUnit or testthat, write some tests for your harmonic mean function. You should check that the harmonic mean of 1, 2, and 4 equals 12 / 7; that passing no inputs throws an error; that passing missing values behaves correctly; and that it behaves as you intended for nonnumeric and nonpositive inputs. Keep testing until all the tests pass! [15]
Exercise 16-3
Modify your harmonic mean function so that the return value has the class harmonic. Now write an S3 print method for this class that displays the message “The harmonic mean is y,” where y is the harmonic mean. [10]


[61] OK, connecting to a file isn’t wrestling an angry bear, but it’s high-risk in programming terms.

[62] Don’t underestimate the importance of pretty code. You’ll spend more time reading code than writing it.

[63] Space shuttle software was reputed to contain just one bug in 420,000 lines of code, but that level of formal development methodology, code peer-reviewing, and extensive testing doesn’t come cheap.

[64] As Tobias Verbeke of Open Analytics once quipped, “debugonce is a very optimistic function. I think debugtwice might have been better.”

[65] If you’ve just recoiled in horror at the phrase “pen and paper calculations,” congratulations! You are well on your way to becoming an R user.

[66] In case you were wondering, NA_real_ is a missing number. Usually for missing values we just use NA and let R figure out the type that it needs to be, but in this case, because we specified that the fields must be numeric, we need to explicitly state the type.
