Chapter 10

Debugging Your Code

In This Chapter

• Discovering what warnings tell you

• Reading errors correctly

• Finding the bugs in your code

• Optimizing your debugging strategies

To err is human, and programmers fall into that “human” category as well (even though we like to believe otherwise!). Nobody manages to write code without errors, so instead of wondering if you have errors in your code, you should ask yourself where you have errors in your code. In this chapter, you discover some general strategies and specific tools to find that out.

Knowing What to Look For

A bug is simply another word for some kind of error in your program. So, debugging doesn’t involve insecticides; it just means getting rid of all types of semantic and/or logic errors in your functions.

Before you start hunting down bugs, you have to know what you’re looking for. In general, you can divide errors in your code into three different categories:

• Syntax errors: If you write code that R can’t understand, you have syntax errors. Syntax errors always result in an error message and often are caused by misspelling a function or forgetting a bracket.

• Semantic errors: If you write correct code that doesn’t do what you think it does, you have a semantic error. The code itself is correct, but the outcome of that line of code is not. It may, for example, return another type of object than you expect. If you use that object further on, it won’t be the type you think it is and your code will fail there.

• Logic errors: Probably the hardest errors to find are those in the logic of your code. Your code works, it doesn’t generate any error or warning, but it still doesn’t return the result you expect. The mistake is not in the code itself, but in the logic it executes.

This may seem like a small detail, but finding different types of bugs requires different strategies. Often, you can easily locate a syntax error by simply reading the error messages, but semantic errors pose a whole different challenge and logic errors can hide in your code without your being aware they exist.
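As a rough illustration of the three categories, here is a small sketch. The snippets and object names (syntax.try, total, wrong.mean) are invented for this example only:

```r
# Syntax error: R can't even parse this code (the closing parenthesis
# is missing). parse() is wrapped in try() so the script keeps running.
syntax.try <- try(parse(text = "mean(c(1, 2, 3)"), silent = TRUE)
class(syntax.try)

# Semantic error: the code is valid, but paste() returns a character
# value, so using the result as a number fails further on.
total <- paste(1, 2)
semantic.try <- try(total + 1, silent = TRUE)
class(semantic.try)

# Logic error: the code runs without any error or warning, but the
# formula is wrong (it divides by a fixed 2 instead of by length(x)).
wrong.mean <- function(x) sum(x) / 2
wrong.mean(c(1, 2, 3))
mean(c(1, 2, 3))
```

Only the first two cases produce a message at all; the last one silently returns a wrong value, which is why logic errors are so hard to spot.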

Reading Errors and Warnings

If something goes wrong with your code, R tells you. That can happen in two ways:

• The code keeps on running until the end, and when the code is finished, R prints out a warning message.

• The code stops immediately because R can’t carry it out, and R prints out an error message.

We have to admit it: These error messages can range from mildly confusing to completely incomprehensible if you’re not used to them. But it doesn’t have to stay that way. When you get familiar with the error and warning messages from R, you can quickly tell what’s wrong.

Reading error messages

Let’s take a look at such an error message. If you try the following code, you get this more or less clear error message:

> "a" + 1

Error in "a" + 1 : non-numeric argument to binary operator

You get two bits of information in this error message. First, the part "a" + 1 tells you which code caused the error. Then the message tells you what the error is. In this case, you used a non-numeric argument (the character "a") in combination with a binary operator (the + sign).

Remember: R always tells you in which code the error occurs, so in many cases you know where you have to start looking.

Error messages aren’t always that clear. Take a look at the following example:

> data.frame(1:10,10:1,)

Error in data.frame(1:10, 10:1, ) : argument is missing, with no default

To what argument does this error refer? Actually, it refers to an empty argument you provided for the function. After the second vector, there’s a comma that shouldn’t be there. A small typing error, but R expects another argument after that comma and doesn’t find one.

Tip: If you don’t immediately understand an error message, take a closer look at the things the error message is talking about. Chances are, you just typed something wrong there.
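If you want to look at the two parts of an error message separately, one possible approach is to capture the error with tryCatch() and inspect it. This is only a sketch; the object name err is made up for the example:

```r
# Capture the error object instead of letting it stop the code
err <- tryCatch("a" + 1, error = function(e) e)

conditionCall(err)     # the code in which the error occurred
conditionMessage(err)  # the description of what went wrong
```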

Caring about warnings (or not)

You can’t get around errors, because they just stop your code. Warnings, on the other hand, are a whole different beast. Even if R throws a warning, it continues to execute the code regardless. So, you can ignore warnings, but in general that’s a pretty bad idea. Warnings often are the only sign you have that your code has some semantic or logic error.

For example, you could’ve forgotten about the ifelse() function discussed in Chapter 9 and tried something like the following example:

> x <- 1:10

> y <- if (x < 5 ) 0 else 1

Warning message:

In if (x < 5) 0 else 1 :

  the condition has length > 1 and only the first element will be used

This warning points at a semantic error: if expects a single TRUE or FALSE value, but you provided a whole vector. (In R 4.2.0 and later, this situation produces an error rather than a warning.) Note that, just like errors, warnings tell you in general which code has generated the warning.
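Assuming the intention was to get 0 wherever x is smaller than 5 and 1 everywhere else, the vectorized ifelse() function mentioned earlier does the job without any warning:

```r
x <- 1:10

# ifelse() checks the condition for every element of x
y <- ifelse(x < 5, 0, 1)
y
```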

Here is another warning that pops up regularly and may point to a semantic or logic error in your code:

> x <- 4

> sqrt(x - 5)

[1] NaN

Warning message:

In sqrt(x - 5) : NaNs produced

Because x - 5 is negative when x is 4, R cannot calculate the square root and warns you that the square root of a negative number is not a number (NaN).

Technical Stuff: If you’re a mathematician, you may point out that the square root of –1 is the imaginary unit, which R writes as 0+1i. R can, in fact, do calculations on complex numbers, but then you have to define your variables as complex numbers. You can check, for example, the Help file ?complex for more information.
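A minimal sketch of the same calculation with a complex number, using as.complex() to define x:

```r
# Define x as a complex number instead of an ordinary numeric value
x <- as.complex(4)

# No warning this time: the result is the imaginary unit
root <- sqrt(x - 5)
root
```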

Although most warnings result from either semantic or logic errors in your code, even a simple syntax error can generate a warning instead of an error. If you want to plot some points in R, you use the plot() function, as shown in Chapter 16. It takes an argument col to specify the color of the points, but you could mistakenly try to color the points using the following:

> plot(1:10, 10:1, color='green')

If you try this, you get six warning messages at once, all telling you that color is probably not the argument name you were looking for:

Warning messages:

1: In plot.window(...) : "color" is not a graphical parameter

2: In plot.xy(xy, type, ...) : "color" is not a graphical parameter

....

Notice that the warning messages don’t point toward the code you typed at the command line; instead, they point to functions you never used before, like plot.window() and plot.xy(). Remember: You can pass arguments from one function to another using the dots argument (see Chapter 8). That’s exactly what plot() does here. So, plot() itself doesn’t generate a warning, but every function that plot() passes the color argument to does.

Tip: If you get warning or error messages, a thorough look at the Help pages of the function(s) that generated the message can help you determine the reason for it. For example, at the Help page of ?plot.xy, you find that the correct name for the argument is col.

So, to summarize, most warnings point to one of the following problems:

• The function gave you a result, but for some reason that result may not be correct.

• The function generated an atypical outcome, like NA or NaN values.

• The function couldn’t deal with some of the arguments and ignored them.

Only the last one tells you there’s a problem with your syntax. For the other ones, you have to examine your code a bit more.
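As a small illustration of the second kind of warning, here is a sketch in which as.numeric() returns an atypical value (NA) and warns about it. The object names msg and res are made up for the example:

```r
# as.numeric() can't convert "abc", so it returns NA and raises a warning.
# tryCatch() intercepts the warning so you can look at its message.
msg <- tryCatch(as.numeric(c("1", "abc")),
                warning = function(w) conditionMessage(w))
msg

# The function still produces a result, with NA for the bad value
res <- suppressWarnings(as.numeric(c("1", "abc")))
res
```

The result exists, but the NA in it is exactly the kind of outcome you should examine before using the values any further.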

Going Bug Hunting

Although the error message always tells you which line of code generates the error, it may not be the line of code where things started going wrong. This makes bug hunting a complex business, but some simple strategies can help you track down these pesky creatures.

Calculating the logit

To illustrate some bug-hunting strategies in R, we use a simple example. Say, for example, your colleague wrote two functions to calculate the logit from both proportions and percentages, but he can’t get them to work. So, he asks you to help find the bugs. Here’s the code he sends you:

# checks input and does logit calculation

logit <- function(x){

  x <- ifelse(x < 0 | x > 1, "NA", x)

  log(x / (1 - x) )

}

# transforms percentage to number and calls logit

logitpercent <- function(x){

  x <- gsub("%", "", x)

  logit(as.numeric(x))

}

Copy and paste this code into the editor, and save the file using, for example, logitfunc.R as its name. After that, source the file in R from the editor using either the source() function or the source button or command from the editor of your choice. Now the function code is loaded in R, and you’re ready to start hunting.

Technical Stuff: The logit is nothing but the logarithm of the odds, calculated as log(x / (1-x)) if x is the probability of some event taking place. Statisticians use this when modeling binary data using generalized linear models. If you ever need to calculate a logit yourself, you can use the function qlogis() for that. To calculate probabilities from logit values, you use the plogis() function.
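The following sketch compares the formula with the built-in functions. The probabilities in p are arbitrary example values:

```r
p <- c(0.25, 0.5, 0.75)

# The logit calculated by hand and with qlogis()
manual <- log(p / (1 - p))
builtin <- qlogis(p)
all.equal(manual, builtin)

# plogis() transforms the logit values back to probabilities
plogis(builtin)
```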

Knowing where an error comes from

Your colleague complained that he got an error when trying the following code:

> logitpercent('50%')

Error in 1 - x : non-numeric argument to binary operator

Sure enough, but you don’t find the code 1 - x in the body of logitpercent(). So, the error comes from somewhere else. To know from where, you can use the traceback() function immediately after the error occurred, like this:

> traceback()

2: logit(as.numeric(x)) at logitfunc.R#9

1: logitpercent("50%")

Remember: The traceback() function prints what is called the call stack that led to the last error. This call stack represents the sequence of function calls, but in reverse order. The function at the top is the function in which the actual error is generated.

In this example, R called the logitpercent() function, and that function, in turn, called logit(). The traceback tells you that the error occurred inside the logit() function. Even more, the traceback() function tells you that logit() was called from line 9 of the logitfunc.R code file, as indicated by logitfunc.R#9 in the traceback() output.

Tip: The call stack gives you a whole lot of information, sometimes too much. It may point to some obscure internal function as the one that threw the error. If that function doesn’t ring a bell, check higher in the call stack for a function you recognize and start debugging from there.

Looking inside a function

Now that you know where the error came from, you can try to find out how the error came about. If you look at the code, you expect that the as.numeric() function in logitpercent() sends a numeric value to the logit() function. So, you want to check what’s going on in there.

In ancient times, programmers debugged a function simply by letting it print out the value of variables they were interested in. You can do the same by inserting a few print() statements in the logit() function. This way, you can’t examine the object though, and you have to add and delete print statements at every point where you want to peek inside the function. Luckily, we’ve passed the age of the dinosaurs; R gives you better methods to check what’s going on.

Telling R which function to debug

You can step through a function after you tell R you want to debug it using the debug() function, like this:

> debug(logit)

From now on, R will switch to the browser mode every time that function is called from anywhere in R, until you tell R explicitly to stop debugging or until you overwrite the function by sourcing it again. To stop debugging a function, you simply use undebug(logit).

Tip: If you want to step through a function only once, you can use the function debugonce() instead of debug(). R will go to browser mode the next time the function is called, and only that time, so you don’t need to use undebug() to stop debugging.
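If you lose track of whether a function is currently flagged for debugging, isdebugged() tells you. The sketch below uses a hypothetical toy function, square(), so it doesn’t interfere with the logit() example:

```r
# A hypothetical toy function, used only for this illustration
square <- function(x) x^2

debug(square)
flag.on <- isdebugged(square)    # TRUE: the function is flagged

undebug(square)
flag.off <- isdebugged(square)   # FALSE: the flag is removed again

c(flag.on, flag.off)
```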

If you try the function logitpercent() again after running the code debug(logit), you see the following:

> logitpercent('50%')

debugging in: logit(as.numeric(x))

debug at D:/RForDummies/Ch10/logitfunc.R#2: {

    x <- ifelse(x < 0 | x > 1, "NA", x)

    log(x/(1 - x))

}

Browse[2]>

You see that the prompt changed. It now says Browse[2]. This prompt tells you that you’re browsing inside a function.

Technical Stuff: The number indicates at which level of the call stack you’re browsing at that moment. Remember from the output of the traceback() function that the logit() function occurred as the second function on the call stack. That’s the number 2 in the output above.

The additional text above the changed prompt gives you the following information:

• The line from where you called the function (in this case, the line logit(as.numeric(x)) from the logitpercent() function)

• The file or function that you debug (in this case, the file logitfunc.R, starting at the second line)

• Part of the code you’re about to browse through

Stepping through the function

When you’re in browser mode, you can use any R code you want in order to check the state of different objects. You can browse through the function now with the following commands:

• To run the next line of code, type n and press Enter. R enters the step-through mode. To run the subsequent lines of code line by line, you don’t have to type n anymore (although you still can). Just pressing Enter suffices.

• To run the remaining part of the code, type c and press Enter.

• To exit browser mode, type Q and press Enter.

Tip: If you want to look at an object that’s named like any of the special browse commands, you have to specifically print it out, using either print(n) or str(n).

You can try it out yourself now. Just type n in the console, press Enter, and you see the following:

Browse[2]> n

debug at D:/RForDummies/Ch10/logitfunc.R#3: x <- ifelse(x < 0 | x > 1, "NA", x)

R now tells you what line it will run next. Because this is the first line in your code, x still has the value that was passed by the logitpercent() function.

Tip: It’s always smart to check whether that value is what you expect it to be. The logitpercent() function should pass the value 0.50 to logit(), because this is the translation of 50 percent into a proportion. However, if you look at the value of x, you see the following:

Browse[2]> str(x)

num 50

Okay, it is a number, but it’s 100 times larger than it should be. So, in the logitpercent() function, your colleague made a logic error and forgot to divide by 100. If you correct that in the editor window and then save and source the file again, the test command gives the correct answer:

> logitpercent('50%')

[1] 0
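For reference, the corrected file could look like the following sketch. Only the division by 100 is new; the logit() function is deliberately left as your colleague wrote it, including the problem that still has to be tracked down:

```r
# checks input and does logit calculation (unchanged, still buggy)
logit <- function(x){
  x <- ifelse(x < 0 | x > 1, "NA", x)
  log(x / (1 - x) )
}

# transforms percentage to proportion and calls logit
logitpercent <- function(x){
  x <- gsub("%", "", x)
  logit(as.numeric(x) / 100)   # divide by 100 to get a proportion
}

logitpercent('50%')
```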

Start browsing from within the function

This still doesn’t explain the error. Your colleague intended to return NA if the number wasn’t between 0 and 1, but the function doesn’t do that. It’s the ifelse() line in the code where the number is checked, so you know where the problem lies.

You can easily browse through the logit() function until you reach that point, but when your function is larger, that task can become tedious. R allows you to start the browser at a specific point in your code if you insert a browser() statement at that point. For example, to start the browser mode right after the ifelse() line, you change the body of the logit() function, as in the following code, and source it again:

logit <- function(x){

  x <- ifelse(x < 0 | x > 1, "NA", x)

  browser()

  log(x / (1 - x) )

}

Remember: By sourcing the same function again, you implicitly stop debugging the function. That’s why you don’t have to un-debug the function explicitly using undebug(logit).

If you now try to run this function again, you see the following:

> logit(50)

Called from: logit(50)

Browse[1]>

You get less information than you do when you use debug(), but you can use the browser mode in exactly the same way as with debug().

Tip: You can put a browser() statement inside a loop as well. If you use the command c to run the rest of the code, in this case, R will carry out the function only until the next round in the loop. This way, you can step through the loops of a function.

As you entered the function after the ifelse() line, R carried out that code already, so the value of x should be changed to NA. But if you check the value of x now, you see this:

Browse[1]> str(x)

 chr "NA"

Running the next line finally gives the error. Indeed, your colleague made a semantic error here: He wanted to return NA if the value of x wasn’t between 0 and 1, but he accidentally quoted the NA and that makes it a character vector. The code doesn’t have any syntax error, but it’s still not correct.
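You can reproduce the difference between the quoted and the unquoted NA outside the function. In this sketch, the object names quoted and unquoted are made up, and 50 stands in for the value of x:

```r
# With quotes, ifelse() returns the character string "NA"
quoted <- ifelse(50 < 0 | 50 > 1, "NA", 50)
str(quoted)

# Without quotes, ifelse() returns a real missing value
unquoted <- ifelse(50 < 0 | 50 > 1, NA, 50)
str(unquoted)

# Calculations on a missing value simply return NA instead of failing
log(unquoted / (1 - unquoted))
```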

Warning: If you use browser() in your code, don’t forget to delete it afterward. Otherwise, your function will continue to switch to browse mode every time you use it. And don’t think you won’t forget; even experienced R programmers forget this all the time!

Generating Your Own Messages

Generating your own messages may sound strange in a chapter about debugging, but you can prevent bugs by actually generating your own errors. Remember the logic error in the logitpercent() function? It would’ve been easier to spot if the logit() function returned an error saying that you passed a number greater than 1.

Remember: Adding sensible error (or warning) messages to a function can help debugging future functions where you call that specific function again. It especially helps in finding semantic or logic errors that are otherwise hard to find.

Creating errors

You can tell R to throw an error by inserting the stop() function anywhere in the body of the function, as in the following example:

logit <- function(x){

  if( any(x < 0 | x > 1) ) stop('x not between 0 and 1')

  log(x / (1 - x) )

}

With the if() statement, you test whether any value in x lies outside the range from 0 to 1. Using the any() function around the condition allows your code to work with complete vectors at once, instead of with single values. Because the log() function works vectorized as well, the whole function is now vectorized (see Chapter 4).

If you change the body of the logit() function this way and try to calculate the logit of 50% and 150% (or 0.5 and 1.5), R throws an error like the following:

> logitpercent(c('50%','150%'))

Error in logit(as.numeric(x)/100) : x not between 0 and 1

As the name implies, the execution of the code stops anytime the stop() function is actually carried out; hence, it doesn’t return a result.
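When you test a function like this from a script, you may not want the error to stop everything. One option, shown here only as a sketch, is to wrap the call in try(); the object names valid and result are made up for the example:

```r
logit <- function(x){
  if( any(x < 0 | x > 1) ) stop('x not between 0 and 1')
  log(x / (1 - x) )
}

# All values are valid, so the function runs normally
valid <- logit(c(0.25, 0.5))
valid

# One value is invalid: stop() throws the error, try() catches it
result <- try(logit(c(0.5, 1.5)), silent = TRUE)
inherits(result, "try-error")
cat(result)
```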

Creating warnings

Your colleague didn’t intend for the logit() function to stop, though, when some input values were wrong — he just wanted the function to return NA for those values. So, you also could make the function generate a warning instead of an error. That way you still get the same information, but the complete function is carried out so you get a result as well.

To generate a warning, use the warning() function instead of the stop() function. So, to get the result your colleague wants, you simply change the body of the function to the following code:

  x <- ifelse(x < 0 | x > 1, NA, x )

  if( any(is.na(x)) ) warning('x not between 0 and 1')

  log(x / (1 - x) )

If you try the function now, you get the desired result:

> logitpercent(c('50%','150%'))

[1]  0 NA

Warning message:

In logit(as.numeric(x)/100) : x not between 0 and 1

Not only does the function return NA when it should, but it also gives you a warning that can help with debugging other functions that use the logit() function somewhere in the body.
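Put together, the two functions could now look like the following sketch, which combines the warning with the division by 100 from the earlier fix:

```r
# returns NA (with a warning) for values outside the range 0 to 1
logit <- function(x){
  x <- ifelse(x < 0 | x > 1, NA, x )
  if( any(is.na(x)) ) warning('x not between 0 and 1')
  log(x / (1 - x) )
}

# transforms percentage to proportion and calls logit
logitpercent <- function(x){
  x <- gsub("%", "", x)
  logit(as.numeric(x) / 100)
}

res <- logitpercent(c('50%', '150%'))
res
```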

Recognizing the Mistakes You’re Sure to Make

Despite all the debugging tools you have at your disposal, you need some experience to quickly find pesky bugs. But some mistakes are fairly common, and checking whether you made any of these gives you a big chance of pinpointing the error easily. Some of these mistakes come from default behavior of R you didn’t take into account; others are just the result of wool gathering. But every R programmer has made these mistakes at one point, and so will you.

Starting with the wrong data

Probably the most common mistakes in R are made while reading in data from text files using read.table() or read.csv(), as you do in Chapter 12. Many mistakes result in R throwing errors, but sometimes you only notice something went wrong when you look at the structure of your data. In the latter case you often find that some or all variables are converted to factors when they really shouldn’t be (for example, because they should contain only numerical data).

When R gives errors or the structure of your data isn’t what you think it should be, check the following:

• Did you forget to specify the argument header=TRUE? If so, R will see the column names as values and, as a result, read every variable as character data (which versions of R before 4.0.0 then convert to factors by default).

• Did you have spaces in your column names or data? The read.table() function can interpret spaces in, for example, column names or in string data as a separator. You then get errors telling you ‘line x did not have y elements’.

• Did you have a different decimal separator? In some countries, decimals are separated by a comma. You have to specifically tell R that’s the case by using the argument dec="," in the read.table() function.

• Did you forget to specify stringsAsFactors = FALSE? In versions of R before 4.0.0, R changes character data to factors by default, so you have to add this argument if you want your data to remain character variables. Since R 4.0.0, the default is FALSE.

• Did you have another way of specifying missing values? R reads ‘NA’ in a text file as a missing value, but the file may use a different code (for example, ‘missing’). R will see that as text and treat the whole variable as character data (or a factor) instead of numbers. You solve this by specifying the argument na.strings in the read.table() function.

Tip: If you always check the structure of your data immediately after you read it in, you can catch errors much earlier and avoid hours of frustration. Your best bet is to use str() for information on the types and head() to see if the values are what you expected.
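The following sketch shows several of these arguments working together. The data is invented for the example and is passed through the text argument of read.table() instead of being read from a real file; it uses a semicolon as separator, a comma as decimal separator, and the word missing for missing values:

```r
# Made-up data that mimics the contents of a text file
raw.text <- "name;height\nAnn;1,75\nBob;missing\nCarl;1,82"

heights <- read.table(text = raw.text,
                      header = TRUE,             # first line has names
                      sep = ";",                 # field separator
                      dec = ",",                 # decimal comma
                      na.strings = "missing",    # code for missing values
                      stringsAsFactors = FALSE)  # keep text as character

# Check the structure and the values right away
str(heights)
head(heights)
```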

Having your data in the wrong format

As we’ve stressed multiple times, every function in R expects your data to be in a specific format. That doesn’t mean simply whether it’s an integer, character, or factor, but also whether you supply a vector, a matrix, a data frame, or a list. Many functions can deal with multiple formats, but sometimes the result isn’t what you expect at all.

Technical Stuff: In fact, many functions are generic functions that dispatch to a method for the object you supplied as an argument. (See Chapter 8 for more information on dispatching.)

Dropping dimensions when you don’t expect it

This mistake is definitely another classic. R automatically tries to reduce the number of dimensions when subsetting a matrix, array, or data frame (see Chapter 7). If you want to calculate the row sums of the numeric variables in a data frame — for example, the built-in data frame sleep — you can write a little function like this:

rowsum.df <- function(x){

  id <- sapply(x,is.numeric)

  rowSums(x[, id])

}

If you try that out on two built-in data frames, pressure and sleep, you get a result for the first one but the following error message for the second:

> rowsum.df(sleep)

Error in rowSums(x[, id]) :

  'x' must be an array of at least two dimensions

Because sleep contains only a single numeric variable, x[, id] returns a vector instead of a data frame, and that causes the error in rowSums().

Tip: You can solve this problem either by adding drop=FALSE (as shown in Chapter 7) or by using the list subsetting method x[id] instead.
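A sketch of the first solution, with drop=FALSE added to the function from this section:

```r
rowsum.df <- function(x){
  id <- sapply(x, is.numeric)
  # drop = FALSE keeps the result a data frame, even with one column
  rowSums(x[, id, drop = FALSE])
}

sums.sleep <- rowsum.df(sleep)
head(sums.sleep)
```

Because sleep has only one numeric variable, the row sums are simply the values of that variable, but the function no longer throws an error.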

Messing up with lists

Although lists help with keeping data together and come in very handy when you’re processing multiple datasets, they can cause some trouble as well.

First, you can easily forget that some function returns a list instead of a vector. For example, many programmers forget that strsplit() returns a list instead of a vector. So, if you want the second word from a sentence, the following code doesn’t return an error, but it doesn’t give you the right answer either:

> strsplit('this is a sentence',' ')[2]

[[1]]

NULL

In this example, strsplit() returns a list with one element, the vector with the words from the sentence:

> strsplit('this is a sentence',' ')

[[1]]

[1] "this"     "is"       "a"        "sentence"

To access this vector, you first have to select the wanted element from the list. Only then can you look for the second value using the vector indices, like this:

> strsplit('this is a sentence',' ')[[1]][2]

[1] "is"

Even the indexing mechanism itself can cause errors of this kind. For example, you have some names of customers and you want to add a dot between their first and last names. So, first, you split them like this:

> customer <- c('Johan Delong','Marie Petit')

> namesplit <- strsplit(customer,' ')

You want to paste the second name together with a dot in between, so you need to select the second element from the list. If you use single brackets, you get the following:

> paste(namesplit[2],collapse='.')

[1] "c(\"Marie\", \"Petit\")"

That isn’t what you want at all. Remember from Chapter 7 that you can use both single brackets and double brackets to select elements from a list, but when you use single brackets, you always get a list returned. So, to get the correct result, you need double brackets, like this:

> paste(namesplit[[2]],collapse='.')

[1] "Marie.Petit"

Tip: Notice that R never gave a sign, not even a warning, that something was wrong. So, if you notice lists where you wouldn’t expect them (or don’t notice them where you do expect them), check your brackets.
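If you want the dotted name for every customer instead of just the second one, one possible approach is to let sapply() apply paste() to each element of the list. This is a sketch; the object name dotted is made up for the example:

```r
customer <- c('Johan Delong', 'Marie Petit')
namesplit <- strsplit(customer, ' ')

# sapply() hands each list element (a character vector) to paste()
dotted <- sapply(namesplit, paste, collapse = '.')
dotted
```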

Mixing up factors and numeric vectors

If you work with factors that have numeric values as levels, you have to be extra careful when using these factors in models and other calculations. For example, you convert the number of cylinders in the dataset mtcars to a factor like this:

> cyl.factor <- as.factor(mtcars$cyl)

If you want to know the median number of cylinders, you may be tempted to do the following:

> median(as.numeric(cyl.factor))

[1] 2

This result is bogus, because the minimum number of cylinders is four. R converts the internal representation of the factor to numbers, not the labels. So, you get numbers starting from one to the number of levels instead of the original values.

Tip: To correctly transform a factor to its original numeric values, you can first transform the factor to character and then to numeric, as shown in Chapter 5. But on very big data, this is done faster with the following construct:

> as.numeric(levels(cyl.factor))[cyl.factor]

With this code, you create a short vector with the levels as numeric values, and then use the internal integer representation of the factor to select the correct value.
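The following sketch checks that both approaches give the same values and that the median now makes sense. The object names via.levels and via.character are made up for the example:

```r
cyl.factor <- as.factor(mtcars$cyl)

# Convert the levels once, then index them with the factor
via.levels <- as.numeric(levels(cyl.factor))[cyl.factor]

# Convert every value to character first, then to numeric
via.character <- as.numeric(as.character(cyl.factor))

identical(via.levels, via.character)
median(via.levels)
```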

Technical Stuff: Although R often converts a numeric vector to a factor automatically when necessary, it doesn’t do so if both numeric vectors and factors can be used. If you want to model, for example, the mileage of a car to the number of cylinders, you get a different model when you use the number of cylinders as a numeric vector or as a factor. The interpretation of both models is completely different, and a lot depends on what exactly you want to do. But you have to be aware of that, or you may be interpreting the wrong model.
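To see the difference, you can fit both models with lm() on the mtcars dataset. This is only a sketch to show that the two models have different sets of coefficients; the object names fit.numeric and fit.factor are made up for the example:

```r
# Number of cylinders as a numeric vector: one slope for cyl
fit.numeric <- lm(mpg ~ cyl, data = mtcars)
coef(fit.numeric)

# Number of cylinders as a factor: a separate effect for each
# level compared to the first level
fit.factor <- lm(mpg ~ factor(cyl), data = mtcars)
coef(fit.factor)
```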
