Hour 6. Common R Utility Functions


What You’ll Learn in This Hour:

• Common functions for numeric data

• How to simulate data in R

• Simple logical summaries

• Functions for missing data

• Useful functions for manipulating character data


So far you have seen how to create objects of different modes and how to work with special types of data—but what about numeric, logical, and character data? How can we handle missing data or even remove it from our data? How can we simulate from statistical distributions? In this hour, we answer these questions by introducing you to some of the most common utility functions in R that you will find yourself using every day.

Using R Functions

You have already used a number of functions in the previous hours, including seq, matrix, length, and factor. However, before we look at many more useful functions, it is handy to know how to work with functions in R.

When you call a function in R, you use the function name with a number of arguments, which you give inside round brackets, to pass information to that function about how it should run and what data it should use. So how do you know what the arguments to a function are? You can either look in the help file—using ?functionName or help("functionName")—or you can use a function called args, which will print the arguments to a function in the console. As an example of using a function, we will look at sample. This function allows us to randomly sample a number of values from a vector of given values (this is the R way of selecting balls from an urn). So let’s take a look at the arguments to this function:

> args(sample)
function (x, size, replace = FALSE, prob = NULL)
NULL

You can see that we have four arguments to this function. You will notice that the first two are simply given as x and size, whereas the last two are followed by = value. This indicates that they have a default value, so we only need to supply them when we want something other than the default. Arguments without a default, such as x and size here, generally need to be supplied so that R knows what values they should take. To find out the purpose of each argument, take a look at the help file for the function. In this case, x is the vector that we want to sample from and size is the number of samples we want to take, whereas replace allows us to put values back after sampling them, and prob lets us set the probability of sampling each value.

When it comes to calling the function, we can supply the arguments in a number of ways. To start with, we can name all the arguments in full:

> sample(x = c("red", "yellow", "green", "blue"), size = 2, replace = FALSE,
+        prob = NULL)

Because replace and prob have default values, this is the same as the following:

> sample(x = c("red", "yellow", "green", "blue"), size = 2)

Using this form of complete naming of arguments, we can actually supply them in any order we like. Therefore, the preceding would do the same as this:

> sample(size = 2, x = c("red", "yellow", "green", "blue"))

It’s worth remembering that when you actually run each of these lines, you will most likely get a different result because the function is randomly sampling from the vector x.

If you provide all the arguments in the same order as the args function gives them, you do not actually need to give the names of the arguments. Therefore, we can also say this:

> sample(c("red", "yellow", "green", "blue"), 2)

In reality, you will often see, and use, a combination of naming and ordering of arguments because you will tend to remember what should come first but not the order of other arguments. Therefore, you might see something like the following:

> sample(c("red", "yellow", "green", "blue"), size = 2, replace = TRUE)
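The calling styles above can be checked against one another. The following is a minimal sketch rather than part of the original examples: it assumes the set.seed function (described in a note later in this hour) to fix the random number stream, and the object names urn, fullyNamed, reordered, and byPosition are purely illustrative.

```r
# Illustrative vector of values to sample from
urn <- c("red", "yellow", "green", "blue")

# Reset the seed before each call so every call sees the same random stream
set.seed(42)
fullyNamed <- sample(x = urn, size = 2, replace = FALSE, prob = NULL)

set.seed(42)
reordered <- sample(size = 2, x = urn)

set.seed(42)
byPosition <- sample(urn, 2)

# All three calls are matched to the same arguments, so the results agree
identical(fullyNamed, reordered)   # TRUE
identical(fullyNamed, byPosition)  # TRUE
```

The seed value 42 is arbitrary; what matters is that the same seed is set before each call.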

Now that you know more about how to call functions, we will look at some useful functions for various types of data.

Functions for Numeric Data

When it comes to working with numeric data, there is a whole host of functions we may want to use, from mathematical functions such as logarithms to functions for simulating from statistical distributions. We won’t cover every single function available in R, but we will introduce you to some of the most common.

Mathematical Functions and Operators

You have already, briefly, seen that you can use R for basic arithmetic using functions such as +, -, *, and /. In R, these are known as operators, and other useful operators include ^ (power) and %% (mod). Here’s an example:

> 3^2
[1] 9
> 5 %% 3
[1] 2

Other useful mathematical functions are shown in Table 6.1.

TABLE 6.1 Mathematical Functions

All these functions are used in the same simple way: you pass the data of interest, typically a vector or matrix, as the argument x. For logarithms, you can also provide the base, which defaults to e (giving the natural logarithm). As an example, let’s create a simple vector of values to pass to some of these functions:

> x <- seq(1, 4, by = 0.5)
> x
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0
> sqrt(x)
[1] 1.000000 1.224745 1.414214 1.581139 1.732051 1.870829 2.000000
> log(x)
[1] 0.0000000 0.4054651 0.6931472 0.9162907 1.0986123 1.2527630 1.3862944
> sin(x)
[1]  0.8414710  0.9974950  0.9092974  0.5984721  0.1411200 -0.3507832 -0.7568025
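As a small sketch of the base argument for logarithms, the following compares the default natural logarithm with logarithms to other bases. The input values are chosen purely for illustration, and log10 and log2 are base R shortcut functions that are not otherwise covered here.

```r
# Natural logarithm (the default base is e)
natural <- log(exp(1))
natural                # 1

# Supplying a different base through the base argument
base10 <- log(100, base = 10)
base10                 # 2
base2 <- log(8, base = 2)
base2                  # 3

# Shortcut functions for the two most common alternative bases
log10(100)             # 2
log2(8)                # 3
```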

As you can see, these are very simple functions to use. When you combine them with the arithmetic operators in a larger expression, R follows the standard mathematical order of operations (that is, brackets, then powers, then division and multiplication, then addition and subtraction).
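The order of operations can be seen in a few one-line calculations. This is a minimal sketch with illustrative values; the comment on each line gives the value R prints.

```r
# Multiplication is evaluated before addition
2 + 3 * 4      # 14

# Brackets override the default order
(2 + 3) * 4    # 20

# Powers are evaluated before multiplication
2 * 3^2        # 18

# A leading minus sign is applied after the power, so this is -(2^2)
-2^2           # -4
(-2)^2         # 4

# A function evaluates its whole argument before it is applied
sqrt(9 + 16)   # 5
```

The -2^2 case is worth remembering: wrap a negative number in brackets if you want to raise the negative value itself to a power.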

Statistical Summary Functions

When it comes to statistical summaries, there is a whole host of functions you could choose to use to find out more about your data. Just like the mathematical functions you saw in the previous section, these are all very simple to use, and often you need only provide the data to the function. Table 6.2 shows some of the most common summary functions.

TABLE 6.2 Statistical Summaries

The first argument to all these functions is the data and should be a single vector of values. Here’s an example:

> age <- c(38, 20, 44, 41, 46, 49, 43, 23, 28, 32)
> median(age)
[1] 39.5
> mad(age)
[1] 10.3782
> range(age)
[1] 20 49
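A few more summaries of the same vector are sketched below. The functions used (mean, min, max, quantile, and summary) are all part of base R, but whether each one appears in Table 6.2 is an assumption, so treat this as an illustration rather than a reproduction of the table.

```r
age <- c(38, 20, 44, 41, 46, 49, 43, 23, 28, 32)

# Location and extremes
mean(age)   # 36.4
min(age)    # 20
max(age)    # 49

# Quantiles: the 50% quantile is the median
midpoint <- quantile(age, probs = 0.5)
midpoint

# Several summaries in a single call
summary(age)
```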

When you are working with missing data, you need to take a little extra care with these functions. Take a look at this example:

> age[3] <- NA
> median(age)
[1] NA

As you can see, when you have missing values in your data, the median function, and in fact all the statistical summary functions in Table 6.2, will return NA. Although this is a technically correct value to return, you are typically more interested in the value of the summary after removing the missing values. By using the argument na.rm, you can do this easily:

> median(age, na.rm = TRUE)
[1] 38
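The same na.rm argument is accepted by other summary functions. The sketch below uses mean and range as examples and re-creates the age vector with its missing value so that it can be run on its own.

```r
age <- c(38, 20, NA, 41, 46, 49, 43, 23, 28, 32)

# Without na.rm, the missing value propagates to the result
mean(age)                  # NA
range(age)                 # NA NA

# With na.rm = TRUE, the summary uses the nine observed values
meanObserved <- mean(age, na.rm = TRUE)
meanObserved               # 35.55556
rangeObserved <- range(age, na.rm = TRUE)
rangeObserved              # 20 49
```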

You will see before the end of this hour how to remove missing values from a vector without these functions.

Simulation and Statistical Distributions

For working with statistical distributions in R, we have functions for working with all of the common distributions and all the common actions. All the functions follow the same pattern of naming, which starts with a single letter to identify what we want to do and is followed by the R code name for the distribution. Table 6.3 shows some of the most common distributions available in the base R installation. Many other distributions, such as the Pareto distribution, are available in contributed packages.

TABLE 6.3 R Codes for Statistical Distributions

The list of distributions in Table 6.3 is by no means exhaustive, and many more can be found in the help pages by simply searching for the name of the distribution. As stated earlier, you will need to combine the code for the distribution with a letter that determines what you want to do, for example, sample random values or calculate quantiles. The letters you will need, their purpose, and an example of structuring the function name with the Normal distribution are shown in Table 6.4.

TABLE 6.4 Distribution Functions
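As a rough sketch of the naming pattern, the following calls combine the standard base R prefixes d (density), p (cumulative probability), q (quantile), and r (random generation) with the code norm. The prefix meanings are stated from general R usage rather than copied from the table, and the input values are illustrative.

```r
# d: density of the Standard Normal at 0
dnorm(0)       # 0.3989423

# p: cumulative probability P(X <= 0)
pnorm(0)       # 0.5

# q: quantile, the inverse of the p function
qnorm(0.5)     # 0
upper <- qnorm(0.975)
upper          # 1.959964

# r: random values; the results differ from run to run
draws <- rnorm(3)
draws
```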

In addition to the first argument shown in Table 6.4, which is the same whichever distribution you use, there will be additional arguments specific to the distribution. For example, the Normal distribution has the arguments mean and sd, which default to the Standard Normal values (0 and 1, respectively), whereas the Poisson distribution has the argument lambda, which has no default. In general, the arguments are set to the “standard” values for that distribution; where the distribution has no standard form, no defaults are set. For example, simulating five values from each of the Normal, Poisson, and Exponential distributions might look something like this:

> rnorm(5)
[1] -0.23515046 -1.79043043 -0.03287786 -0.24937333 -1.00660505
> rpois(5, lambda = 3)
[1] 4 6 6 3 1
> rexp(5)
[1] 3.2443094 1.1198132 0.9365825 0.2731334 0.4363149

Although this allows you to simulate values from a distribution, you may want to generate samples from existing data. You have already seen the function for this: sample. As you have seen, this function allows you to specify the vector you want to sample from, the number of samples you want, whether you want to replace the values or not, and whether you want to change the probability of sampling particular values, which are equal by default. As an example, let’s sample from our vector of ages:

> sample(age, size = 5)
[1] 28 46 20 49 23
> sample(age, size = 5, replace = TRUE)
[1] 20 20 23 28 41

As we saw previously, setting replace to TRUE allows a value to be sampled more than once, whereas with replace set to FALSE a value cannot be sampled again once it has been drawn.
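The prob argument has not been used in the examples so far. A minimal sketch, with illustrative values and probabilities, might look like the following; the object name draws is an assumption of this example.

```r
# Sample ten values with replacement; "red" is nine times as likely as "blue"
draws <- sample(c("red", "blue"), size = 10, replace = TRUE,
                prob = c(0.9, 0.1))
draws

# Tabulate how often each value was drawn (counts vary from run to run)
table(draws)
```

The probabilities are matched to the values by position, so the first probability belongs to the first value in the vector.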


Note: Re-creating Simulated Values

If you want to be able to re-create the random samples you generated, you will need to set the random seed. You can do this with the function set.seed, which simply takes an integer value to indicate the seed to use. You can also use this function to change the type of random number generator used. See the help documentation for more details.
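As a short sketch of this idea, the following sets the same seed twice and checks that the simulated values match. The seed value 123 is arbitrary, and the object names are illustrative.

```r
# Set the seed, then simulate
set.seed(123)
first <- rnorm(5)

# Resetting the same seed restarts the random number stream
set.seed(123)
second <- rnorm(5)

identical(first, second)   # TRUE

# Without resetting the seed, the next draw continues the stream
third <- rnorm(5)
identical(first, third)    # FALSE
```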


Logical Data

One of the main ways you will work with logical data is to subset the data, as we did in Hour 3, “Single-Mode Data Structures.” There are, however, a couple of functions that you will find particularly useful for counting the number of cases that meet a condition.

First of all, it is worth knowing how logical data is stored in R. As you have seen, a logical vector contains only values that are TRUE or FALSE. In R, these are in fact stored as the numeric values 1 and 0, respectively. You can see this by using the as.numeric function to force the numeric representation of a value, like so:

> as.numeric(c(TRUE, FALSE))
[1] 1 0

Therefore, when you have a logical vector, you can actually use the numeric functions you have seen to manipulate it. Of course, finding the variance of TRUE and FALSE values is not generally something that you want to do, but the sum function will actually allow you to count the total number of TRUE occurrences. As an example, let’s see how many values in the age vector from the previous section are less than 30:

> age
[1] 38 20 NA 41 46 49 43 23 28 32
> age < 30
[1] FALSE  TRUE    NA FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
> sum(age < 30, na.rm = TRUE)
[1] 3

Another useful function for counting the TRUE versus FALSE cases is table:

> table(age < 30)

FALSE  TRUE
    6     3
> table(age < 30, useNA = "ifany")

FALSE  TRUE  <NA>
    6     3     1

You can, in fact, use the table function to display the number of cases for any vector, but as you can see, this is useful for tabulating logical cases. You will also notice that by default the function does not include missing values. However, if you set the argument useNA to "ifany", missing values will be included when they are present.
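A few related logical summaries are sketched below. The functions any, all, and which are base R functions that go beyond what this section covers, so they are included as an illustration only; the object name under30 is an assumption of this example.

```r
age <- c(38, 20, NA, 41, 46, 49, 43, 23, 28, 32)
under30 <- age < 30

# The mean of a logical vector is the proportion of TRUE values
mean(under30, na.rm = TRUE)    # 0.3333333

# Is any value over 45? Are all values at least 18?
any(age > 45, na.rm = TRUE)    # TRUE
all(age >= 18, na.rm = TRUE)   # TRUE

# Positions of the TRUE values (NA entries are ignored)
which(under30)                 # 2 8 9
```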

Missing Data

Many of the statistical summary functions allow you to easily remove your missing data from calculations. As you will see when we look at graphics and model fitting, missing data is simply removed. But what if you want to identify the missing values to, for example, determine how many missing values there are or to replace them in some way?

You saw in the last section that you can count the number of TRUE values using the sum function, so you can easily count missing values if you are able to create a logical vector indicating which values are missing. If you were to simply test for values being equal to the missing value NA, R would in fact just return a series of NAs. Here’s an example:

> age <- c(38, 20, NA, 41, 46, 49, 43, 23, 28, 32)
> age == NA
[1] NA NA NA NA NA NA NA NA NA NA

This is because we are asking R whether or not each value in the vector is equal to some value that we don’t actually know. In each case, R doesn’t know the answer! Therefore, you need to use an alternative function for determining whether a value is missing: is.na. This is actually one of a whole series of is.x functions, some of which you will see throughout this book, that allow you to test whether data is of a particular type. Therefore, in this case, you can say the following:

> is.na(age)
[1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Thus, you can count the number of cases of missing data or generate a table showing the number of missing and non-missing cases, for example:

> sum(is.na(age))
[1] 1
> table(is.na(age))

FALSE  TRUE
    9     1

Alternatively, you may want to replace your missing values with the mean of the data, or some other value. A useful function for doing this is the replace function. Although this function is not restricted to working with missing data, this is often what you’ll be interested in doing. You need to provide this function with three pieces of information: first, the vector of data; second, a condition that returns TRUE and FALSE values to determine which values should be replaced; third, the value to be inserted. Suppose, for example, we wanted to replace the missing age value in the age vector with the mean of the remainder of the age values:

> meanAge <- mean(age, na.rm = TRUE)
> missingObs <- is.na(age)
> age <- replace(age, missingObs, meanAge)
> age
[1] 38.00000 20.00000 35.55556 41.00000 46.00000 49.00000 43.00000
[8] 23.00000 28.00000 32.00000
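The same replacement can also be written with a logical subscript on the left-hand side of an assignment. The sketch below compares the two approaches; the object names viaReplace and viaSubscript are illustrative.

```r
age <- c(38, 20, NA, 41, 46, 49, 43, 23, 28, 32)

# Version using replace, as in the example above
viaReplace <- replace(age, is.na(age), mean(age, na.rm = TRUE))

# Version using a logical subscript on the left of the assignment
viaSubscript <- age
viaSubscript[is.na(viaSubscript)] <- mean(viaSubscript, na.rm = TRUE)

identical(viaReplace, viaSubscript)   # TRUE
```

Note that replace returns a modified copy and leaves the original vector untouched unless you assign the result back to it.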

Of course, if we simply wanted to remove the missing values from our data, we could use is.na in combination with the “not” operator (!), along with the standard subscripting techniques you saw in Hour 3. Because we have just filled in the missing value in age, we first re-create the original vector and then remove the missing value from it:

> age <- c(38, 20, NA, 41, 46, 49, 43, 23, 28, 32)
> age[!is.na(age)]
[1] 38 20 41 46 49 43 23 28 32
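Base R also provides na.omit, which is not covered in this section but does a similar job. The following sketch compares it with the subscripting approach; the object names are illustrative.

```r
age <- c(38, 20, NA, 41, 46, 49, 43, 23, 28, 32)

# Subscripting approach
bySubscript <- age[!is.na(age)]

# na.omit drops the missing values but records what it removed in an attribute
byOmit <- na.omit(age)
attr(byOmit, "na.action")   # records that position 3 was removed

# Stripping the attributes recovers the same plain vector
identical(as.vector(byOmit), bySubscript)   # TRUE
```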


Tip: Missing Data Functions

A number of useful functions for working with missing data can be found in the zoo package. This includes functions such as na.locf for the last observation carried forward and na.trim for trimming leading and trailing missing values.


Character Data

We can often find ourselves having to perform string manipulation tasks in R, including creation of character strings and searching for patterns in character strings. In this section, we look at some of the functions in the base R installation, but if you are particularly interested in manipulating strings, you may be interested in the stringr and stringi packages.

Simple Character Manipulation

Some of the basic manipulations you’ll want to perform are counting characters, extracting substrings, and combining elements to create or update a string. Let’s start with counting characters. You do this using the nchar function, simply providing the string that you are interested in:

> fruits <- "apples, oranges, pears"
> nchar(fruits)
[1] 22

Notice that all characters are counted, including the commas and spaces. To extract substrings, you use the substring function. Here, you need to give the string along with the start and end positions for the substring. You can extract multiple substrings by giving vectors of start and end positions.

> substring(fruits, 1, 6)
[1] "apples"
> fruits <- substring(fruits, c(1, 9, 18), c(6, 15, 22))
> fruits
[1] "apples"  "oranges" "pears"
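When the pieces are separated by a known delimiter, counting character positions by hand can be avoided. The sketch below uses the base R function strsplit, which is not covered in this section, so treat it as an aside; the object names fruitString and pieces are illustrative.

```r
fruitString <- "apples, oranges, pears"

# strsplit returns a list with one element per input string
pieces <- strsplit(fruitString, split = ", ", fixed = TRUE)

# Extract the character vector for the first (and only) input string
pieces[[1]]                 # "apples"  "oranges" "pears"

# substr is a close relative of substring for a single start and stop
substr(fruitString, 1, 6)   # "apples"
```

The argument fixed = TRUE tells strsplit to treat the delimiter as literal text rather than as a regular expression.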

Finally, you can create a character string from a series of strings or numeric values using the paste function. You can provide as many strings and objects as you wish to the paste function, and they will all be converted to character data and pasted together. As with many R functions, you can pass vectors to the paste function. Here’s an example:

> paste(5, "apples")
[1] "5 apples"
> nfruits <- c(5, 9, 2)
> paste(nfruits, fruits)
[1] "5 apples"  "9 oranges" "2 pears"

As you can see in the preceding example, the pasted strings are separated by a space by default. You can use the argument sep to change the separator, like so:

> paste(fruits, nfruits, sep = " = ")
[1] "apples = 5"  "oranges = 9" "pears = 2"
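Two further variations on paste are sketched below: the collapse argument and the paste0 function. Both are part of base R but go beyond the examples in this section, so they are shown as an illustration with made-up inputs.

```r
fruits <- c("apples", "oranges", "pears")

# collapse joins the elements of the result into a single string
oneString <- paste(fruits, collapse = ", ")
oneString    # "apples, oranges, pears"

# sep and collapse can be combined
shopping <- paste(c(5, 9, 2), fruits, sep = " ", collapse = "; ")
shopping     # "5 apples; 9 oranges; 2 pears"

# paste0 is shorthand for paste with sep = ""
labels <- paste0("fruit", 1:3)
labels       # "fruit1" "fruit2" "fruit3"
```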

Searching and Replacing

Two of the most useful functions for working with character data are the functions grep and gsub. These functions allow you to search elements of a vector for a particular pattern (grep) and replace a particular pattern with a given string (gsub). You search for patterns using regular expressions (that is, a pattern that describes the character string).


Tip: Regular Expressions

Much more information on regular expressions can be found in the R help page on the topic regex (type ?regex). If you are familiar with Perl-style regular expressions, you can use these along with the argument perl = TRUE.


Let’s start by looking at the function grep. The first argument that we are going to give is the pattern to search for, which can be as simple as the string "red". The second argument will be the vector to search.

> colourStrings <- c("green", "blue", "orange", "red", "yellow",
+                    "lightblue", "navyblue", "indianred")
> grep("red", colourStrings, value = TRUE)
[1] "red"       "indianred"

In this example, we have used an additional argument, value. This allows us to return the actual values of the vector that include the pattern rather than simply the index of their position in the vector. Some more examples of using the grep function, with a variety of regular expressions, are shown in Listing 6.1.

LISTING 6.1 Searching Character Strings


 1: > colourStrings <- c("green", "blue", "orange", "red", "yellow",
 2: +                    "lightblue", "navyblue", "indianred")
 3: >
 4: > grep("^red", colourStrings, value = TRUE)
 5: [1] "red"
 6: > grep("red$", colourStrings, value = TRUE)
 7: [1] "indianred"
 8: >
 9: > grep("r+", colourStrings, value = TRUE)
10: [1] "green"     "orange"    "red"       "indianred"
11: >
12: > grep("e{2}", colourStrings, value = TRUE)
13: [1] "green"


In lines 4 and 6 you can see how the symbols ^ and $ have been used to mark the start and end of the string. In the example in line 4, we are specifying that immediately following the start of the string is the pattern "red", whereas in line 6 the string ends straight after the pattern "red". The examples in lines 9 and 12 show how to specify that something must appear a given number of times. In line 9, the + indicates that the letter r should appear one or more times in succession, so any string containing at least one r matches. In line 12, the {2} following the e indicates that the letter must appear twice in a row.
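The following sketch shows what grep returns without value = TRUE, along with two related options. The grepl function and the ignore.case argument are part of base R but are not covered above, so they are included as an aside.

```r
colourStrings <- c("green", "blue", "orange", "red", "yellow",
                   "lightblue", "navyblue", "indianred")

# Without value = TRUE, grep returns the positions of the matches
grep("red", colourStrings)          # 4 8

# grepl returns one TRUE or FALSE per element, which is handy for counting
isBlue <- grepl("blue", colourStrings)
sum(isBlue)                         # 3

# ignore.case = TRUE makes the match case-insensitive
grep("RED", colourStrings, ignore.case = TRUE, value = TRUE)
# "red" "indianred"
```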

The gsub function, which allows you to replace a pattern with a given string, is used in a very similar way because you also use regular expressions to search for the pattern. The only additional information you need to give is the replacement, which is supplied as the second argument, between the pattern and the vector to search. Here is an example:

> gsub("red", "brown", colourStrings)
[1] "green"       "blue"        "orange"      "brown"       "yellow"
[6] "lightblue"   "navyblue"    "indianbrown"

As with grep, you can use any regular expression to match the pattern you wish to replace.
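A few variations on replacement are sketched below with illustrative strings. The sub function and the fixed argument are part of base R but are not introduced in this section, so treat them as an aside.

```r
# sub replaces only the first match in each string; gsub replaces them all
firstOnly <- sub("e", "E", "green")
firstOnly   # "grEen"
allMatches <- gsub("e", "E", "green")
allMatches  # "grEEn"

# An anchored pattern limits the replacement to the start of the string
anchored <- gsub("^red", "brown", c("red", "indianred"))
anchored    # "brown" "indianred"

# A dot matches any character in a regular expression ...
asRegex <- gsub(".", "-", "a.b")
asRegex     # "---"
# ... unless fixed = TRUE asks for a literal match
asLiteral <- gsub(".", "-", "a.b", fixed = TRUE)
asLiteral   # "a-b"
```

The last pair shows why a pattern can behave unexpectedly: characters such as the dot have a special meaning in regular expressions.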

Summary

In this hour, you saw a number of useful functions when working with a variety of data types. You saw some of the standard mathematical and statistical functions, as well as simulation functions. You also saw how to manipulate character strings, logical values, and missing data. We will use many of these functions throughout the rest of this book, though this is by no means an exhaustive list of useful functions you will find in R. In the next hour, we will start to look at how to write our own functions for common actions we want to perform.

Q&A

Q. I want to simulate data from a distribution that is not listed here. What do I do?

A. First of all, try searching the help documentation using the name of the distribution. We have not given an exhaustive list of all available distributions in this hour, so there is a good chance we just haven’t listed it. If you don’t immediately find it in the base R help documentation, it may be that there is a package that includes the distribution functions you need; for example, the Pareto distribution can be found in the package evir, among others.

Q. I am trying to use regular expressions to find a particular value to replace, but I simply get back the original vector. Why isn’t it replacing my pattern?

A. If gsub returns the original vector unchanged, it is most likely because your regular expression does not match any part of the strings. Check the pattern carefully by thinking about what will be at the start of the string, whether there may be spaces, whether the case of the letters matches, and how many occurrences of a pattern there may be.

Workshop

The workshop contains quiz questions and exercises to help you solidify your understanding of the material covered. Try to answer all questions before looking at the “Answers” section that follows.

Quiz

1. Take a look at the following three function calls. Would they all give the same result?

A. matrix(1:9, 3, 3)

B. matrix(nrow = 3, ncol = 3, data = 1:9)

C. matrix(data = 1:9, nrow = 3, ncol = 3)

2. What function would you need to call to find the 95% quantile of the standard Exponential distribution?

3. How is logical data stored in R?

4. What function would you use to test whether your data is missing?

5. What is the purpose of the function paste?

Answers

1. Yes, all three would produce the same matrix. When you name the arguments, it doesn’t matter what order you provide them in, and as long as you give the arguments in the correct order there is no need to name them. Typically, you will see a combination of naming and ordering of arguments.

2. For the quantiles, you use the q* functions along with the distribution code, which in this case would be exp, so you would call

> qexp(0.95)

3. Although you see TRUE and FALSE in vectors of logical data, they are actually stored as 1 and 0. This is what allows you to take the sum to find the number of TRUE cases.

4. You test for missing values using the function is.na.

5. The paste function allows you to combine character strings and vectors of values. This is particularly useful if you wanted to, for example, create character strings for a plot title from a fixed string and a value in the data.

Activities

1. Using the Normal distribution, simulate 50 values with the same mean and standard deviation as the Ozone variable in the airquality data.

2. What is the range of values in your simulated data?

3. How many values in your simulated data are larger than the mean of your data?

4. A function in R called colors returns a vector of all colors known by name. Using the grep function, create a vector that contains only colors that contain the string "blue".

5. How many colors do you have in your vector of blues?

6. Replace the pattern "blue" with "green" throughout your vector.
