Chapter 6. Environments and Functions

We’ve already used a variety of the functions that come with R. In this chapter, you’ll learn what a function is, and how to write your own. Before that, we’ll take a look at environments, which are used to store variables.

Chapter Goals

After reading this chapter, you should:

  • Know what an environment is, and how to create one
  • Be able to create, access, and list variables within an environment
  • Understand the components that make up a function
  • Be able to write your own functions
  • Understand variable scope

Environments

All the variables that we create need to be stored somewhere, and that somewhere is an environment. Environments themselves are just another type of variable—we can assign them, manipulate them, and pass them into functions as arguments, just like we would any other variable. They are closely related to lists in that they are used for storing different types of variables together. In fact, most of the syntax for lists also works for environments, and we can coerce a list to be an environment (and vice versa).

Usually, you won’t need to explicitly deal with environments. For example, when you assign a variable at the command prompt, it will automatically go into an environment called the global environment (also known as the user workspace). When you call a function, an environment is automatically created to store the function-related variables. Understanding the basics of environments can be useful, however, in understanding the scope of variables, and for examining the call stack when debugging your code.

Slightly annoyingly, environments aren’t created with the environment function (that function returns the environment that contains a particular function). Instead, what we want is new.env:

an_environment <- new.env()

Assigning variables into environments works in exactly the same way as with lists. You can either use double square brackets or the dollar sign operator. As with lists, the variables can be of different types and sizes:

an_environment[["pythag"]] <- c(12, 15, 20, 21) #See http://oeis.org/A156683
an_environment$root <- polyroot(c(6, -5, 1))

The assign function that we saw in Assigning Variables takes an optional environment argument that can be used to specify where the variable is stored:

assign(
  "moonday",
  weekdays(as.Date("1969/07/20")),
  an_environment
)

Retrieving the variables works in the same way—you can either use list-indexing syntax, or assign’s opposite, the get function:

an_environment[["pythag"]]
## [1] 12 15 20 21
an_environment$root
## [1] 2+0i 3-0i
get("moonday", an_environment)
## [1] "Sunday"

The ls and ls.str functions also take an environment argument, allowing you to list the contents of that environment:

ls(envir = an_environment)
## [1] "moonday" "pythag"  "root"
ls.str(envir = an_environment)
## moonday :  chr "Sunday"
## pythag :  num [1:4] 12 15 20 21
## root :  cplx [1:2] 2+0i 3-0i

We can test to see if a variable exists in an environment using the exists function:

exists("pythag", an_environment)
## [1] TRUE
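
Checking with exists and then retrieving with get can be collapsed into a single step. The following is a minimal sketch, assuming R 3.2.0 or later (where base R provides get0); demo_env and the variable names are illustrative only:

```r
# A throwaway environment for the demonstration (names are illustrative).
demo_env <- new.env()
demo_env$pythag <- c(12, 15, 20, 21)

# get0 returns the value if the variable exists...
found <- get0("pythag", envir = demo_env, inherits = FALSE)

# ...and the ifnotfound value (NULL by default) if it does not,
# rather than throwing an error as get would.
missing_value <- get0(
  "no_such_variable",
  envir = demo_env,
  inherits = FALSE,
  ifnotfound = NA
)
```

Here found holds the stored vector and missing_value is NA, so no error handling is needed for the absent variable.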

Conversion from environment to list and back again uses the obvious functions, as.list and as.environment. In the latter case, there is also a function list2env that allows for a little more flexibility in the creation of the environment:

#Convert to list
(a_list <- as.list(an_environment))
## $pythag
## [1] 12 15 20 21
##
## $moonday
## [1] "Sunday"
##
## $root
## [1] 2+0i 3-0i
#...and back again.  Both lines of code do the same thing.
as.environment(a_list)
## <environment: 0x000000004a6fe290>
list2env(a_list)
## <environment: 0x000000004ad10288>
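
One example of the extra flexibility is that list2env can copy list elements into an environment that already exists, via its envir argument. This is a small sketch; target and the element names are made up for illustration:

```r
# An existing environment that already holds one variable.
target <- new.env()
target$existing <- 1

# Copy the elements of a list into that environment.
list2env(list(a = 1, b = "two"), envir = target)

# The environment now contains the old and the new variables.
ls(envir = target)
## [1] "a"        "b"        "existing"
```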

All environments are nested, meaning that they must have a parent environment (the exception is a special environment called the empty environment that sits at the top of the chain). By default, the exists and get functions will also look for variables in the parent environments. Pass inherits = FALSE to them to change this behavior so that they will only look in the environment that you’ve specified:

nested_environment <- new.env(parent = an_environment)
exists("pythag", nested_environment)
## [1] TRUE
exists("pythag", nested_environment, inherits = FALSE)
## [1] FALSE
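
To see the chain of parents for yourself, you can follow it with parent.env until you reach the empty environment. This sketch recreates the two environments so that it runs on its own; the loop and the steps counter are illustrative rather than something you would normally need:

```r
an_environment <- new.env()
an_environment$pythag <- c(12, 15, 20, 21)
nested_environment <- new.env(parent = an_environment)

# parent.env returns the parent of an environment.
identical(parent.env(nested_environment), an_environment)
## [1] TRUE

# Count the steps from the nested environment up to the empty environment.
steps <- 0
current <- nested_environment
while (!identical(current, emptyenv())) {
  current <- parent.env(current)
  steps <- steps + 1
}
steps  # at least 2; the exact value depends on which packages are loaded
```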

Note

The word “frame” is used almost interchangeably with “environment.” (See section 2.1.10 of the R Language Definition manual that ships with R for the technicalities.) This means that some functions that work with environments have “frame” in their name, parent.frame being the most common of these.

Shortcut functions are available to access both the global environment (where variables that you assign from the command prompt are stored) and the base environment (this contains functions and other variables from R’s base package, which provides basic functionality):

non_stormers <- c(3, 7, 8, 13, 17, 18, 21) #See http://oeis.org/A002312
get("non_stormers", envir = globalenv())
## [1]  3  7  8 13 17 18 21
head(ls(envir = baseenv()), 20)
##  [1] "-"                 "-.Date"            "-.POSIXt"
##  [4] "!"                 "!.hexmode"         "!.octmode"
##  [7] "!="                "$"                 "$.data.frame"
## [10] "$.DLLInfo"         "$.package_version" "$<-"
## [13] "$<-.data.frame"    "%%"                "%*%"
## [16] "%/%"               "%in%"              "%o%"
## [19] "%x%"               "&"

There are two other situations where we might encounter environments. First, whenever a function is called, all the variables defined by the function are stored in an environment belonging to that function (a function plus its environment is sometimes called a closure). Second, whenever we load a package, the functions in that package are stored in an environment on the search path. This will be discussed in Chapter 10.
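
The first situation can be observed directly. The sketch below uses the function-definition syntax that is covered later in this chapter, and the function names are made up for illustration. Calling environment with no arguments returns the environment of the current function call:

```r
show_locals <- function(x) {
  doubled <- x * 2
  # List everything stored in the environment of this call:
  # the argument x and the local variable doubled.
  ls(environment())
}
show_locals(21)
## [1] "doubled" "x"

# Each call gets a fresh environment of its own.
call_env <- function() environment()
identical(call_env(), call_env())
## [1] FALSE
```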

Functions

While most variable types are for storing data, functions let us do things with data—they are “verbs” rather than “nouns.” Like environments, they are just another data type that we can assign and manipulate and even pass into other functions.

Creating and Calling Functions

In order to understand functions better, let’s take a look at what they consist of.

Typing the name of a function shows you the code that runs when you call it. This is the rt function, which generates random numbers from a t-distribution:[20]

rt
## function (n, df, ncp)
## {
##     if (missing(ncp))
##         .External(C_rt, n, df)
##     else rnorm(n, ncp)/sqrt(rchisq(n, df)/df)
## }
## <bytecode: 0x0000000019738e10>
## <environment: namespace:stats>

As you can see, rt takes up to three input arguments: n is the number of random numbers to generate, df is the number of degrees of freedom, and ncp is an optional noncentrality parameter. To be technical, the three arguments n, df, and ncp are the formal arguments of rt. When you are calling the function and passing values to it, those values are just called arguments.

Note

The difference between arguments and formal arguments isn’t very important, so the rest of the book doesn’t make an effort to differentiate between the two concepts.

In between the curly braces, you can see the lines of code that constitute the body of the function. This is the code that is executed each time you call rt.

Notice that there is no explicit “return” keyword to state which value should be returned from the function. In R, the last value that is calculated in the function is automatically returned. In the case of rt, if the ncp argument is omitted, some C code is called to generate the random numbers, and those are returned. Otherwise, the function calls the rnorm, rchisq, and sqrt functions to generate the numbers, and those are returned.
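
R does also provide a return function for leaving a function early. The following sketch looks ahead slightly to the syntax for defining functions, and describe_sign is a made-up example. It mixes both styles:

```r
describe_sign <- function(x) {
  if (x < 0) {
    return("negative")  # leave the function immediately
  }
  "non-negative"        # the last value evaluated is returned implicitly
}
describe_sign(-3)
## [1] "negative"
describe_sign(5)
## [1] "non-negative"
```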

To create our own functions, we just assign them as we would any other variable. As an example, let’s create a function to calculate the length of the hypotenuse of a right-angled triangle. For simplicity, we’ll use the obvious algorithm. Squaring very big or very small numbers can overflow or underflow, so you shouldn’t calculate hypotenuses this way in real-world code:

hypotenuse <- function(x, y)
{
  sqrt(x ^ 2 + y ^ 2)
}

Here, hypotenuse is the name of the function we are creating, x and y are its (formal) arguments, and the contents of the braces are the function body.

Actually, since our function body is only one line of code, we can omit the braces:

hypotenuse <- function(x, y) sqrt(x ^ 2 + y ^ 2)   #same as before

Note

R is very permissive about how you space your code, so “one line of code” can be stretched to run over several lines. The amount of code that can be included without braces is one statement. The exact definition of a statement is technical, but from a practical point of view, it is the amount of code that you can type at the command line before it executes.

We can now call this function as we would any other:

hypotenuse(3, 4)
## [1] 5
hypotenuse(y = 24, x = 7)
## [1] 25
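
To illustrate the earlier warning about very big numbers, here is a sketch of one common remedy: rescaling by the larger side before squaring. This is an illustrative alternative rather than a definitive implementation, and hypotenuse_scaled is a made-up name:

```r
hypotenuse_scaled <- function(x, y) {
  x <- abs(x)
  y <- abs(y)
  biggest <- pmax(x, y)
  smallest <- pmin(x, y)
  # Avoid dividing by zero when both sides are zero.
  ratio <- ifelse(biggest == 0, 0, smallest / biggest)
  # The ratio is at most 1, so squaring it cannot overflow.
  biggest * sqrt(1 + ratio ^ 2)
}

sqrt(1e200 ^ 2 + 1e200 ^ 2)      # the squares overflow
## [1] Inf
hypotenuse_scaled(1e200, 1e200)  # finite: 1e200 times the square root of 2
hypotenuse_scaled(3, 4)
## [1] 5
```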

When we call a function, if we don’t name the arguments, then R will match them based on position. In the case of hypotenuse(3, 4), 3 comes first so it is mapped to x, and 4 comes second so it is mapped to y.

If we want to change the order that we pass the arguments, or omit some of them, then we can pass named arguments. In the case of hypotenuse(y = 24, x = 7), although we pass the variables in the “wrong” order, R still correctly determines which variable should be mapped to x, and which to y.
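
Named and positional arguments can also be mixed in one call. R matches the named arguments first, then fills the remaining formal arguments by position. In this sketch, report_args is a made-up helper that simply shows what each formal argument received:

```r
hypotenuse <- function(x, y) sqrt(x ^ 2 + y ^ 2)

# y is matched by name; the unnamed value 7 then fills x.
hypotenuse(y = 24, 7)
## [1] 25

# A helper that reports which value went to which formal argument.
report_args <- function(x, y) c(x = x, y = y)
report_args(y = 24, 7)
##  x  y
##  7 24
```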

It doesn’t make much sense for a hypotenuse-calculating function, but if we wanted, we could provide default values for x and y. In this new version, if we don’t pass anything to the function, x takes the value 5 and y takes the value 12:

hypotenuse <- function(x = 5, y = 12)
{
  sqrt(x ^ 2 + y ^ 2)
}
hypotenuse() #equivalent to hypotenuse(5, 12)
## [1] 13

We’ve already seen the formals function for retrieving the arguments of a function as a (pair)list. The args function does the same thing in a more human-readable, but less programming-friendly, way. formalArgs returns a character vector of the names of the arguments:

formals(hypotenuse)
## $x
## [1] 5
##
## $y
## [1] 12
args(hypotenuse)
## function (x = 5, y = 12)
## NULL
formalArgs(hypotenuse)
## [1] "x" "y"

The body of a function is retrieved using the body function. This isn’t often very useful on its own, but we may sometimes want to examine it as text—to find functions that call another function, for example. We can use deparse to achieve this:

(body_of_hypotenuse <- body(hypotenuse))
## {
##     sqrt(x^2 + y^2)
## }
deparse(body_of_hypotenuse)
## [1] "{"                   "    sqrt(x^2 + y^2)" "}"
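
As a rough sketch of that idea, the deparsed text can be searched for a function name. The helper mentions is made up for illustration, and a plain text search like this is crude, since it would also match the name inside longer names or strings. The base function all.names gives a stricter check by extracting the symbols from the body:

```r
hypotenuse <- function(x, y) sqrt(x ^ 2 + y ^ 2)

# Crude text search: does any line of the deparsed body contain the name?
mentions <- function(fn, name) {
  any(grepl(name, deparse(body(fn)), fixed = TRUE))
}
mentions(hypotenuse, "sqrt")
## [1] TRUE
mentions(hypotenuse, "rnorm")
## [1] FALSE

# Stricter: compare against the symbols that appear in the body.
"sqrt" %in% all.names(body(hypotenuse))
## [1] TRUE
```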

The default values given to formal arguments of functions can be more than just constant values—we can pass any R code into them, and even use other formal arguments. The following function, normalize, scales a vector. The arguments m and s are, by default, the mean and standard deviation of the first argument, so that the returned vector will have mean 0 and standard deviation 1:

normalize <- function(x, m = mean(x), s = sd(x))
{
  (x - m) / s
}
normalized <- normalize(c(1, 3, 6, 10, 15))
mean(normalized)        #almost 0!
## [1] -5.573e-18
sd(normalized)
## [1] 1
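
Because m and s are ordinary arguments, their defaults can be overridden in the usual way. This sketch repeats the definition so that it runs on its own; the reference values 5 and 1 are arbitrary:

```r
normalize <- function(x, m = mean(x), s = sd(x))
{
  (x - m) / s
}

# Center on a reference value of 5 and leave the spread unchanged.
normalize(c(1, 3, 6, 10, 15), m = 5, s = 1)
## [1] -4 -2  1  5 10
```

The default expressions are evaluated inside the function, and only when they are needed, which is why m = mean(x) is able to refer to x.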

There is a little problem with our normalize function, though, which we can see if some of the elements of x are missing:

normalize(c(1, 3, 6, 10, NA))
## [1] NA NA NA NA NA

If any elements of a vector are missing, then by default, mean and sd will both return NA. Consequently, our normalize function returns NA values everywhere. It might be preferable to have the option of only returning NA values where the input was NA. Both mean and sd have an argument, na.rm, that lets us remove missing values before any calculations occur. To avoid all the NA values, we could include such an argument in normalize:

normalize <- function(x, m = mean(x, na.rm = na.rm),
  s = sd(x, na.rm = na.rm), na.rm = FALSE)
{
  (x - m) / s
}
normalize(c(1, 3, 6, 10, NA))
## [1] NA NA NA NA NA
normalize(c(1, 3, 6, 10, NA), na.rm = TRUE)
## [1] -1.0215 -0.5108  0.2554  1.2769      NA

This works, but the syntax is a little clunky. To save us having to explicitly type the names of arguments that aren’t actually used by the function (na.rm is only being passed to mean and sd), R has a special argument, ..., that contains all the arguments that aren’t matched by position or name:

normalize <- function(x, m = mean(x, ...), s = sd(x, ...), ...)
{
  (x - m) / s
}
normalize(c(1, 3, 6, 10, NA))
## [1] NA NA NA NA NA
normalize(c(1, 3, 6, 10, NA), na.rm = TRUE)
## [1] -1.0215 -0.5108  0.2554  1.2769      NA

Now in the call normalize(c(1, 3, 6, 10, NA), na.rm = TRUE), the argument na.rm does not match any of the formal arguments of normalize, since it isn’t x or m or s. That means that it gets stored in the ... argument of normalize. When we evaluate m, the expression mean(x, ...) is now mean(x, na.rm = TRUE).

If this isn’t clear right now, don’t worry. How this works is an advanced topic, and most of the time we don’t need to worry about it. For now, you just need to know that ... can be used to pass arguments to subfunctions.
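
To peek at what ... contains, you can wrap it in list. The sketch below uses a made-up helper, show_dots, and also shows one caveat of passing ... to several subfunctions: every one of them must accept whatever is passed. The trim argument is accepted by mean but not by sd, so the second call fails:

```r
# list(...) captures whatever ended up in the dots.
show_dots <- function(x, ...) {
  names(list(...))
}
show_dots(1:3, na.rm = TRUE, trim = 0.1)
## [1] "na.rm" "trim"

normalize <- function(x, m = mean(x, ...), s = sd(x, ...), ...)
{
  (x - m) / s
}

# sd has no trim argument, so evaluating s throws an
# "unused argument" error; tryCatch captures the message.
result <- tryCatch(
  normalize(c(1, 3, 6, 10, 15), trim = 0.1),
  error = function(e) conditionMessage(e)
)
result
```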

Passing Functions to and from Other Functions

Functions can be used just like other variable types, so we can pass them as arguments to other functions, and return them from functions. One common example of a function that takes another function as an argument is do.call. This function provides an alternative syntax for calling other functions, letting us pass the arguments as a list, rather than one at a time:

do.call(hypotenuse, list(x = 3, y = 4)) #same as hypotenuse(3, 4)
## [1] 5
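
This is most useful when the arguments are only known at run time. As a brief sketch (the variable names sides and words are illustrative), do.call also accepts the function's name as a string, and extra named arguments can be appended to the list before the call:

```r
hypotenuse <- function(x, y) sqrt(x ^ 2 + y ^ 2)

# The function is given by name; the arguments come from a list.
sides <- list(y = 12, x = 5)
do.call("hypotenuse", sides)
## [1] 13

# Append a named argument to an existing list of inputs.
words <- list("a", "b", "c")
do.call(paste, c(words, sep = "-"))  #same as paste("a", "b", "c", sep = "-")
## [1] "a-b-c"
```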

Perhaps the most common use case for do.call is with rbind. You can use these two functions together to concatenate several data frames or matrices at once:

dfr1 <- data.frame(x = 1:5, y = rt(5, 1))
dfr2 <- data.frame(x = 6:10, y = rf(5, 1, 1))
dfr3 <- data.frame(x = 11:15, y = rbeta(5, 1, 1))
do.call(rbind, list(dfr1, dfr2, dfr3)) #same as rbind(dfr1, dfr2, dfr3)
##     x        y
## 1   1  1.10440
## 2   2  0.87931
## 3   3 -1.18288
## 4   4 -1.04847
## 5   5  0.90335
## 6   6  0.27186
## 7   7  2.49953
## 8   8  0.89534
## 9   9  4.21537
## 10 10  0.07751
## 11 11  0.31153
## 12 12  0.29114
## 13 13  0.01079
## 14 14  0.97188
## 15 15  0.53498

It is worth spending some time getting comfortable with this idea. In Chapter 9, we’re going to make a lot of use of passing functions to other functions with apply and its derivatives.

When using functions as arguments, it isn’t necessary to assign them first. In the same way that we could simplify this:

menage <- c(1, 0, 0, 1, 2, 13, 80) #See http://oeis.org/A000179
mean(menage)
## [1] 13.86

to:

mean(c(1, 0, 0, 1, 2, 13, 80))
## [1] 13.86

we can also pass functions anonymously:

x_plus_y <- function(x, y) x + y
do.call(x_plus_y, list(1:5, 5:1))
## [1] 6 6 6 6 6
#is the same as
do.call(function(x, y) x + y, list(1:5, 5:1))
## [1] 6 6 6 6 6

Functions that return functions are rarer, but no less valid for it. The ecdf function returns the empirical cumulative distribution function of a vector, as seen in Figure 6-1:

(emp_cum_dist_fn <- ecdf(rnorm(50)))
## Empirical CDF
## Call: ecdf(rnorm(50))
##  x[1:50] = -2.2, -2.1,  -2,  ..., 1.9, 2.6
is.function(emp_cum_dist_fn)
## [1] TRUE
plot(emp_cum_dist_fn)

Figure 6-1. An empirical cumulative distribution function
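
Writing your own function that returns a function takes nothing more than making a function the last value evaluated. In this sketch, make_power is a made-up example; the call to force is a precaution that evaluates the argument immediately rather than when the returned function is first used:

```r
make_power <- function(exponent)
{
  force(exponent)  # evaluate the argument now, not on first use
  function(x) x ^ exponent
}

square <- make_power(2)
cube <- make_power(3)
square(4)
## [1] 16
cube(2)
## [1] 8
```

Each returned function remembers the value of exponent from the call that created it.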

Variable Scope

A variable’s scope is the set of places from which you can see the variable. For example, when you define a variable inside a function, the rest of the statements in that function will have access to that variable. In R (but not S), subfunctions will also have access to that variable. In this next example, the function f takes a variable x and passes it to the function g. f also defines a variable y, which is within the scope of g, since g is a subfunction of f. So, even though y isn’t defined inside g, the example works:

f <- function(x)
{
  y <- 1
  g <- function(x)
  {
    (x + y) / 2 #y is used, but is not a formal argument of g
  }
  g(x)
}
f(sqrt(5))      #It works! y is magically found in the environment of f
## [1] 1.618

If we modify the example to define g outside of f, so it is not a subfunction of f, the example will throw an error, since R cannot find y:

f <- function(x)
{
  y <- 1
  g(x)
}
g <- function(x)
{
  (x + y) / 2
}
f(sqrt(5))
## Error in g(x) : object 'y' not found

In the section Environments, we saw that the get and exists functions look for variables in parent environments as well as the current one. Variable scope works in exactly the same way: R will try to find variables in the current environment, and if it doesn’t find them it will look in the parent environment, and then that environment’s parent, and so on until it reaches the global environment (and, beyond that, the environments of the loaded packages on the search path). Variables defined in the global environment can be seen from anywhere, which is why they are called global variables.

In our first example, the environment belonging to f is the parent environment of the environment belonging to g, which is why y can be found. In the second example, the parent environment of g is the global environment, which doesn’t contain a variable y, which is why an error is thrown.
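
You can confirm which environment a function belongs to by passing the function to environment. This sketch uses made-up function names; what matters is where a function was created, not where it is called from:

```r
inner_check <- function()
{
  g <- function() NULL
  # g was created during this call, so its environment
  # is the environment of this call.
  identical(environment(g), environment())
}
inner_check()
## [1] TRUE

outer_g <- function() NULL
outer_check <- function()
{
  # outer_g was created outside this function, so its
  # environment is not the environment of this call.
  identical(environment(outer_g), environment())
}
outer_check()
## [1] FALSE
```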

This system of scoping where variables can be found in parent environments is often useful, but also brings the potential for mischief and awful, unmaintainable code. Consider the following function, h:

h <- function(x)
{
  x * y
}

It looks like it shouldn’t work, since it accepts a single argument, x, but uses two arguments, x and y, in its body. Let’s try it, with a clean user workspace:

h(9)
## Error in h(9) : object 'y' not found

So far, our intuition holds. y is not defined, so the function throws an error. Now look at what happens if we define y in the user workspace:

y <- 16
h(9)
## [1] 144

When R fails to find a variable named y in the environment belonging to h, it looks in h’s parent—the user workspace (a.k.a. global environment), where y is defined—and the product is correctly calculated.

Global variables should be used sparingly, since they make it very easy to write appalling code. In this modified function, h2, a local y is defined at random, about half the time. With y also defined in the user workspace, each time we evaluate the function it will use either the local or the global y, at random!

h2 <- function(x)
{
  if(runif(1) > 0.5) y <- 12
  x * y
}

Let’s use replicate to run the code several times to see the result:

replicate(10, h2(9))
##  [1] 144 144 144 108 144 108 108 144 108 108

When the uniform random number (between 0 and 1) generated by runif is greater than 0.5, a local variable y is assigned the value 12. Otherwise, the global value of 16 is used.

As I’m sure you’ve noticed, it is very easy to create obscure bugs in code by doing things like this. Usually it is better to explicitly pass all the variables that we need into a function.
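
As a minimal sketch of that advice, here is a version of the function (with the made-up name h3) that takes y as an argument, so its result depends only on what is passed in:

```r
h3 <- function(x, y)
{
  x * y
}
y <- 16
h3(9, y)   #the global value is passed explicitly
## [1] 144
h3(9, 12)
## [1] 108
```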

Summary

  • Environments store variables and can be created with new.env.
  • You can treat environments like lists most of the time.
  • All environments have a parent environment (except the empty environment at the top).
  • Functions consist of formal arguments and a body.
  • You can assign and use functions just as you would any other variable type.
  • R will look for variables in the current environment and its parents.

Test Your Knowledge: Quiz

Question 6-1
What is another name for the global environment?
Question 6-2
How would you convert a list to an environment?
Question 6-3
How do you print the contents of a function to the console?
Question 6-4
Name three functions that tell you the names of the formal arguments of a function.
Question 6-5
What does the do.call function do?

Test Your Knowledge: Exercises

Exercise 6-1

Create a new environment named multiples_of_pi. Assign these variables into the environment:

  1. two_pi, with the value 2 * π, using double square brackets
  2. three_pi, with the value 3 * π, using the dollar sign operator
  3. four_pi, with the value 4 * π, using the assign function

    List the contents of the environment, along with their values. [10]

Exercise 6-2
Write a function that accepts a vector of integers (for simplicity, you don’t have to worry about input checking) and returns a logical vector that is TRUE whenever the input is even, FALSE whenever the input is odd, and NA whenever the input is nonfinite (nonfinite means anything that will make is.finite return FALSE: Inf, -Inf, NA, and NaN). Check that the function works with positive, negative, zero, and nonfinite inputs. [10]
Exercise 6-3
Write a function that accepts a function as an input and returns a list with two elements: an element named args that contains a pairlist of the input’s formal arguments, and an element named body that contains the input’s body. Test it by calling the function with a variety of inputs. [10]


[20] If the definition is a single line that says something like UseMethod("my_function") or standardGeneric("my_function"), see Object-Oriented Programming in Chapter 16. If R complains that the object is not found, try getAnywhere(my_function).
