We’ve already used a variety of the functions that come with R. In this chapter, you’ll learn what a function is, and how to write your own. Before that, we’ll take a look at environments, which are used to store variables.
After reading this chapter, you should:
All the variables that we create need to be stored somewhere, and that somewhere is an environment. Environments themselves are just another type of variable—we can assign them, manipulate them, and pass them into functions as arguments, just like we would any other variable. They are closely related to lists in that they are used for storing different types of variables together. In fact, most of the syntax for lists also works for environments, and we can coerce a list to be an environment (and vice versa).
Usually, you won’t need to explicitly deal with environments. For example, when you assign a variable at the command prompt, it will automatically go into an environment called the global environment (also known as the user workspace). When you call a function, an environment is automatically created to store the function-related variables. Understanding the basics of environments can be useful, however, in understanding the scope of variables, and for examining the call stack when debugging your code.
Slightly annoyingly, environments aren’t created with the environment function (that function returns the environment that contains a particular function). Instead, what we want is new.env:
an_environment <- new.env()
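The difference between the two functions is easy to check at the prompt. The following is a minimal sketch rather than anything from the original text; scratch is a hypothetical variable name chosen for illustration.

```r
# environment() reports where an existing function lives;
# it does not create anything new.
environmentName(environment(mean))  # mean lives in the base namespace

# new.env() creates a fresh environment with nothing in it.
scratch <- new.env()                # 'scratch' is a hypothetical name
is.environment(scratch)
length(ls(scratch))                 # no variables stored yet
```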
Assigning variables into environments works in exactly the same way as with lists. You can either use double square brackets or the dollar sign operator. As with lists, the variables can be of different types and sizes:
an_environment[["pythag"]] <- c(12, 15, 20, 21) #See http://oeis.org/A156683
an_environment$root <- polyroot(c(6, -5, 1))
The assign function that we saw in Assigning Variables takes an optional environment argument that can be used to specify where the variable is stored:
assign("moonday", weekdays(as.Date("1969/07/20")), an_environment)
Retrieving the variables works in the same way: you can either use list-indexing syntax, or assign’s opposite, the get function:
an_environment[["pythag"]]
## [1] 12 15 20 21
an_environment$root
## [1] 2+0i 3-0i
get("moonday", an_environment)
## [1] "Sunday"
The ls and ls.str functions also take an environment argument, allowing you to list the contents of that environment:
ls(envir = an_environment)
## [1] "moonday" "pythag" "root"
ls.str(envir = an_environment)
## moonday : chr "Sunday"
## pythag : num [1:4] 12 15 20 21
## root : cplx [1:2] 2+0i 3-0i
We can test to see if a variable exists in an environment using the exists function:
exists("pythag", an_environment)
## [1] TRUE
Conversion from environment to list and back again uses the obvious functions, as.list and as.environment. In the latter case, there is also a function list2env that allows for a little more flexibility in the creation of the environment:
#Convert to list
(a_list <- as.list(an_environment))
## $pythag
## [1] 12 15 20 21
##
## $moonday
## [1] "Sunday"
##
## $root
## [1] 2+0i 3-0i
#...and back again. Both lines of code do the same thing.
as.environment(a_list)
## <environment: 0x000000004a6fe290>
list2env(a_list)
## <environment: 0x000000004ad10288>
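As one illustration of that extra flexibility, list2env can copy the elements of a list into an environment that already exists, instead of always creating a new one. This is only a sketch; the names target, a, b, and c are hypothetical.

```r
target <- new.env()          # hypothetical environment that already holds data
assign("a", 1, target)

# envir = says where the list elements should go;
# the same environment is returned (invisibly useful for chaining)
list2env(list(b = 2, c = "three"), envir = target)

ls(target)                   # now contains a, b, and c
```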
All environments are nested, meaning that they must have a parent environment (the exception is a special environment called the empty environment that sits at the top of the chain). By default, the exists and get functions will also look for variables in the parent environments. Pass inherits = FALSE to them to change this behavior so that they will only look in the environment that you’ve specified:
nested_environment <- new.env(parent = an_environment)
exists("pythag", nested_environment)
## [1] TRUE
exists("pythag", nested_environment, inherits = FALSE)
## [1] FALSE
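The nesting can be inspected directly with parent.env, which returns the parent of an environment. The sketch below, using the hypothetical names outer_env and inner_env, walks up the chain until it reaches the empty environment.

```r
outer_env <- new.env()
inner_env <- new.env(parent = outer_env)

identical(parent.env(inner_env), outer_env)  # the parent we asked for

# Walk up the chain, counting steps until the empty environment is reached
steps <- 0
e <- inner_env
while (!identical(e, emptyenv())) {
  e <- parent.env(e)
  steps <- steps + 1
}
steps  # depends on how many packages are loaded, so no fixed value is shown
```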
The word “frame” is used almost interchangeably with “environment.” (See section 2.1.10 of the R Language Definition manual that ships with R for the technicalities.) This means that some functions that work with environments have “frame” in their name, parent.frame being the most common of these.
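To give a feel for parent.frame, here is a small sketch in which one function asks where it was called from. The function names who_called_me and caller are hypothetical, and the example is only meant to show the idea.

```r
who_called_me <- function() {
  parent.frame()            # the environment this function was called from
}

caller <- function() {
  my_env <- environment()   # with no arguments, environment() gives the current one
  seen <- who_called_me()
  identical(seen, my_env)   # did who_called_me see caller's environment?
}

caller()
```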
Shortcut functions are available to access both the global environment (where variables that you assign from the command prompt are stored) and the base environment (this contains functions and other variables from R’s base package, which provides basic functionality):
non_stormers <<- c(3, 7, 8, 13, 17, 18, 21) #See http://oeis.org/A002312
get("non_stormers", envir = globalenv())
## [1] 3 7 8 13 17 18 21
head(ls(envir = baseenv()), 20)
## [1] "-" "-.Date" "-.POSIXt"
## [4] "!" "!.hexmode" "!.octmode"
## [7] "!=" "$" "$.data.frame"
## [10] "$.DLLInfo" "$.package_version" "$<-"
## [13] "$<-.data.frame" "%%" "%*%"
## [16] "%/%" "%in%" "%o%"
## [19] "%x%" "&"
There are two other situations where we might encounter environments. First, whenever a function is called, all the variables defined by the function are stored in an environment belonging to that function (a function plus its environment is sometimes called a closure). Second, whenever we load a package, the functions in that package are stored in an environment on the search path. This will be discussed in Chapter 10.
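A short sketch may make the idea of a closure more concrete. The function make_counter below is hypothetical: it returns a function that remembers a variable, count, stored in the environment where that function was created.

```r
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # updates count in the enclosing environment
    count
  }
}

counter <- make_counter()
counter()
counter()

# The remembered value lives in the environment attached to the function
environment(counter)$count
```

Each call to make_counter creates a fresh environment, so two counters made this way keep separate counts.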
While most variable types are for storing data, functions let us do things with data—they are “verbs” rather than “nouns.” Like environments, they are just another data type that we can assign and manipulate and even pass into other functions.
In order to understand functions better, let’s take a look at what they consist of.
Typing the name of a function shows you the code that runs when you call it. This is the rt function, which generates random numbers from a t-distribution:[20]
rt
## function (n, df, ncp)
## {
##     if (missing(ncp))
##         .External(C_rt, n, df)
##     else rnorm(n, ncp)/sqrt(rchisq(n, df)/df)
## }
## <bytecode: 0x0000000019738e10>
## <environment: namespace:stats>
As you can see, rt takes up to three input arguments: n is the number of random numbers to generate, df is the number of degrees of freedom, and ncp is an optional noncentrality parameter. To be technical, the three arguments n, df, and ncp are the formal arguments of rt. When you are calling the function and passing values to it, those values are just called arguments.
The difference between arguments and formal arguments isn’t very important, so the rest of the book doesn’t make an effort to differentiate between the two concepts.
In between the curly braces, you can see the lines of code that constitute the body of the function. This is the code that is executed each time you call rt.
Notice that there is no explicit “return” keyword to state which value should be returned from the function. In R, the last value that is calculated in the function is automatically returned. In the case of rt, if the ncp argument is omitted, some C code is called to generate the random numbers, and those are returned. Otherwise, the function calls the rnorm, rchisq, and sqrt functions to generate the numbers, and those are returned.
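R does also provide a return function for leaving a function early, although it is not needed for the final value. The hypothetical function sign_label below sketches both styles side by side.

```r
sign_label <- function(x) {
  if (x < 0) {
    return("negative")   # explicit early exit
  }
  "non-negative"         # last evaluated value, returned implicitly
}

sign_label(-1)
sign_label(2)
```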
To create our own functions, we just assign them as we would any other variable. As an example, let’s create a function to calculate the length of the hypotenuse of a right-angled triangle (for simplicity, we’ll use the obvious algorithm; for real-world code, this doesn’t work well with very big and very small numbers, so you shouldn’t calculate hypotenuses this way):
hypotenuse <- function(x, y)
{
  sqrt(x ^ 2 + y ^ 2)
}
Here, hypotenuse is the name of the function we are creating, x and y are its (formal) arguments, and the contents of the braces are the function body.
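To see why the obvious algorithm struggles with very big numbers, compare it with a rescaled version. This is one possible sketch, not the approach the text recommends; naive_hypotenuse and scaled_hypotenuse are hypothetical names, and the rescaling trick (factoring out the larger side) is an assumption about what a more robust calculation might look like.

```r
naive_hypotenuse <- function(x, y) sqrt(x ^ 2 + y ^ 2)

scaled_hypotenuse <- function(x, y) {
  x <- abs(x)
  y <- abs(y)
  big <- pmax(x, y)
  small <- pmin(x, y)
  # Factor out the larger side so the squared ratio never exceeds 1
  ifelse(big == 0, 0, big * sqrt(1 + (small / big) ^ 2))
}

naive_hypotenuse(3e200, 4e200)   # squaring overflows, so the result is Inf
scaled_hypotenuse(3e200, 4e200)  # approximately 5e200
```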
Actually, since our function body is only one line of code, we can omit the braces:
hypotenuse <- function(x, y) sqrt(x ^ 2 + y ^ 2) #same as before
R is very permissive about how you space your code, so “one line of code” can be stretched to run over several lines. The amount of code that can be included without braces is one statement. The exact definition of a statement is technical, but from a practical point of view, it is the amount of code that you can type at the command line before it executes.
We can now call this function as we would any other:
hypotenuse(3, 4)
## [1] 5
hypotenuse(y = 24, x = 7)
## [1] 25
When we call a function, if we don’t name the arguments, then R will match them based on position. In the case of hypotenuse(3, 4), 3 comes first so it is mapped to x, and 4 comes second so it is mapped to y.
If we want to change the order that we pass the arguments, or omit some of them, then we can pass named arguments. In the case of hypotenuse(y = 24, x = 7), although we pass the variables in the “wrong” order, R still correctly determines which variable should be mapped to x, and which to y.
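Because the hypotenuse is the same whichever way round the sides are given, the effect of matching is easier to see with a function whose arguments are not interchangeable. The function power below is a hypothetical example used only for this sketch.

```r
power <- function(base, exponent) base ^ exponent  # hypothetical function

power(2, 3)                    # matched by position: base = 2, exponent = 3
power(exponent = 3, base = 2)  # matched by name, so the order is irrelevant
power(3, base = 2)             # base is matched by name; 3 fills the remaining slot
power(3, 2)                    # swapping positions changes the answer
```

Named arguments are matched first, and any unnamed values are then assigned to the remaining formal arguments in order.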
It doesn’t make much sense for a hypotenuse-calculating function, but if we wanted, we could provide default values for x and y. In this new version, if we don’t pass anything to the function, x takes the value 5 and y takes the value 12:
hypotenuse <- function(x = 5, y = 12)
{
  sqrt(x ^ 2 + y ^ 2)
}
hypotenuse() #equivalent to hypotenuse(5, 12)
## [1] 13
We’ve already seen the formals function for retrieving the arguments of a function as a (pair)list. The args function does the same thing in a more human-readable, but less programming-friendly, way. formalArgs returns a character vector of the names of the arguments:
formals(hypotenuse)
## $x
## [1] 5
##
## $y
## [1] 12
args(hypotenuse)
## function (x = 5, y = 12)
## NULL
formalArgs(hypotenuse)
## [1] "x" "y"
The body of a function is retrieved using the body function. This isn’t often very useful on its own, but we may sometimes want to examine it as text, for example to find functions that call another function. We can use deparse to achieve this:
(body_of_hypotenuse <- body(hypotenuse))
## {
##     sqrt(x^2 + y^2)
## }
deparse(body_of_hypotenuse)
## [1] "{" " sqrt(x^2 + y^2)" "}"
The default values given to formal arguments of functions can be more than just constant values: we can pass any R code into them, and even use other formal arguments. The following function, normalize, scales a vector. The arguments m and s are, by default, the mean and standard deviation of the first argument, so that the returned vector will have mean 0 and standard deviation 1:
normalize <- function(x, m = mean(x), s = sd(x))
{
  (x - m) / s
}
normalized <- normalize(c(1, 3, 6, 10, 15))
mean(normalized) #almost 0!
## [1] -5.573e-18
sd(normalized)
## [1] 1
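Since m and s are ordinary arguments, the computed defaults are only used when the caller leaves them out. The sketch below repeats the definition of normalize from the text so that it can be run on its own; values is a hypothetical variable name.

```r
# Same definition as in the text, repeated so the sketch is self-contained
normalize <- function(x, m = mean(x), s = sd(x)) {
  (x - m) / s
}

values <- c(1, 3, 6, 10, 15)    # hypothetical input
normalize(values, m = 0, s = 1) # both defaults overridden: input comes back unchanged
normalize(values, s = 1)        # only s overridden: values are centered, not rescaled
```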
There is a little problem with our normalize function, though, which we can see if some of the elements of x are missing:
normalize(c(1, 3, 6, 10, NA))
## [1] NA NA NA NA NA
If any elements of a vector are missing, then by default, mean and sd will both return NA. Consequently, our normalize function returns NA values everywhere. It might be preferable to have the option of only returning NA values where the input was NA. Both mean and sd have an argument, na.rm, that lets us remove missing values before any calculations occur. To avoid all the NA values, we could include such an argument in normalize:
normalize <- function(x, m = mean(x, na.rm = na.rm), s = sd(x, na.rm = na.rm), na.rm = FALSE)
{
  (x - m) / s
}
normalize(c(1, 3, 6, 10, NA))
## [1] NA NA NA NA NA
normalize(c(1, 3, 6, 10, NA), na.rm = TRUE)
## [1] -1.0215 -0.5108 0.2554 1.2769 NA
This works, but the syntax is a little clunky. To save us having to explicitly type the names of arguments that aren’t actually used by the function (na.rm is only being passed to mean and sd), R has a special argument, ..., that contains all the arguments that aren’t matched by position or name:
normalize <- function(x, m = mean(x, ...), s = sd(x, ...), ...)
{
  (x - m) / s
}
normalize(c(1, 3, 6, 10, NA))
## [1] NA NA NA NA NA
normalize(c(1, 3, 6, 10, NA), na.rm = TRUE)
## [1] -1.0215 -0.5108 0.2554 1.2769 NA
Now in the call normalize(c(1, 3, 6, 10, NA), na.rm = TRUE), the argument na.rm does not match any of the formal arguments of normalize, since it isn’t x or m or s. That means that it gets stored in the ... argument of normalize. When we evaluate m, the expression mean(x, ...) is now mean(x, na.rm = TRUE).
If this isn’t clear right now, don’t worry. How this works is an advanced topic, and most of the time we don’t need to worry about it. For now, you just need to know that ... can be used to pass arguments to subfunctions.
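The same forwarding pattern can be sketched outside of default arguments. The functions center_and_spread and show_dots below are hypothetical; the second one uses list(...) to capture the extra arguments so that they can be inspected.

```r
center_and_spread <- function(x, ...) {
  # Whatever the caller supplies beyond x is forwarded unchanged
  c(center = mean(x, ...), spread = sd(x, ...))
}

center_and_spread(c(1, 3, 6, 10, NA))                # both results are NA
center_and_spread(c(1, 3, 6, 10, NA), na.rm = TRUE)  # NA removed before calculating

# list(...) turns the extra arguments into an ordinary list
show_dots <- function(...) names(list(...))
show_dots(na.rm = TRUE, trim = 0.1)
```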
Functions can be used just like other variable types, so we can pass them as arguments to other functions, and return them from functions. One common example of a function that takes another function as an argument is do.call. This function provides an alternative syntax for calling other functions, letting us pass the arguments as a list, rather than one at a time:
do.call(hypotenuse, list(x = 3, y = 4)) #same as hypotenuse(3, 4)
## [1] 5
Perhaps the most common use case for do.call is with rbind. You can use these two functions together to concatenate several data frames or matrices at once:
dfr1 <- data.frame(x = 1:5, y = rt(5, 1))
dfr2 <- data.frame(x = 6:10, y = rf(5, 1, 1))
dfr3 <- data.frame(x = 11:15, y = rbeta(5, 1, 1))
do.call(rbind, list(dfr1, dfr2, dfr3)) #same as rbind(dfr1, dfr2, dfr3)
##     x        y
## 1   1  1.10440
## 2   2  0.87931
## 3   3 -1.18288
## 4   4 -1.04847
## 5   5  0.90335
## 6   6  0.27186
## 7   7  2.49953
## 8   8  0.89534
## 9   9  4.21537
## 10 10  0.07751
## 11 11  0.31153
## 12 12  0.29114
## 13 13  0.01079
## 14 14  0.97188
## 15 15  0.53498
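This combination is most useful when the number of pieces isn’t known in advance, because the list can be built up programmatically. The sketch below uses hypothetical names (pieces, combined) and deterministic values so that the result is predictable.

```r
# Build a list of small data frames in a loop
pieces <- list()
for (i in 1:3) {
  pieces[[i]] <- data.frame(id = i, value = i * 10)
}

# One call binds them all, however many there are
combined <- do.call(rbind, pieces)
combined

# do.call also accepts the name of the function as a string
identical(do.call("rbind", pieces), combined)
```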
It is worth spending some time getting comfortable with this idea. In Chapter 9, we’re going to make a lot of use of passing functions to other functions with apply and its derivatives.
When using functions as arguments, it isn’t necessary to assign them first. In the same way that we could simplify this:
menage <- c(1, 0, 0, 1, 2, 13, 80) #See http://oeis.org/A000179
mean(menage)
## [1] 13.86
to:
mean(c(1, 0, 0, 1, 2, 13, 80))
## [1] 13.86
we can also pass functions anonymously:
x_plus_y <- function(x, y) x + y
do.call(x_plus_y, list(1:5, 5:1))
## [1] 6 6 6 6 6
#is the same as
do.call(function(x, y) x + y, list(1:5, 5:1))
## [1] 6 6 6 6 6
Functions that return functions are rarer, but no less valid for it. The ecdf function returns the empirical cumulative distribution function of a vector, as seen in Figure 6-1:
(emp_cum_dist_fn <- ecdf(rnorm(50)))
## Empirical CDF
## Call: ecdf(rnorm(50))
##  x[1:50] = -2.2, -2.1, -2, ..., 1.9, 2.6
is.function(emp_cum_dist_fn)
## [1] TRUE
plot(emp_cum_dist_fn)
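Since the value returned by ecdf is itself a function, it can be called like any other. This sketch uses a small fixed vector instead of random numbers so that the answers are easy to verify by hand; step_fn is a hypothetical name.

```r
step_fn <- ecdf(c(1, 2, 2, 4))

step_fn(2)        # proportion of the values that are <= 2, i.e. 3 out of 4
step_fn(0)        # nothing is <= 0
step_fn(c(1, 4))  # the returned function is vectorized
```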
A variable’s scope is the set of places from which you can see the variable. For example, when you define a variable inside a function, the rest of the statements in that function will have access to that variable. In R (but not S), subfunctions will also have access to that variable. In this next example, the function f takes a variable x and passes it to the function g. f also defines a variable y, which is within the scope of g, since g is a subfunction of f. So, even though y isn’t defined inside g, the example works:
f <- function(x)
{
  y <- 1
  g <- function(x)
  {
    (x + y) / 2 #y is used, but is not a formal argument of g
  }
  g(x)
}
f(sqrt(5)) #It works! y is magically found in the environment of f
## [1] 1.618
If we modify the example to define g outside of f, so it is not a subfunction of f, the example will throw an error, since R cannot find y:
f <- function(x)
{
  y <- 1
  g(x)
}
g <- function(x)
{
  (x + y) / 2
}
f(sqrt(5))
## Error in g(x) : object 'y' not found
In the section Environments, we saw that the get and exists functions look for variables in parent environments as well as the current one. Variable scope works in exactly the same way: R will try to find variables in the current environment, and if it doesn’t find them it will look in the parent environment, and then that environment’s parent, and so on up the chain, through the global environment and then the packages on the search path. Variables defined in the global environment can be seen from anywhere, which is why they are called global variables.
In our first example, the environment belonging to f is the parent environment of the environment belonging to g, which is why y can be found. In the second example, the parent environment of g is the global environment, which doesn’t contain a variable y, which is why an error is thrown.
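The rule at work is that a function looks for variables where it was defined, not where it was called from. The following sketch wraps everything in a hypothetical function, scoping_demo, so that it does not depend on what is in your workspace.

```r
scoping_demo <- function() {
  y <- 100
  g <- function(x) (x + y) / 2   # g is defined here, so it sees this y
  f <- function(x) {
    y <- 1                       # local to f; g cannot see it
    g(x)
  }
  f(3)
}

scoping_demo()   # (3 + 100) / 2, not (3 + 1) / 2
```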
This system of scoping where variables can be found in parent environments is often useful, but also brings the potential for mischief and awful, unmaintainable code. Consider the following function, h:
h <- function(x)
{
  x * y
}
It looks like it shouldn’t work, since it accepts a single argument, x, but uses two variables, x and y, in its body. Let’s try it, with a clean user workspace:
h(9)
## Error in h(9) : object 'y' not found
So far, our intuition holds. y is not defined, so the function throws an error. Now look at what happens if we define y in the user workspace:
y <- 16
h(9)
## [1] 144
When R fails to find a variable named y in the environment belonging to h, it looks in h’s parent, the user workspace (a.k.a. global environment), where y is defined, and the product is correctly calculated.
Global variables should be used sparingly, since they make it very easy to write appalling code. In this modified function, h2, y is randomly locally defined half the time. With y also defined in the user workspace, each time we call h2 the y that it uses will be randomly local or global!
h2 <- function(x)
{
  if(runif(1) > 0.5) y <- 12
  x * y
}
Let’s use replicate to run the code several times to see the result:
replicate(10, h2(9))
## [1] 144 144 144 108 144 108 108 144 108 108
When the uniform random number (between 0 and 1) generated by runif is greater than 0.5, a local variable y is assigned the value 12, and the result is 9 * 12 = 108. Otherwise, the global value of 16 is used, and the result is 9 * 16 = 144.
As I’m sure you’ve noticed, it is very easy to create obscure bugs in code by doing things like this. Usually it is better to explicitly pass all the variables that we need into a function.
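A sketch of the safer alternative: make y a formal argument, so that the result depends only on what is passed in. The name h3 is hypothetical.

```r
h3 <- function(x, y) {
  x * y          # both values come from the arguments, never the workspace
}

h3(9, 16)
h3(9, 12)
```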
new.env
.
What does the do.call function do?
Create a new environment named multiples_of_pi. Assign these variables into the environment:
two_pi, with the value 2 * π, using double square brackets
three_pi, with the value 3 * π, using the dollar sign operator
four_pi, with the value 4 * π, using the assign function
List the contents of the environment, along with their values. [10]
TRUE whenever the input is even, FALSE whenever the input is odd, and NA whenever the input is nonfinite (nonfinite means anything that will make is.finite return FALSE: Inf, -Inf, NA, and NaN). Check that the function works with positive, negative, zero, and nonfinite inputs. [10]
args that contains a pairlist of the input’s formal arguments, and an element named body that contains the input’s body. Test it by calling the function with a variety of inputs. [10]
[20] If the definition is a single line that says something like UseMethod("my_function") or standardGeneric("my_function"), see Object-Oriented Programming in Chapter 16. If R complains that the object is not found, try getAnywhere(my_function).