R is a powerful tool for all manner of calculations, data manipulation and scientific computations. Before getting to the complex operations possible in R we must start with the basics. Like most languages, R has its share of mathematical capability, variables, functions and data types.
Being a statistical programming language, R can certainly be used to do basic math, and that is where we will start.
We begin with the “Hello, World!” of basic math: 1 + 1. In the console there is a right angle bracket (>) where code should be entered. Simply test R by running
> 1 + 1
[1] 2
If this returns 2, then everything is great; if not, then something is very, very wrong. Assuming it worked, let’s look at some slightly more complicated expressions:
> 1 + 2 + 3
[1] 6
> 3 * 7 * 2
[1] 42
> 4/2
[1] 2
> 4/3
[1] 1.333
These follow the basic order of operations: Parentheses, Exponents, Multiplication, Division, Addition and Subtraction (PEMDAS). This means operations inside parentheses take priority over other operations. Next on the priority list is exponentiation. After that, multiplication and division are performed (at equal priority, from left to right), followed by addition and subtraction.
This is why the first two lines in the following code have the same result while the third is different.
> 4 * 6 + 5
[1] 29
> (4 * 6) + 5
[1] 29
> 4 * (6 + 5)
[1] 44
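As a further illustration of these precedence rules, here is a small sketch whose expressions and variable names are our own and do not come from the text above. It shows that exponentiation is evaluated before multiplication, and also before a leading minus sign, so parentheses are needed to square a negative number.

```r
# exponentiation first, then multiplication, then addition:
# 4^2 = 16, 3 * 16 = 48, 2 + 48 = 50
combined <- 2 + 3 * 4^2

# the leading minus is applied after ^, so this is -(2^2)
negated_square <- -2^2

# parentheses force the negation to happen first
squared_negative <- (-2)^2

print(c(combined, negated_square, squared_negative))
```

When in doubt about how an expression will be grouped, adding parentheses costs nothing and makes the intent explicit.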
So far we have put white space in between each operator such as * and /. This is not necessary but is encouraged as good coding practice.
Variables are an integral part of any programming language and R offers a great deal of flexibility. Unlike statically typed languages such as C++, R does not require variable types to be declared. A variable can take on any available data type as described in Section 4.3. It can also hold any R object such as a function, the result of an analysis or a plot. A single variable can at one point hold a number, then later hold a character and then later a number again.
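The flexibility described above can be sketched as follows. The variable names are hypothetical, and the base R function class is used only to report what kind of object the variable holds at each step.

```r
# the same name can be rebound to values of different types
value <- 10                      # holds a number
first_class <- class(value)

value <- "ten"                   # now holds a character
second_class <- class(value)

value <- sqrt                    # now holds a function
third_class <- class(value)

print(c(first_class, second_class, third_class))
```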
There are a number of ways to assign a value to a variable, and again, this does not depend on the type of value being assigned.
The valid assignment operators are <- and = with the first being preferred.
For example, let’s save 2 to the variable x and 5 to the variable y.
> x <- 2
> x
[1] 2
> y = 5
> y
[1] 5
The arrow operator can also point in the other direction.
> 3 -> z
> z
[1] 3
The assignment operation can be used successively to assign a value to multiple variables simultaneously.
> a <- b <- 7
> a
[1] 7
> b
[1] 7
A more laborious, though sometimes necessary, way to assign variables is to use the assign function.
> assign("j", 4)
> j
[1] 4
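One situation where assign can be necessary is when the variable name is only known at run time. The following is a minimal sketch under that assumption. The names score1 to score3 are hypothetical, and paste0 and get are base R functions not discussed above.

```r
# build variable names at run time and assign a value to each
for (index in 1:3) {
    assign(paste0("score", index), index * 10)
}

# get() retrieves a variable by its name given as a character
retrieved <- get("score2")
print(retrieved)
```

For most everyday work the arrow operator is simpler. This pattern is mainly useful inside loops or functions that generate names.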
Variable names can contain any combination of alphanumeric characters along with periods (.) and underscores (_). However, they cannot start with a number or an underscore.
The most common form of assignment in the R community is the left arrow (<-), which may seem awkward to use at first but eventually becomes second nature. It even seems to make sense, as the variable is sort of pointing to its value. There is also a particularly nice benefit for people coming from languages like SQL, where a single equal sign (=) tests for equality.
It is generally considered best practice to use actual names, usually nouns, for variables instead of single letters. This provides more information to the person reading the code. This is seen throughout this book.
For various reasons a variable may need to be removed. This is easily done using remove or its shortcut rm.
> j
[1] 4
> rm(j)
> # now it is gone
> j
Error: object 'j' not found
This frees up memory so that R can store more objects, although it does not necessarily free up memory for the operating system. To encourage that, use gc, which performs garbage collection and may prompt R to return unused memory to the operating system. R automatically does garbage collection periodically, so this function is not essential.
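A script can check whether a variable is still defined, rather than triggering the error shown above. The sketch below uses the base R function exists, which is not covered in the text; the variable name is hypothetical.

```r
temporary <- 1:10

# exists() takes the variable name as a character
defined_before <- exists("temporary")

rm(temporary)
defined_after <- exists("temporary")

print(c(defined_before, defined_after))
```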
Variable names are case sensitive, which can trip up people coming from a language like SQL or Visual Basic.
> theVariable <- 17
> theVariable
[1] 17
> THEVARIABLE
Error: object 'THEVARIABLE' not found
There are numerous data types in R that store various kinds of data. The four main types of data most likely to be used are numeric, character (string), Date/POSIXct (time-based) and logical (TRUE/FALSE).
The type of data contained in a variable is checked with the class function.
> class(x)
[1] "numeric"
As expected, R excels at crunching numbers, so numeric data is the most common type in R. The most commonly used numeric data is numeric. This is similar to a float or double in other languages. It handles integers and decimals, both positive and negative, and, of course, zero. A numeric value stored in a variable is automatically assumed to be numeric. Testing whether a variable is numeric is done with the function is.numeric.
> is.numeric(x)
[1] TRUE
Another important, if less frequently used, type is integer. As the name implies this is for whole numbers only, no decimals. To assign an integer to a variable it is necessary to append an L to the value. As with checking for a numeric, the is.integer function is used.
> i <- 5L
> i
[1] 5
> is.integer(i)
[1] TRUE
Do note that, even though i is an integer, it will also pass a numeric check.
> is.numeric(i)
[1] TRUE
R nicely promotes integers to numeric when needed. This is obvious when multiplying an integer by a numeric, but importantly it works when dividing an integer by another integer, resulting in a decimal number.
> class(4L)
[1] "integer"
> class(2.8)
[1] "numeric"
> 4L * 2.8
[1] 11.2
> class(4L * 2.8)
[1] "numeric"
>
> class(5L)
[1] "integer"
> class(2L)
[1] "integer"
> 5L/2L
[1] 2.5
> class(5L/2L)
[1] "numeric"
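When a whole-number result is wanted instead, base R offers the integer-division operator %/% and the modulo operator %%, neither of which is covered above. The following sketch, with hypothetical variable names, shows that both keep the integer type when given two integers, and that as.integer drops the decimal part of a numeric.

```r
quotient <- 5L %/% 2L      # integer division: how many whole 2s fit in 5
remainder <- 5L %% 2L      # modulo: what is left over

print(class(quotient))
print(class(remainder))

# as.integer() truncates toward zero rather than rounding
truncated <- as.integer(2.9)
print(truncated)
```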
Even though it is not explicitly mathematical, the character (string) data type is very common in statistical analysis and must be handled with care. R has two primary ways of handling character data: character and factor. While they may seem similar on the surface, they are treated quite differently.
> x <- "data"
> x
[1] "data"
> y <- factor("data")
> y
[1] data
Levels: data
Notice that x contains the word “data” encapsulated in quotes, while y has the word “data” without quotes and a second line of information about the levels of y. That is explained further in Section 4.4.2 about vectors.
Characters are case sensitive, so “Data” is different from “data” or “DATA.”
To find the length of a character (or numeric) use the nchar function.
> nchar(x)
[1] 4
> nchar("hello")
[1] 5
> nchar(3)
[1] 1
> nchar(452)
[1] 3
This will not work for factor data.
> nchar(y)
Error: 'nchar()' requires a character vector
Dealing with dates and times can be difficult in any language, and to further complicate matters R has numerous different types of dates. The most useful are Date and POSIXct. Date stores just a date while POSIXct stores a date and time. Both objects are actually represented as the number of days (Date) or seconds (POSIXct) since January 1, 1970.
> date1 <- as.Date("2012-06-28")
> date1
[1] "2012-06-28"
> class(date1)
[1] "Date"
> as.numeric(date1)
[1] 15519
>
> date2 <- as.POSIXct("2012-06-28 17:42")
> date2
[1] "2012-06-28 17:42:00 EDT"
> class(date2)
[1] "POSIXct" "POSIXt"
> as.numeric(date2)
[1] 1340919720
Easier manipulation of date and time objects can be accomplished using the lubridate and chron packages.
Using functions such as as.numeric or as.Date does not merely change the formatting of an object but actually changes the underlying type.
> class(date1)
[1] "Date"
> class(as.numeric(date1))
[1] "numeric"
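Because a Date is stored as a count of days, arithmetic can be done on it directly without converting it first. The following is a small sketch of that idea using only base R; the dates and variable names are chosen for illustration.

```r
start <- as.Date("2012-06-28")

# adding 1 moves the date forward by one day
next_day <- start + 1
print(next_day)

# subtracting two Dates gives the gap between them in days
gap <- as.Date("2012-07-04") - start
print(as.numeric(gap))
```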
logicals are a way of representing data that can be either TRUE or FALSE. Numerically, TRUE is the same as 1 and FALSE is the same as 0. So TRUE * 5 equals 5 while FALSE * 5 equals 0.
> TRUE * 5
[1] 5
> FALSE * 5
[1] 0
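A practical consequence of this numeric behavior is that a set of logical values can be counted or averaged. The sketch below uses the base R functions c, sum and mean on hypothetical values.

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)

# each TRUE counts as 1, so the sum is the number of TRUE values
true_count <- sum(flags)

# the mean is therefore the proportion of TRUE values
true_share <- mean(flags)

print(c(true_count, true_share))
```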
Similar to other types, logicals have their own test, using the is.logical function.
> k <- TRUE
> class(k)
[1] "logical"
> is.logical(k)
[1] TRUE
R provides T and F as shortcuts for TRUE and FALSE, respectively, but it is best practice not to use them, as they are simply variables storing the values TRUE and FALSE and can be overwritten, which can cause a great deal of frustration as seen in the following example.
> TRUE
[1] TRUE
> T
[1] TRUE
> class(T)
[1] "logical"
> T <- 7
> T
[1] 7
> class(T)
[1] "numeric"
logicals can result from comparing two numbers, or characters.
> # does 2 equal 3?
> 2 == 3
[1] FALSE
> # does 2 not equal three?
> 2 != 3
[1] TRUE
> # is two less than three?
> 2 < 3
[1] TRUE
> # is two less than or equal to three?
> 2 <= 3
[1] TRUE
> # is two greater than three?
> 2 > 3
[1] FALSE
> # is two greater than or equal to three?
> 2 >= 3
[1] FALSE
> # is 'data' equal to 'stats'?
> "data" == "stats"
[1] FALSE
> # is 'data' less than 'stats'?
> "data" < "stats"
[1] TRUE
A vector is a collection of elements, all of the same type. For instance, c(1, 3, 2, 1, 5) is a vector consisting of the numbers 1, 3, 2, 1, 5, in that order. Similarly, c("R", "Excel", "SAS", "Excel") is a vector of the character elements “R,” “Excel,” “SAS” and “Excel.” A vector cannot be of mixed type.
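To see what the single-type rule means in practice, the sketch below combines values of different types with c. R does not raise an error. Instead it converts every element to one common type. The variable names are hypothetical.

```r
# a character element forces everything to character
mixed <- c(1, "a", TRUE)
print(mixed)
print(class(mixed))

# without characters, logicals are converted to numbers
numbers_and_logicals <- c(1.5, TRUE, FALSE)
print(numbers_and_logicals)
print(class(numbers_and_logicals))
```

This silent conversion is worth keeping in mind, since a single stray character value can turn a column of numbers into text.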
vectors play a crucial, and helpful, role in R. More than being simple containers, vectors in R are special in that R is a vectorized language. That means operations are applied to each element of the vector automatically, without the need to loop through the vector. This is a powerful concept that may seem foreign to people coming from other languages, but it is one of the greatest things about R.
vectors do not have a dimension, meaning there is no such thing as a column vector or row vector. These vectors are not like the mathematical vector where there is a difference between row and column orientation. (Column or row vectors can be represented as one-dimensional matrices, which are discussed in Section 5.3.)
The most common way to create a vector is with c. The “c” stands for combine because multiple elements are being combined into a vector.
> x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> x
[1] 1 2 3 4 5 6 7 8 9 10
Now that we have a vector of the first ten numbers, we might want to multiply each element by 3. In R this is a simple operation using just the multiplication operator (*).
> x * 3
[1] 3 6 9 12 15 18 21 24 27 30
No loops are necessary. Addition, subtraction and division are just as easy. This also works for any number of operations.
> x + 2
[1] 3 4 5 6 7 8 9 10 11 12
> x - 3
[1] -2 -1 0 1 2 3 4 5 6 7
> x/4
[1] 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50
> x^2
[1] 1 4 9 16 25 36 49 64 81 100
> sqrt(x)
[1] 1.000 1.414 1.732 2.000 2.236 2.449 2.646 2.828 3.000 3.162
Earlier we created a vector of the first ten numbers using the c function. A shortcut is the : operator, which generates a sequence of consecutive numbers, in either direction.
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> 10:1
[1] 10 9 8 7 6 5 4 3 2 1
> -2:3
[1] -2 -1 0 1 2 3
> 5:-7
[1] 5 4 3 2 1 0 -1 -2 -3 -4 -5 -6 -7
Vector operations can be extended even further. Let’s say we have two vectors of equal length. Each of the corresponding elements can be operated on together.
> # create two vectors of equal length
> x <- 1:10
> y <- -5:4
> # add them
> x + y
[1] -4 -2 0 2 4 6 8 10 12 14
> # subtract them
> x - y
[1] 6 6 6 6 6 6 6 6 6 6
> # multiply them
> x * y
[1] -5 -8 -9 -8 -5 0 7 16 27 40
> # divide them--notice division by 0 results in Inf
> x/y
[1] -0.2 -0.5 -1.0 -2.0 -5.0 Inf 7.0 4.0 3.0 2.5
> # raise one to the power of the other
> x^y
[1] 1.000e+00 6.250e-02 3.704e-02 6.250e-02 2.000e-01 1.000e+00
[7] 7.000e+00 6.400e+01 7.290e+02 1.000e+04
> # check the length of each
> length(x)
[1] 10
> length(y)
[1] 10
> # the length of them added together should be the same
> length(x + y)
[1] 10
In the preceding code block, notice the hash # symbol. This is used for comments. Anything following the hash, on the same line, will be commented out and not run.
Things get a little more complicated when operating on two vectors of unequal length. The shorter vector gets recycled, that is, its elements are repeated, in order, until they have been matched up with every element of the longer vector. If the longer one is not a multiple of the shorter one, a warning is given.
> x + c(1, 2)
[1] 2 4 4 6 6 8 8 10 10 12
> x + c(1, 2, 3)
Warning: longer object length is not a multiple of shorter object
length
[1] 2 4 6 5 7 9 8 10 12 11
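Recycling can be made explicit with the base R function rep, which is not used in the text above. The sketch below, with hypothetical variable names, repeats the shorter vector out to the length of the longer one by hand and confirms that the sum matches what R computes automatically.

```r
long <- 1:10
short <- c(1, 2)

# repeat the short vector, in order, until it has ten elements
recycled <- rep(short, length.out = length(long))
print(recycled)

# adding the hand-recycled vector gives the same answer
same_result <- identical(long + short, long + recycled)
print(same_result)
```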
Comparisons also work on vectors. Here the result is a vector of the same length containing TRUE or FALSE for each element.
> x <= 5
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
> x > y
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> x < y
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
To test whether all the resulting elements are TRUE, use the all function. Similarly, the any function checks whether any element is TRUE.
> x <- 10:1
> y <- -4:5
> any(x < y)
[1] TRUE
> all(x < y)
[1] FALSE
The nchar function also acts on each element of a vector.
> q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
+ "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
> nchar(q)
[1] 6 8 8 7 5 8 10 6 7 6
> nchar(y)
[1] 2 2 2 2 1 1 1 1 1 1
Accessing individual elements of a vector is done using square brackets ([ ]). The first element of x is retrieved by typing x[1], the first two elements by x[1:2] and nonconsecutive elements by x[c(1, 4)].
> x[1]
[1] 10
> x[1:2]
[1] 10 9
> x[c(1, 4)]
[1] 10 7
This works for all types of vectors whether they are numeric, logical, character and so forth.
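Square brackets accept more than positive positions. The sketch below shows two further forms that base R supports but the text above does not cover: a negative index, which drops an element, and a logical vector, which keeps the elements where it is TRUE. The variable names are hypothetical.

```r
values <- 10:1

# a negative index removes that position instead of selecting it
without_first <- values[-1]
print(without_first)

# a comparison yields a logical vector that can be used as an index
large <- values[values > 7]
print(large)
```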
It is possible to give names to a vector either during creation or after the fact.
> # provide a name for each element of a vector using a name-value pair
> c(One = "a", Two = "y", Last = "r")
One Two Last
"a" "y" "r"
>
> # create a vector
> w <- 1:3
> # name the elements
> names(w) <- c("a", "b", "c")
> w
a b c
1 2 3
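Once a vector has names, its elements can be looked up by name as well as by position. The following is a brief sketch with a hypothetical vector. The double-bracket form, which is standard base R but not shown above, returns the bare value without its name.

```r
counts <- c(a = 1, b = 2, c = 3)

# single brackets keep the name attached to the value
named_element <- counts["b"]
print(named_element)

# double brackets return just the value
bare_value <- counts[["b"]]
print(bare_value)

# names() also reads the names back
print(names(counts))
```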
factors are an important concept in R, especially when building models. Let’s create a simple vector of text data that has a few repeats. We will start with the vector q we created earlier and add some elements to it.
> q2 <- c(q, "Hockey", "Lacrosse", "Hockey", "Water Polo",
+ "Hockey", "Lacrosse")
Converting this to a factor is easy with as.factor.
> q2Factor <- as.factor(q2)
> q2Factor
[1] Hockey Football Baseball Curling Rugby Lacrosse
[7] Basketball Tennis Cricket Soccer Hockey Lacrosse
[13] Hockey Water Polo Hockey Lacrosse
11 Levels: Baseball Basketball Cricket Curling Football ... Water Polo
Notice that after printing out every element of q2Factor, R also prints the levels of q2Factor. The levels of a factor are the unique values of that factor variable. Technically, R is giving each unique value of a factor a unique integer tying it back to the character representation. This can be seen with as.numeric.
> as.numeric(q2Factor)
[1] 6 5 1 4 8 7 2 10 3 9 6 7 6 11 6 7
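The link between the integers and the text can be traced with the base R function levels, which returns the unique values in the order R numbers them. The sketch below uses a small hypothetical factor to show that indexing the levels by the integer codes rebuilds the original text.

```r
sports <- factor(c("Hockey", "Curling", "Hockey", "Rugby"))

# the levels are the unique values, sorted alphabetically by default
print(levels(sports))

# each element is stored as the position of its level
codes <- as.integer(sports)
print(codes)

# looking the codes up in the levels recovers the text
rebuilt <- levels(sports)[codes]
print(rebuilt)
```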
In ordinary factors the order of the levels does not matter and one level is no different from another. Sometimes, however, it is important to understand the order of a factor, such as when coding education levels. Setting the ordered argument to TRUE creates an ordered factor with the order given in the levels argument.
> factor(x=c("High School", "College", "Masters", "Doctorate"),
+ levels=c("High School", "College", "Masters", "Doctorate"),
+ ordered=TRUE)
[1] High School College Masters Doctorate
Levels: High School < College < Masters < Doctorate
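One thing an ordered factor allows, which an ordinary factor does not, is comparison between levels. The following sketch reuses the education levels from the example above with a hypothetical variable name. It is an illustration of base R behavior and not something shown in the text.

```r
education <- factor(c("College", "High School", "Doctorate"),
                    levels = c("High School", "College", "Masters", "Doctorate"),
                    ordered = TRUE)

# comparisons follow the order of the levels, not the alphabet
below_masters <- education < "Masters"
print(below_masters)

# max() returns the highest level present
highest <- max(education)
print(highest)
```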
factors can drastically reduce the size of the variable because they are storing only the unique values, but they can cause headaches if not used properly. This will be discussed further throughout the book.
Earlier we briefly used a few basic functions like nchar, length and as.Date to illustrate some concepts. Functions are very important and helpful in any language because they make code easily repeatable. Almost every step taken in R involves using functions, so it is best to learn the proper way to call them. R function calling is filled with a good deal of nuance, so we are going to focus on the gist of what you need to know. Of course, throughout the book there will be many examples of calling functions.
Let’s start with the simple mean function, which computes the average of a set of numbers. In its simplest form it takes a vector as an argument.
> mean(x)
[1] 5.5
More complicated functions have multiple arguments that can be either specified by the order they are entered or by using their name with an equal sign. We will see further use of this throughout the book.
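As a small sketch of the two calling styles, consider the trim argument of mean, which drops a fraction of the smallest and largest values before averaging. It is part of base R but has not been introduced in the text, and the data here are hypothetical.

```r
values <- c(1, 2, 3, 4, 100)

# arguments matched by name
by_name <- mean(x = values, trim = 0.2)

# the same arguments matched by position
by_position <- mean(values, 0.2)

# with trim = 0.2 the lowest and highest values are dropped,
# leaving the average of 2, 3 and 4
print(c(by_name, by_position))
```

Naming arguments is slightly more typing but makes the call easier to read, especially for functions with many options.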
R provides an easy way for users to build their own functions, which we will cover in more detail in Chapter 8.
Any function provided in R has accompanying documentation, of varying quality of course. The easiest way to access that documentation is to place a question mark in front of the function name, like this: ?mean.
To get help on binary operators like +, * or == surround them with back ticks (`).
> ?`+`
> ?`*`
> ?`==`
There are occasions when we have only a sense of the function we want to use. In that case we can look up the function by using part of the name with apropos.
> apropos("mea")
[1] ".cache/mean-simple_ce29515dafe58a90a771568646d73aae"
[2] ".colMeans"
[3] ".rowMeans"
[4] "colMeans"
[5] "influence.measures"
[6] "kmeans"
[7] "mean"
[8] "mean.Date"
[9] "mean.default"
[10] "mean.difftime"
[11] "mean.POSIXct"
[12] "mean.POSIXlt"
[13] "mean_cl_boot"
[14] "mean_cl_normal"
[15] "mean_sdl"
[16] "mean_se"
[17] "rowMeans"
[18] "weighted.mean"
Missing data plays a critical role in both statistics and computing, and R has two types of missing data, NA and NULL. While they are similar, they behave differently and that difference needs attention.
Often we will have data that has missing values for any number of reasons. Statistical programs use varying techniques to represent missing data such as a dash, a period or even the number 99. R uses NA. NA will often be seen as just another element of a vector. is.na tests each element of a vector for missingness.
> z <- c(1, 2, NA, 8, 3, NA, 3)
> z
[1] 1 2 NA 8 3 NA 3
> is.na(z)
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE
NA is entered simply by typing the letters “N” and “A” as if they were normal text. This works for any kind of vector.
> zChar <- c("Hockey", NA, "Lacrosse")
> zChar
[1] "Hockey" NA "Lacrosse"
> is.na(zChar)
[1] FALSE TRUE FALSE
Handling missing data is an important part of statistical analysis. There are many techniques depending on field and preference. One popular technique is multiple imputation, which is discussed in detail in Chapter 25 of Andrew Gelman and Jennifer Hill’s book Data Analysis Using Regression and Multilevel/Hierarchical Models, and is implemented in the mi, mice and Amelia packages.
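A much simpler approach than imputation, suitable only when ignoring the missing values is acceptable, is to skip them during a calculation. The sketch below uses the na.rm argument of the base R function mean on hypothetical data. It is offered as an illustration, not as a recommendation over the techniques named above.

```r
readings <- c(1, 2, NA, 8, 3, NA, 3)

# by default a single NA makes the whole result NA
plain_mean <- mean(readings)
print(plain_mean)

# na.rm = TRUE drops the missing values before averaging
clean_mean <- mean(readings, na.rm = TRUE)
print(clean_mean)

# is.na() combined with sum() counts the missing values
missing_count <- sum(is.na(readings))
print(missing_count)
```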
NULL is the absence of anything. It is not exactly missingness, it is nothingness. Functions can sometimes return NULL and their arguments can be NULL. An important difference between NA and NULL is that NULL is a single indivisible object and cannot be stored as an element of a vector. If used inside a vector it simply disappears.
> z <- c(1, NULL, 3)
> z
[1] 1 3
Even though it was entered into the vector z, it did not get stored in z. In fact, z is only two elements long.
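The difference between NA and NULL shows up clearly in the length of the resulting vector. The following is a short sketch with hypothetical variable names, using only the base R functions c and length.

```r
with_null <- c(1, NULL, 3)
with_na <- c(1, NA, 3)

# NULL vanishes, so only two elements remain
null_length <- length(with_null)

# NA holds its place as a third element
na_length <- length(with_na)

# NULL by itself has no elements at all
empty_length <- length(NULL)

print(c(null_length, na_length, empty_length))
```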
The test for a NULL value is is.null.
> d <- NULL
> is.null(d)
[1] TRUE
> is.null(7)
[1] FALSE
Since NULL cannot be a part of a vector, is.null is appropriately not vectorized.
Data come in many types, and R is well equipped to handle them. In addition to basic calculations, R can handle numeric, character and time-based data. One of the nicer parts of working with R, although one that requires a different way of thinking about programming, is vectorization. This allows operating on multiple elements in a vector simultaneously, which leads to faster and more mathematical code.