Chapter 4. Basics of R

R is a powerful tool for all manner of calculations, data manipulation and scientific computations. Before getting to the complex operations possible in R we must start with the basics. Like most languages R has its share of mathematical capability, variables, functions and data types.

4.1. Basic Math

Being a statistical programming language, R can certainly be used to do basic math and that is where we will start.

We begin with the “Hello, World!” of basic math: 1 + 1. In the console there is a right angle bracket (>) where code should be entered. Simply test R by running:

> 1 + 1

[1] 2

If this returns 2, then everything is great; if not, then something is very, very wrong. Assuming it worked, let’s look at some slightly more complicated expressions:

> 1 + 2 + 3

[1] 6

> 3 * 7 * 2

[1] 42

> 4/2

[1] 2

> 4/3

[1] 1.333

These follow the basic order of operations: Parentheses, Exponents, Multiplication, Division, Addition and Subtraction (PEMDAS). This means operations inside parentheses take priority over other operations. Next on the priority list is exponentiation. After that multiplication and division are performed, from left to right, followed by addition and subtraction.

This is why the first two lines in the following code have the same result while the third is different.

> 4 * 6 + 5

[1] 29

> (4 * 6) + 5

[1] 29

> 4 * (6 + 5)

[1] 44
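
The same precedence rules cover exponentiation, which the examples above do not exercise. The following sketch, with values chosen purely for illustration, shows the exponent binding more tightly than multiplication and than a leading minus sign.

```r
# exponentiation happens before multiplication
2 * 3^2      # 18
# parentheses force the multiplication first
(2 * 3)^2    # 36
# the exponent also binds tighter than a leading minus sign
-2^2         # -4
(-2)^2       # 4
```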

So far we have put white space in between each operator such as * and /. This is not necessary but is encouraged as good coding practice.

4.2. Variables

Variables are an integral part of any programming language and R offers a great deal of flexibility. Unlike statically typed languages such as C++, R does not require variable types to be declared. A variable can take on any available data type as described in Section 4.3. It can also hold any R object such as a function, the result of an analysis or a plot. A single variable can at one point hold a number, then later hold a character and then later a number again.

4.2.1. Variable Assignment

There are a number of ways to assign a value to a variable, and again, this does not depend on the type of value being assigned.

The valid assignment operators are <- and = with the first being preferred.

For example, let’s save 2 to the variable x and 5 to the variable y.

> x <- 2
> x

[1] 2

> y = 5
> y

[1] 5

The arrow operator can also point in the other direction.

> 3 -> z
> z

[1] 3

The assignment operation can be used successively to assign a value to multiple variables simultaneously.

> a <- b <- 7
> a

[1] 7

> b

[1] 7

A more laborious, though sometimes necessary, way to assign variables is to use the assign function.

> assign("j", 4)
> j

[1] 4
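
One situation where assign can help is when the variable name is only known while the code is running. A minimal sketch, in which the names variableName and score2 are made up for illustration; paste0 builds the name and get retrieves a variable from its name.

```r
# build the name "score2" as a character
variableName <- paste0("score", 2)
# assign 20 to the variable with that name
assign(variableName, 20)
score2                 # 20
# get looks up a variable from its name stored as a character
get(variableName)      # 20
```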

Variable names can contain any combination of alphanumeric characters along with periods (.) and underscores (_). However, they cannot start with a number or an underscore.

The most common form of assignment in the R community is the left arrow (<-), which may seem awkward to use at first but eventually becomes second nature. It even seems to make sense, as the variable is sort of pointing to its value. There is also a particularly nice benefit for people coming from languages like SQL, where a single equal sign (=) tests for equality.

It is generally considered best practice to use actual names, usually nouns, for variables instead of single letters. This provides more information to the person reading the code. This is seen throughout this book.

4.2.2. Removing Variables

For various reasons a variable may need to be removed. This is easily done using remove or its shortcut rm.

> j

[1] 4

> rm(j)
> # now it is gone
> j

Error: object 'j' not found

This frees up memory so that R can store more objects, although it does not necessarily return that memory to the operating system. Calling gc performs garbage collection, which may release unused memory to the operating system. R automatically does garbage collection periodically, so calling this function is rarely essential.

Variable names are case sensitive, which can trip up people coming from a language like SQL or Visual Basic.

> theVariable <- 17
> theVariable

[1] 17

> THEVARIABLE

Error: object 'THEVARIABLE' not found

4.3. Data Types

There are numerous data types in R that store various kinds of data. The four main types of data most likely to be used are numeric, character (string), Date/POSIXct (time-based) and logical (TRUE/FALSE).

The type of data contained in a variable is checked with the class function.

> class(x)

[1] "numeric"
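
Because a variable is not tied to a type, as noted in Section 4.2, class reports whatever the variable currently holds. A small sketch using a made-up variable named holder:

```r
holder <- 17
class(holder)      # "numeric"
holder <- "seventeen"
class(holder)      # "character"
holder <- TRUE
class(holder)      # "logical"
holder <- mean     # even a function can be stored
class(holder)      # "function"
```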

4.3.1. Numeric Data

As expected, R excels at running numbers, so numeric data is the most common type in R. The most commonly used numeric data is numeric. This is similar to a float or double in other languages. It handles integers and decimals, both positive and negative, and, of course, zero. A numeric value stored in a variable is automatically assumed to be numeric. Testing whether a variable is numeric is done with the function is.numeric.

> is.numeric(x)

[1] TRUE

Another important, if less frequently used, type is integer. As the name implies this is for whole numbers only, no decimals. To assign an integer to a variable it is necessary to append an L to the value. As with checking for a numeric, the is.integer function is used.

> i <- 5L
> i

[1] 5

> is.integer(i)

[1] TRUE

Do note that, even though i is an integer, it will also pass a numeric check.

> is.numeric(i)

[1] TRUE

R nicely promotes integers to numeric when needed. This is obvious when multiplying an integer by a numeric, but importantly it works when dividing an integer by another integer, resulting in a decimal number.

> class(4L)

[1] "integer"

> class(2.8)

[1] "numeric"

> 4L * 2.8

[1] 11.2

> class(4L * 2.8)

[1] "numeric"

> class(5L)

[1] "integer"

> class(2L)

[1] "integer"

> 5L/2L

[1] 2.5

> class(5L/2L)

[1] "numeric"
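
When the result should stay an integer, the integer division (%/%) and modulo (%%) operators are one option, and as.integer converts a numeric by dropping the decimal part. A brief sketch:

```r
# integer division and remainder keep the integer type
5L %/% 2L            # 2
class(5L %/% 2L)     # "integer"
5L %% 2L             # 1
# as.integer truncates toward zero rather than rounding
as.integer(2.8)      # 2
as.integer(-2.8)     # -2
```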

4.3.2. Character Data

Even though it is not explicitly mathematical, the character (string) data type is very common in statistical analysis and must be handled with care. R has two primary ways of handling character data: character and factor. While they may seem similar on the surface, they are treated quite differently.

> x <- "data"
> x

[1] "data"

> y <- factor("data")
> y

[1] data
Levels: data

Notice that x contains the word “data” encapsulated in quotes, while y has the word “data” without quotes and a second line of information about the levels of y. That is explained further in Section 4.4.2 about factor vectors.

Characters are case sensitive, so “Data” is different from “data” or “DATA.”

To find the length of a character (or numeric) use the nchar function.

> nchar(x)

[1] 4

> nchar("hello")

[1] 5

> nchar(3)

[1] 1

> nchar(452)

[1] 3

This will not work for factor data.

> nchar(y)

Error: 'nchar()' requires a character vector
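
If the number of characters in a factor's label is needed, one workaround is to convert the factor to character first with as.character. A sketch using a made-up factor named sport:

```r
sport <- factor("Hockey")
# convert to character before counting characters
nchar(as.character(sport))    # 6
```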

4.3.3. Dates

Dealing with dates and times can be difficult in any language, and to further complicate matters R has numerous different types of dates. The most useful are Date and POSIXct. Date stores just a date while POSIXct stores a date and time. Both objects are actually represented as the number of days (Date) or seconds (POSIXct) since January 1, 1970.

> date1 <- as.Date("2012-06-28")
> date1

[1] "2012-06-28"

> class(date1)

[1] "Date"

> as.numeric(date1)

[1] 15519

> date2 <- as.POSIXct("2012-06-28 17:42")
> date2

[1] "2012-06-28 17:42:00 EDT"

> class(date2)

[1] "POSIXct" "POSIXt"

> as.numeric(date2)

[1] 1340919720

Easier manipulation of date and time objects can be accomplished using the lubridate and chron packages.

Using functions such as as.numeric or as.Date does not merely change the formatting of an object but actually changes the underlying type.

> class(date1)

[1] "Date"

> class(as.numeric(date1))

[1] "numeric"
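
Because a Date is stored as a count of days, ordinary arithmetic works on it. The sketch below uses made-up dates, and passes tz = "UTC" to as.POSIXct (an assumption made here so the result does not depend on the computer's time zone).

```r
startDate <- as.Date("2012-06-28")
# adding a number moves the date forward by that many days
startDate + 30                        # "2012-07-28"
# subtracting two dates gives the number of days between them
endDate <- as.Date("2012-12-25")
as.numeric(endDate - startDate)       # 180
# format extracts pieces of the date as character data
format(startDate, "%Y")               # "2012"
# seconds since January 1, 1970, measured in UTC
as.numeric(as.POSIXct("2012-06-28 17:42", tz = "UTC"))   # 1340905320
```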

4.3.4. Logical

logicals are a way of representing data that can be either TRUE or FALSE. Numerically, TRUE is the same as 1 and FALSE is the same as 0. So TRUE * 5 equals 5 while FALSE * 5 equals 0.

> TRUE * 5

[1] 5

> FALSE * 5

[1] 0

Similar to other types, logicals have their own test, using the is.logical function.

> k <- TRUE
> class(k)

[1] "logical"

> is.logical(k)

[1] TRUE

R provides T and F as shortcuts for TRUE and FALSE, respectively, but it is best practice not to use them. They are simply variables storing the values TRUE and FALSE and can be overwritten, which can cause a great deal of frustration, as seen in the following example.

> TRUE

[1] TRUE

> T

[1] TRUE

> class(T)

[1] "logical"

> T <- 7
> T

[1] 7

> class(T)

[1] "numeric"

logicals can result from comparing two numbers, or characters.

> # does 2 equal 3?
> 2 == 3

[1] FALSE

> # does 2 not equal three?
> 2 != 3

[1] TRUE

> # is two less than three?
> 2 < 3

[1] TRUE

> # is two less than or equal to three?
> 2 <= 3

[1] TRUE

> # is two greater than three?
> 2 > 3

[1] FALSE

> # is two greater than or equal to three?
> 2 >= 3

[1] FALSE

> # is 'data' equal to 'stats'?
> "data" == "stats"

[1] FALSE

> # is 'data' less than 'stats'?
> "data" < "stats"

[1] TRUE

4.4. Vectors

A vector is a collection of elements, all of the same type. For instance, c(1, 3, 2, 1, 5) is a vector consisting of the numbers 1, 3, 2, 1, 5, in that order. Similarly, c("R", "Excel", "SAS", "Excel") is a vector of the character elements “R,” “Excel,” “SAS” and “Excel.” A vector cannot be of mixed type.

vectors play a crucial, and helpful, role in R. More than being simple containers, vectors in R are special in that R is a vectorized language. That means operations are applied to each element of the vector automatically, without the need to loop through the vector. This is a powerful concept that may seem foreign to people coming from other languages, but it is one of the greatest things about R.

vectors do not have a dimension, meaning there is no such thing as a column vector or row vector. These vectors are not like the mathematical vector where there is a difference between row and column orientation. (Column or row vectors can be represented as one-dimensional matrices, which are discussed in Section 5.3.)

The most common way to create a vector is with c. The “c” stands for combine because multiple elements are being combined into a vector.

> x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> x

 [1]  1  2  3  4  5  6  7  8  9 10
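
When elements of different types are passed to c, R does not raise an error. Instead it converts them all to a single type that can represent every element, which is how a vector stays of one type. A short sketch of that coercion:

```r
# numbers and text together become character
c(1, "a", TRUE)          # "1" "a" "TRUE"
class(c(1, "a", TRUE))   # "character"
# a logical mixed with numbers becomes numeric
c(1.5, TRUE)             # 1.5 1.0
# an integer mixed with a decimal becomes numeric
class(c(1L, 2.5))        # "numeric"
```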

4.4.1. Vector Operations

Now that we have a vector of the first ten numbers, we might want to multiply each element by 3. In R this is a simple operation using just the multiplication operator (*).

> x * 3

 [1]  3  6  9 12 15 18 21 24 27 30

No loops are necessary. Addition, subtraction and division are just as easy. This also works for any number of operations.

> x + 2

 [1]  3  4  5  6  7  8  9 10 11 12

> x - 3

 [1] -2 -1  0  1  2  3  4  5  6  7

> x/4

 [1] 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25 2.50

> x^2

 [1]   1   4   9  16  25  36  49  64  81 100

> sqrt(x)

 [1] 1.000 1.414 1.732 2.000 2.236 2.449 2.646 2.828 3.000 3.162
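
For comparison, here is roughly what the multiplication would look like with an explicit loop, next to the vectorized form. The names values, looped and vectorized are made up for this sketch.

```r
values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# loop version: create an empty result, then fill it one element at a time
looped <- numeric(length(values))
for (position in seq_along(values)) {
    looped[position] <- values[position] * 3
}

# vectorized version: one expression handles every element
vectorized <- values * 3

# both approaches give the same answer
identical(looped, vectorized)    # TRUE
```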

Earlier we created a vector of the first ten numbers using the c function. A shortcut is the : operator, which generates a sequence of consecutive numbers, in either direction.

> 1:10

 [1]  1  2  3  4  5  6  7  8  9 10

> 10:1

 [1] 10  9  8  7  6  5  4  3  2  1

> -2:3

[1] -2 -1  0  1  2  3

> 5:-7

 [1]  5  4  3  2  1  0 -1 -2 -3 -4 -5 -6 -7
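
Two details are worth a quick sketch. The : operator is evaluated before arithmetic such as subtraction, so parentheses matter, and for steps other than 1 the seq function is available.

```r
# the sequence is built first, then 1 is subtracted from each element
1:5 - 1              # 0 1 2 3 4
# parentheses make the subtraction happen first
1:(5 - 1)            # 1 2 3 4
# seq allows a step size other than 1
seq(from = 1, to = 10, by = 2)    # 1 3 5 7 9
```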

Vector operations can be extended even further. Let’s say we have two vectors of equal length. Each of the corresponding elements can be operated on together.

> # create two vectors of equal length
> x <- 1:10
> y <- -5:4
> # add them
> x + y

 [1] -4 -2  0  2  4  6  8 10 12 14

> # subtract them
> x - y

 [1] 6 6 6 6 6 6 6 6 6 6

> # multiply them
> x * y

 [1] -5 -8 -9 -8 -5  0  7 16 27 40

> # divide them--notice division by 0 results in Inf
> x/y

 [1] -0.2 -0.5 -1.0 -2.0 -5.0  Inf  7.0  4.0  3.0  2.5

> # raise one to the power of the other
> x^y

 [1] 1.000e+00 6.250e-02 3.704e-02 6.250e-02 2.000e-01 1.000e+00
 [7] 7.000e+00 6.400e+01 7.290e+02 1.000e+04

> # check the length of each
> length(x)

[1] 10

> length(y)

[1] 10

> # the length of them added together should be the same
> length(x + y)

[1] 10

In the preceding code block, notice the hash symbol (#). This is used for comments. Anything following the hash, on the same line, will be commented out and not run.

Things get a little more complicated when operating on two vectors of unequal length. The shorter vector gets recycled, that is, its elements are repeated, in order, until they have been matched up with every element of the longer vector. If the length of the longer vector is not a multiple of the length of the shorter one, a warning is given.

> x + c(1, 2)

 [1]  2  4  4  6  6  8  8 10 10 12

> x + c(1, 2, 3)

Warning: longer object length is not a multiple of shorter object
length

 [1] 2  4  6  5  7  9  8 10 12 11
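
Recycling can be thought of as stretching the shorter vector to the length of the longer one. The sketch below makes that explicit with rep and its length.out argument; the names longer, shorter and stretched are made up for illustration.

```r
longer <- 1:10
shorter <- c(1, 2)
# what R does implicitly: repeat the shorter vector to length 10
stretched <- rep(shorter, length.out = length(longer))
stretched                       # 1 2 1 2 1 2 1 2 1 2
# the implicit and explicit versions agree
identical(longer + shorter, longer + stretched)   # TRUE
```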

Comparisons also work on vectors. Here the result is a vector of the same length containing TRUE or FALSE for each element.

> x <= 5

 [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

> x > y

 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

> x < y

 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

To test whether all the resulting elements are TRUE, use the all function. Similarly, the any function checks whether any element is TRUE.

> x <- 10:1
> y <- -4:5
> any(x < y)

[1] TRUE

> all(x < y)

[1] FALSE
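
Since TRUE counts as 1 and FALSE as 0, sum gives the number of elements that pass a comparison, and which gives their positions. A sketch with made-up vectors named first and second that hold the same values as x and y above:

```r
first <- 10:1
second <- -4:5
# how many elements of first are less than the matching element of second?
sum(first < second)      # 2
# which positions are they?
which(first < second)    # 9 10
```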

The nchar function also acts on each element of a vector.

> q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
+        "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
> nchar(q)

 [1]  6  8  8  7  5  8 10  6  7  6

> nchar(y)

 [1] 2 2 2 2 1 1 1 1 1 1

Accessing individual elements of a vector is done using square brackets ([ ]). The first element of x is retrieved by typing x[1], the first two elements by x[1:2] and nonconsecutive elements by x[c(1, 4)].

> x[1]

[1] 10

> x[1:2]

[1] 10 9

> x[c(1, 4)]

[1] 10  7

This works for all types of vectors whether they are numeric, logical, character and so forth.
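
Square brackets accept more than positive positions. A brief sketch, using a made-up vector named scores, of dropping elements with negative indices, selecting elements with a logical comparison and replacing an element:

```r
scores <- c(10, 9, 8, 7, 6)
# a negative index drops that position
scores[-1]             # 9 8 7 6
# a logical vector keeps only the elements where it is TRUE
scores[scores > 7]     # 10 9 8
# an element can be replaced by assigning to its position
scores[2] <- 100
scores                 # 10 100 8 7 6
```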

It is possible to give names to a vector either during creation or after the fact.

> # provide a name for each element of a vector using a name-value pair
> c(One = "a", Two = "y", Last = "r")

 One  Two Last
 "a"  "y"  "r"

> # create a vector
> w <- 1:3
> # name the elements
> names(w) <- c("a", "b", "c")
> w

a b c
1 2 3
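
Once a vector has names, elements can be retrieved by name as well as by position. A sketch with a made-up vector named counts:

```r
counts <- c(apples = 3, pears = 5, plums = 2)
# retrieve by name; the result keeps its name
counts["pears"]              # 5, labeled pears
# names returns the names themselves
names(counts)                # "apples" "pears" "plums"
# double brackets return the value without the name
counts[["pears"]]            # 5
```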

4.4.2. Factor Vectors

factors are an important concept in R, especially when building models. Let’s create a simple vector of text data that has a few repeats. We will start with the vector q we created earlier and add some elements to it.

> q2 <- c(q, "Hockey", "Lacrosse", "Hockey", "Water Polo",
+         "Hockey", "Lacrosse")

Converting this to a factor is easy with as.factor.

> q2Factor <- as.factor(q2)
> q2Factor

 [1] Hockey     Football   Baseball   Curling    Rugby      Lacrosse
 [7] Basketball Tennis     Cricket    Soccer     Hockey     Lacrosse
[13] Hockey     Water Polo Hockey     Lacrosse
11 Levels: Baseball Basketball Cricket Curling Football ... Water Polo

Notice that after printing out every element of q2Factor, R also prints the levels of q2Factor. The levels of a factor are the unique values of that factor variable. Technically, R is giving each unique value of a factor a unique integer tying it back to the character representation. This can be seen with as.numeric.

> as.numeric(q2Factor)

 [1]  6  5  1  4  8  7  2 10  3  9  6  7  6 11  6  7

In ordinary factors the order of the levels does not matter and one level is no different from another. Sometimes, however, it is important to understand the order of a factor, such as when coding education levels. Setting the ordered argument to TRUE creates an ordered factor with the order given in the levels argument.

> factor(x=c("High School", "College", "Masters", "Doctorate"),
+        levels=c("High School", "College", "Masters", "Doctorate"),
+        ordered=TRUE)

[1] High School College     Masters     Doctorate
Levels: High School < College < Masters < Doctorate
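
With an ordered factor, comparisons between levels become meaningful. A sketch that stores the same education levels in a made-up variable named education:

```r
education <- factor(x = c("High School", "College", "Masters", "Doctorate"),
                    levels = c("High School", "College", "Masters", "Doctorate"),
                    ordered = TRUE)
# is College lower than Masters?
education[2] < education[3]      # TRUE
# comparing against a level name also works
education >= "Masters"           # FALSE FALSE TRUE TRUE
# the highest level present
max(education)                   # Doctorate
```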

factors can drastically reduce the size of the variable because they are storing only the unique values, but they can cause headaches if not used properly. This will be discussed further throughout the book.
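
One such headache, sketched here with made-up data, arises when numbers have been stored as a factor: as.numeric returns the integer codes of the levels, not the original numbers. Converting to character first recovers the values.

```r
ages <- factor(c("30", "25", "40", "25"))
# the integer codes of the levels, not the ages
as.numeric(ages)                   # 2 1 3 1
# convert to character first to recover the actual numbers
as.numeric(as.character(ages))     # 30 25 40 25
```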

4.5. Calling Functions

Earlier we briefly used a few basic functions like nchar, length and as.Date to illustrate some concepts. Functions are very important and helpful in any language because they make code easily repeatable. Almost every step taken in R involves using functions, so it is best to learn the proper way to call them. R function calling is filled with a good deal of nuance, so we are going to focus on the essentials. Of course, throughout the book there will be many examples of calling functions.

Let’s start with the simple mean function, which computes the average of a set of numbers. In its simplest form it takes a vector as an argument.

> mean(x)

[1] 5.5

More complicated functions have multiple arguments that can be either specified by the order they are entered or by using their name with an equal sign. We will see further use of this throughout the book.
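
As a sketch of both styles, the mean function accepts a trim argument when given a numeric vector (the fraction of observations dropped from each end before averaging). The vector name amounts and its values are made up for illustration.

```r
amounts <- c(1, 2, 3, 100)
# arguments matched by position: here the second argument is taken as trim
mean(amounts, 0.25)                  # 2.5
# the same call with arguments matched by name
mean(x = amounts, trim = 0.25)       # 2.5
# named arguments can be given in any order
mean(trim = 0.25, x = amounts)       # 2.5
```

Naming the arguments is longer to type but makes the call easier to read, since the reader does not need to remember the order in which the function expects them.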

R provides an easy way for users to build their own functions, which we will cover in more detail in Chapter 8.

4.6. Function Documentation

Any function provided in R has accompanying documentation, of varying quality of course. The easiest way to access that documentation is to place a question mark in front of the function name, like this: ?mean.

To get help on binary operators like +, * or == surround them with back ticks (`).

> ?`+`
> ?`*`
> ?`==`

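There are occasions when we have only a sense of the function we want to use. In that case we can look up the function by using part of the name with apropos. The matches returned depend on which packages and objects are loaded in the session, so the list will differ from one computer to another.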

> apropos("mea")

 [1] ".cache/mean-simple_ce29515dafe58a90a771568646d73aae"
 [2] ".colMeans"
 [3] ".rowMeans"
 [4] "colMeans"
 [5] "influence.measures"
 [6] "kmeans"
 [7] "mean"
 [8] "mean.Date"
 [9] "mean.default"
[10] "mean.difftime"
[11] "mean.POSIXct"
[12] "mean.POSIXlt"
[13] "mean_cl_boot"
[14] "mean_cl_normal"
[15] "mean_sdl"
[16] "mean_se"
[17] "rowMeans"
[18] "weighted.mean"

4.7. Missing Data

Missing data plays a critical role in both statistics and computing, and R has two types of missing data, NA and NULL. While they are similar, they behave differently and that difference needs attention.

4.7.1. NA

Often we will have data that has missing values for any number of reasons. Statistical programs use varying techniques to represent missing data such as a dash, a period or even the number 99. R uses NA. NA will often be seen as just another element of a vector. is.na tests each element of a vector for missingness.

> z <- c(1, 2, NA, 8, 3, NA, 3)
> z

[1]  1  2 NA  8  3 NA  3

> is.na(z)

[1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE

NA is entered simply by typing the letters “N” and “A” as if they were normal text. This works for any kind of vector.

> zChar <- c("Hockey", NA, "Lacrosse")
> zChar

[1] "Hockey"   NA         "Lacrosse"

> is.na(zChar)

[1] FALSE TRUE FALSE
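
Many summary functions return NA when any element is missing, and accept an na.rm argument to skip the missing values. A sketch with a made-up vector named readings that holds the same values as z:

```r
readings <- c(1, 2, NA, 8, 3, NA, 3)
# a single NA makes the whole result NA
mean(readings)                  # NA
# na.rm = TRUE drops the missing values before averaging
mean(readings, na.rm = TRUE)    # 3.4
# count the missing values
sum(is.na(readings))            # 2
```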

Handling missing data is an important part of statistical analysis. There are many techniques depending on field and preference. One popular technique is multiple imputation, which is discussed in detail in Chapter 25 of Andrew Gelman and Jennifer Hill’s book Data Analysis Using Regression and Multilevel/Hierarchical Models, and is implemented in the mi, mice and Amelia packages.

4.7.2. NULL

NULL is the absence of anything. It is not exactly missingness; it is nothingness. Functions can sometimes return NULL and their arguments can be NULL. An important difference between NA and NULL is that NULL is atomical and cannot exist within a vector. If used inside a vector it simply disappears.

> z <- c(1, NULL, 3)
> z

[1] 1 3

Even though it was entered into the vector z, it did not get stored in z. In fact, z is only two elements long.

The test for a NULL value is is.null.

> d <- NULL
> is.null(d)

[1] TRUE

> is.null(7)

[1] FALSE

Since NULL cannot be a part of a vector, is.null is appropriately not vectorized.
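
The difference between the two shows up clearly in the length of a vector: NA occupies a position while NULL does not. A short sketch:

```r
# NA is kept as an element
length(c(1, NA, 3))      # 3
# NULL vanishes
length(c(1, NULL, 3))    # 2
# NULL on its own has length zero
length(NULL)             # 0
# is.null looks at the whole object, so a vector holding values is not NULL
is.null(c(1, NA, 3))     # FALSE
```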

4.8. Conclusion

Data come in many types, and R is well equipped to handle them. In addition to basic calculations, R can handle numeric, character and time-based data. One of the nicer parts of working with R, although one that requires a different way of thinking about programming, is vectorization. This allows operating on multiple elements in a vector simultaneously, which leads to faster and more mathematical code.
