Chapter 5. Lists and Data Frames

The vectors, matrices, and arrays that we have seen so far contain elements that are all of the same type. Lists and data frames are two types that let us combine different types of data in a single variable.

Chapter Goals

After reading this chapter, you should:

  • Be able to create lists and data frames
  • Be able to use length, names, and other functions to inspect and manipulate these types
  • Understand what NULL is and when to use it
  • Understand the difference between recursive and atomic variables
  • Know how to perform basic manipulation of lists and data frames

Lists

A list is, loosely speaking, a vector where each element can be of a different type. This section concerns how to create, index, and manipulate lists.

Creating Lists

Lists are created with the list function, and specifying the contents works much like the c function that we’ve seen already. You simply list the contents, with each argument separated by a comma. List elements can be any variable type—vectors, matrices, even functions:

(a_list <- list(
  c(1, 1, 2, 5, 14, 42),    #See http://oeis.org/A000108
  month.abb,
  matrix(c(3, -8, 1, -3), nrow = 2),
  asin
))
## [[1]]
## [1]  1  1  2  5 14 42
##
## [[2]]
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
##
## [[3]]
##      [,1] [,2]
## [1,]    3    1
## [2,]   -8   -3
##
## [[4]]
## function (x)  .Primitive("asin")

As with vectors, you can name elements during construction, or afterward using the names function:

names(a_list) <- c("catalan", "months", "involutary", "arcsin")
a_list
## $catalan
## [1]  1  1  2  5 14 42
##
## $months
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
##
## $involutary
##      [,1] [,2]
## [1,]    3    1
## [2,]   -8   -3
##
## $arcsin
## function (x)  .Primitive("asin")
(the_same_list <- list(
  catalan    = c(1, 1, 2, 5, 14, 42),
  months     = month.abb,
  involutary = matrix(c(3, -8, 1, -3), nrow = 2),
  arcsin     = asin
))
## $catalan
## [1]  1  1  2  5 14 42
##
## $months
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
##
## $involutary
##      [,1] [,2]
## [1,]    3    1
## [2,]   -8   -3
##
## $arcsin
## function (x)  .Primitive("asin")

It isn’t compulsory, but it helps if the names that you give elements are valid variable names.

It is even possible for elements of lists to be lists themselves:

(main_list <- list(
  middle_list          = list(
    element_in_middle_list = diag(3),
    inner_list             = list(
      element_in_inner_list         = pi ^ 1:4,  #parsed as (pi ^ 1):4, which is just pi
      another_element_in_inner_list = "a"
    )
  ),
  element_in_main_list = log10(1:10)
))
## $middle_list
## $middle_list$element_in_middle_list
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
##
## $middle_list$inner_list
## $middle_list$inner_list$element_in_inner_list
## [1] 3.142
##
## $middle_list$inner_list$another_element_in_inner_list
## [1] "a"
##
##
##
## $element_in_main_list
##  [1] 0.0000 0.3010 0.4771 0.6021 0.6990 0.7782 0.8451 0.9031 0.9542 1.0000

In theory, you can keep nesting lists forever. In practice, current versions of R will throw an error once you start nesting your lists tens of thousands of levels deep (the exact number is machine specific). Luckily, this shouldn’t be a problem for you, since real-world code where nesting is deeper than three or four levels is extremely rare.

Atomic and Recursive Variables

Due to this ability to contain other lists within themselves, lists are considered to be recursive variables. Vectors, matrices, and arrays, by contrast, are atomic. (Variables can either be recursive or atomic, never both; Appendix A contains a table explaining which variable types are atomic, and which are recursive.) The functions is.recursive and is.atomic let us test variables to see what type they are:

is.atomic(list())
## [1] FALSE
is.recursive(list())
## [1] TRUE
is.atomic(numeric())
## [1] TRUE
is.recursive(numeric())
## [1] FALSE

List Dimensions and Arithmetic

Like vectors, lists have a length. A list’s length is the number of top-level elements that it contains:

length(a_list)
## [1] 4
length(main_list) #doesn't include the lengths of nested lists
## [1] 2

Again, like vectors, but unlike matrices, lists don’t have dimensions. The dim function correspondingly returns NULL:

dim(a_list)
## NULL

nrow, NROW, and the corresponding column functions work on lists in the same way as on vectors:

nrow(a_list)
## NULL
ncol(a_list)
## NULL
NROW(a_list)
## [1] 4
NCOL(a_list)
## [1] 1

Unlike with vectors, arithmetic doesn’t work on lists. Since each element can be of a different type, it doesn’t make sense to be able to add or multiply two lists together. It is possible to do arithmetic on list elements, however, assuming that they are of an appropriate type. In that case, the usual rules for the element contents apply. For example:

l1 <- list(1:5)
l2 <- list(6:10)
l1[[1]] + l2[[1]]
## [1]  7  9 11 13 15

More commonly, you might want to perform arithmetic (or some other operation) on every element of a list. This requires looping, and will be discussed in Chapter 8.
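
As a brief preview, and only as a minimal sketch, base R's lapply function applies a function to each element of a list and returns a list of the results. The list name numbers is made up for this illustration:

```r
# A hypothetical list whose elements are all numeric
numbers <- list(a = 1:3, b = c(10, 20))

# Multiply every element by 2; the result is again a list
doubled <- lapply(numbers, function(x) x * 2)
doubled$a
## [1] 2 4 6
doubled$b
## [1] 20 40
```

The usual vector arithmetic rules apply inside the function, one list element at a time.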

Indexing Lists

Consider this test list:

l <- list(
  first  = 1,
  second = 2,
  third  = list(
    alpha = 3.1,
    beta  = 3.2
  )
)

As with vectors, we can access elements of the list using square brackets, [], and positive or negative numeric indices, element names, or a logical index. The following four lines of code all give the same result:

l[1:2]
## $first
## [1] 1
##
## $second
## [1] 2
l[-3]
## $first
## [1] 1
##
## $second
## [1] 2
l[c("first", "second")]
## $first
## [1] 1
##
## $second
## [1] 2
l[c(TRUE, TRUE, FALSE)]
## $first
## [1] 1
##
## $second
## [1] 2

The result of these indexing operations is another list. Sometimes we want to access the contents of the list elements instead. There are two operators to help us do this. Double square brackets ([[]]) can be given a single positive integer denoting the index to return, or a single string naming that element:

l[[1]]
## [1] 1
l[["first"]]
## [1] 1

The is.list function returns TRUE if the input is a list, and FALSE otherwise. We can use it to compare the results of the two indexing operators:

is.list(l[1])
## [1] TRUE
is.list(l[[1]])
## [1] FALSE

For named elements of lists, we can also use the dollar sign operator, $. This works almost the same way as passing the element name as a string to the double square brackets, but has two advantages. Firstly, many IDEs will autocomplete the name for you. (In R GUI, press Tab for this feature.) Secondly, R accepts partial matches of element names:

l$first
## [1] 1
l$f     #partial matching interprets "f" as "first"
## [1] 1
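
Partial matching has some edge cases that are worth seeing once. The following sketch uses a made-up list, greek, in which two names share a prefix. It shows that an ambiguous prefix matches nothing, and that double square brackets only match partially when asked to via the exact argument:

```r
# Hypothetical list with two names that share a prefix
greek <- list(alpha = 1, alpine = 2, beta = 3)

greek$b                      # unique prefix, so this matches "beta"
## [1] 3
greek$al                     # ambiguous prefix: matches nothing
## NULL
greek[["b"]]                 # [[ ]] matches exactly by default
## NULL
greek[["b", exact = FALSE]]  # opt in to partial matching
## [1] 3
```

Because an ambiguous or mistyped prefix silently gives NULL, spelling names out in full is the safer habit in scripts.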

To access nested elements, we can stack up the square brackets or pass in a vector, though the latter method is less common and usually harder to read:

l[["third"]]["beta"]
## $beta
## [1] 3.2
l[["third"]][["beta"]]
## [1] 3.2
l[[c("third", "beta")]]
## [1] 3.2

The behavior when you try to access a nonexistent element of a list varies depending upon the type of indexing that you have used. For the next example, recall that our list, l, has only three elements.

If we use single square-bracket indexing, then the resulting list has an element with the value NULL (and name NA, if the original list has names). Compare this to bad indexing of a vector where the return value is NA:

l[c(4, 2, 5)]
## $<NA>
## NULL
##
## $second
## [1] 2
##
## $<NA>
## NULL
l[c("fourth", "second", "fifth")]
## $<NA>
## NULL
##
## $second
## [1] 2
##
## $<NA>
## NULL

Trying to access the contents of an element with an incorrect name, either with double square brackets or a dollar sign, returns NULL:

l[["fourth"]]
## NULL
l$fourth
## NULL

Finally, trying to access the contents of an element with an incorrect numerical index throws an error, stating that the subscript is out of bounds. This inconsistency in behavior is something that you just need to accept, though the best defense is to make sure that you check your indices before you use them:

l[[4]]       #this throws an error
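
One way to check an index before using it is sketched here. The helper name get_element is hypothetical, not a base R function; it compares the index against the list's length and falls back to a default value. The last line shows the alternative of catching the error with tryCatch:

```r
l <- list(first = 1, second = 2, third = list(alpha = 3.1, beta = 3.2))

# Hypothetical helper: return the contents at position i,
# or a default value when i is not a valid position
get_element <- function(x, i, default = NULL) {
  if (i >= 1 && i <= length(x)) x[[i]] else default
}

get_element(l, 2)
## [1] 2
get_element(l, 4, default = NA)
## [1] NA

# Alternatively, catch the error rather than preventing it
tryCatch(l[[4]], error = function(e) NA)
## [1] NA
```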

Converting Between Vectors and Lists

Vectors can be converted to lists using the function as.list. This creates a list with each element of the vector mapping to a list element containing one value:

busy_beaver <- c(1, 6, 21, 107)  #See http://oeis.org/A060843
as.list(busy_beaver)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 6
##
## [[3]]
## [1] 21
##
## [[4]]
## [1] 107

If each element of the list contains a scalar value, then it is also possible to convert that list to a vector using the functions that we have already seen (as.numeric, as.character, and so on):

as.numeric(list(1, 6, 21, 107))
## [1]   1   6  21 107

This technique won’t work in cases where the list contains nonscalar elements. This is a real issue, because as well as storing different types of data, lists are very useful for storing data of the same type, but with a nonrectangular shape:

(prime_factors <- list(
  two   = 2,
  three = 3,
  four  = c(2, 2),
  five  = 5,
  six   = c(2, 3),
  seven = 7,
  eight = c(2, 2, 2),
  nine  = c(3, 3),
  ten   = c(2, 5)
))
## $two
## [1] 2
##
## $three
## [1] 3
##
## $four
## [1] 2 2
##
## $five
## [1] 5
##
## $six
## [1] 2 3
##
## $seven
## [1] 7
##
## $eight
## [1] 2 2 2
##
## $nine
## [1] 3 3
##
## $ten
## [1] 2 5

This sort of list can be converted to a vector using the function unlist (it is sometimes technically possible to do this with mixed-type lists, but rarely useful):

unlist(prime_factors)
##    two  three  four1  four2   five   six1   six2  seven eight1 eight2
##      2      3      2      2      5      2      3      7      2      2
## eight3  nine1  nine2   ten1   ten2
##      2      3      3      2      5
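
The generated names (four1, four2, and so on) are not always wanted. As a small sketch on a shortened version of the list, the use.names argument of unlist drops them, and sapply with length reports how many values each element contributes:

```r
prime_factors <- list(four = c(2, 2), five = 5, six = c(2, 3))

# Drop the generated names to get a plain numeric vector
unlist(prime_factors, use.names = FALSE)
## [1] 2 2 5 2 3

# The number of values contributed by each element
sapply(prime_factors, length)
## four five  six
##    2    1    2
```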

Combining Lists

The c function that we have used for concatenating vectors also works for concatenating lists:

c(list(a = 1, b = 2), list(3))
## $a
## [1] 1
##
## $b
## [1] 2
##
## [[3]]
## [1] 3

If we use it to concatenate lists and vectors, the vectors are converted to lists (as though as.list had been called on them) before the concatenation occurs:

c(list(a = 1, b = 2), 3)
## $a
## [1] 1
##
## $b
## [1] 2
##
## [[3]]
## [1] 3

It is also possible to use the cbind and rbind functions on lists, but the resulting objects are very strange indeed. They are matrices with possibly nonscalar elements, or lists with dimensions, depending upon which way you want to look at them:

(matrix_list_hybrid <- cbind(
  list(a = 1, b = 2),
  list(c = 3, list(d = 4))
))
##   [,1] [,2]
## a 1    3
## b 2    List,1
str(matrix_list_hybrid)
## List of 4
##  $ : num 1
##  $ : num 2
##  $ : num 3
##  $ :List of 1
##   ..$ d: num 4
##  - attr(*, "dim")= int [1:2] 2 2
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:2] "a" "b"
##   ..$ : NULL

Using cbind and rbind in this way is something you shouldn’t do often, and probably not at all. It’s another case of R being a little too flexible and accommodating, instead of telling you that you’ve done something silly by throwing an error.

NULL

NULL is a special value that represents an empty variable. Its most common use is in lists, but it also crops up with data frames and function arguments. These other uses will be discussed later.

When you create a list, you may wish to specify that an element should exist, but should have no contents. For example, the following list contains UK bank holidays[17] for 2013 by month. Some months have no bank holidays, so we use NULL to represent this absence:

(uk_bank_holidays_2013 <- list(
  Jan = "New Year's Day",
  Feb = NULL,
  Mar = "Good Friday",
  Apr = "Easter Monday",
  May = c("Early May Bank Holiday", "Spring Bank Holiday"),
  Jun = NULL,
  Jul = NULL,
  Aug = "Summer Bank Holiday",
  Sep = NULL,
  Oct = NULL,
  Nov = NULL,
  Dec = c("Christmas Day", "Boxing Day")
))
## $Jan
## [1] "New Year's Day"
##
## $Feb
## NULL
##
## $Mar
## [1] "Good Friday"
##
## $Apr
## [1] "Easter Monday"
##
## $May
## [1] "Early May Bank Holiday" "Spring Bank Holiday"
##
## $Jun
## NULL
##
## $Jul
## NULL
##
## $Aug
## [1] "Summer Bank Holiday"
##
## $Sep
## NULL
##
## $Oct
## NULL
##
## $Nov
## NULL
##
## $Dec
## [1] "Christmas Day" "Boxing Day"

It is important to understand the difference between NULL and the special missing value NA. The biggest difference is that NA is a scalar value, whereas NULL takes up no space at all—it has length zero:

length(NULL)
## [1] 0
length(NA)
## [1] 1

You can test for NULL using the function is.null. Missing values are not null:

is.null(NULL)
## [1] TRUE
is.null(NA)
## [1] FALSE

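The converse test doesn’t really make much sense. Since NULL has length zero, we have nothing to test to see if it is missing. The result is an empty logical vector; the warning shown here comes from older versions of R, and recent versions omit it: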

is.na(NULL)
## Warning: is.na() applied to non-(list or vector) of type 'NULL'
## logical(0)

NULL can also be used to remove elements of a list. Setting an element to NULL (even if it already contains NULL) will remove it. Suppose that for some reason we want to switch to an old-style Roman 10-month calendar, removing January and February:

uk_bank_holidays_2013$Jan <- NULL
uk_bank_holidays_2013$Feb <- NULL
uk_bank_holidays_2013
## $Mar
## [1] "Good Friday"
##
## $Apr
## [1] "Easter Monday"
##
## $May
## [1] "Early May Bank Holiday" "Spring Bank Holiday"
##
## $Jun
## NULL
##
## $Jul
## NULL
##
## $Aug
## [1] "Summer Bank Holiday"
##
## $Sep
## NULL
##
## $Oct
## NULL
##
## $Nov
## NULL
##
## $Dec
## [1] "Christmas Day" "Boxing Day"

To set an existing element to be NULL, we cannot simply assign the value of NULL, since that will remove the element. Instead, it must be set to list(NULL). Now suppose that the UK government becomes mean and cancels the summer bank holiday:

uk_bank_holidays_2013["Aug"] <- list(NULL)
uk_bank_holidays_2013
## $Mar
## [1] "Good Friday"
##
## $Apr
## [1] "Easter Monday"
##
## $May
## [1] "Early May Bank Holiday" "Spring Bank Holiday"
##
## $Jun
## NULL
##
## $Jul
## NULL
##
## $Aug
## NULL
##
## $Sep
## NULL
##
## $Oct
## NULL
##
## $Nov
## NULL
##
## $Dec
## [1] "Christmas Day" "Boxing Day"
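
The two kinds of assignment are easy to confuse, so here they are side by side on a shorter list. The name holidays is made up for this sketch:

```r
holidays <- list(Jan = "New Year's Day", Feb = NULL, Mar = "Good Friday")
length(holidays)
## [1] 3

holidays["Mar"] <- list(NULL)  # keeps the element, empties its contents
length(holidays)
## [1] 3
is.null(holidays$Mar)
## [1] TRUE

holidays$Mar <- NULL           # removes the element altogether
names(holidays)
## [1] "Jan" "Feb"
```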

Pairlists

R has another sort of list, the pairlist. Pairlists are used internally to pass arguments into functions, but you should almost never have to actively use them. Possibly the only time[18] that you are likely to explicitly see a pairlist is when using formals. That function returns a pairlist of the arguments of a function.

Looking at the help page for the standard deviation function, ?sd, we see that it takes two arguments, a vector x and a logical value na.rm, which has a default value of FALSE:

(arguments_of_sd <- formals(sd))
## $x
##
##
## $na.rm
## [1] FALSE
class(arguments_of_sd)
## [1] "pairlist"

For most practical purposes, pairlists behave like lists. The only difference is that a pairlist of length zero is NULL, but a list of length zero is just an empty list:

pairlist()
## NULL
list()
## list()

Data Frames

Data frames are used to store spreadsheet-like data. They can be thought of either as matrices where each column can store a different type of data, or as nonnested lists where each element is of the same length.

Creating Data Frames

We create data frames with the data.frame function:

(a_data_frame <- data.frame(
  x = letters[1:5],
  y = rnorm(5),
  z = runif(5) > 0.5
))
##   x        y    z
## 1 a  0.17581 TRUE
## 2 b  0.06894 TRUE
## 3 c  0.74217 TRUE
## 4 d  0.72816 TRUE
## 5 e -0.28940 TRUE
class(a_data_frame)
## [1] "data.frame"

Notice that each column can have a different type than the other columns, but that all the elements within a column are the same type. Also notice that the class of the object is data.frame, with a dot rather than a space.

In this example, the rows have been automatically numbered from one to five. If any of the input vectors had names, then the row names would have been taken from the first such vector. For example, if y had names, then those would be given to the data frame:

y <- rnorm(5)
names(y) <- month.name[1:5]
data.frame(
  x = letters[1:5],
  y = y,
  z = runif(5) > 0.5
)
##          x       y     z
## January  a -0.9373 FALSE
## February b  0.7314  TRUE
## March    c -0.3030  TRUE
## April    d -1.3307 FALSE
## May      e -0.6857 FALSE

This behavior can be overridden by passing the argument row.names = NULL to the data.frame function:

data.frame(
  x = letters[1:5],
  y = y,
  z = runif(5) > 0.5,
  row.names = NULL
)
##   x       y     z
## 1 a -0.9373 FALSE
## 2 b  0.7314 FALSE
## 3 c -0.3030  TRUE
## 4 d -1.3307  TRUE
## 5 e -0.6857 FALSE

It is also possible to provide your own row names by passing a vector to row.names. This vector will be converted to character, if it isn’t already that type:

data.frame(
  x = letters[1:5],
  y = y,
  z = runif(5) > 0.5,
  row.names = c("Jackie", "Tito", "Jermaine", "Marlon", "Michael")
)
##          x       y     z
## Jackie   a -0.9373  TRUE
## Tito     b  0.7314 FALSE
## Jermaine c -0.3030  TRUE
## Marlon   d -1.3307 FALSE
## Michael  e -0.6857 FALSE

The row names can be retrieved or changed at a later date, in the same manner as with matrices, using rownames (or row.names). Likewise, colnames and dimnames can be used to get or set the column and dimension names, respectively. In fact, more or less all the functions that can be used to inspect matrices can also be used with data frames. nrow, ncol, and dim also work in exactly the same way as they do in matrices:

rownames(a_data_frame)
## [1] "1" "2" "3" "4" "5"
colnames(a_data_frame)
## [1] "x" "y" "z"
dimnames(a_data_frame)
## [[1]]
## [1] "1" "2" "3" "4" "5"
##
## [[2]]
## [1] "x" "y" "z"
nrow(a_data_frame)
## [1] 5
ncol(a_data_frame)
## [1] 3
dim(a_data_frame)
## [1] 5 3

There are two quirks that you need to be aware of. First, length returns the same value as ncol, not the total number of elements in the data frame. Second, names returns the same value as colnames. For clarity of code, I recommend that you avoid these two functions, and use ncol and colnames instead:

length(a_data_frame)
## [1] 3
names(a_data_frame)
## [1] "x" "y" "z"

It is possible to create a data frame by passing different lengths of vectors, as long as the lengths allow the shorter ones to be recycled an exact number of times. More technically, the lowest common multiple of all the lengths must be equal to the length of the longest vector:

data.frame(        #lengths 1, 2, and 4 are OK
  x = 1,           #recycled 4 times
  y = 2:3,         #recycled twice
  z = 4:7          #the longest input; no recycling
)
##   x y z
## 1 1 2 4
## 2 1 3 5
## 3 1 2 6
## 4 1 3 7

If the lengths are not compatible, then an error will be thrown:

data.frame(       #lengths 1, 2, and 3 cause an error
  x = 1,          #lowest common multiple is 6, which is more than 3
  y = 2:3,
  z = 4:6
)
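
The recycling rule can be checked before calling data.frame. This is only a sketch: lengths_are_compatible is a hypothetical helper, and it tests the equivalent condition that every length divides the longest length exactly. The failing call is wrapped in tryCatch so that the error message is captured instead of stopping the script:

```r
# Hypothetical helper: TRUE when every length divides the longest length
lengths_are_compatible <- function(...) {
  n <- c(...)
  all(max(n) %% n == 0)
}

lengths_are_compatible(1, 2, 4)
## [1] TRUE
lengths_are_compatible(1, 2, 3)
## [1] FALSE

# Capture the error message rather than halting
msg <- tryCatch(
  data.frame(x = 1, y = 2:3, z = 4:6),
  error = function(e) conditionMessage(e)
)
```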

One other consideration when creating data frames is that by default the column names are checked to be unique, valid variable names. This feature can be turned off by passing check.names = FALSE to data.frame:

data.frame(
  "A column"   = letters[1:5],
  "!@#$%^&*()" = rnorm(5),
  "..."        = runif(5) > 0.5,
  check.names  = FALSE
)
##   A column !@#$%^&*()   ...
## 1        a    0.32940  TRUE
## 2        b   -1.81969  TRUE
## 3        c    0.22951 FALSE
## 4        d   -0.06705  TRUE
## 5        e   -1.58005  TRUE

In general, having nonstandard column names is a bad idea. Duplicating column names is even worse, since it can lead to hard-to-find bugs once you start taking subsets. Turn off the column name checking at your own peril.
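
For comparison, this short sketch shows what the default checking does to a nonstandard name. According to the data.frame help page, names are adjusted with the base function make.names, which can also be called directly:

```r
# With the default check.names = TRUE, a space becomes a dot
names(data.frame("A column" = 1:2))
## [1] "A.column"

# make.names can also make duplicated names unique
make.names(c("A column", "a", "a"), unique = TRUE)
## [1] "A.column" "a"        "a.1"
```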

Indexing Data Frames

There are lots of different ways of indexing a data frame. To start with, pairs of the four different vector indices (positive integers, negative integers, logical values, and characters) can be used in exactly the same way as with matrices. These commands both select the second and third rows of the first two columns:

a_data_frame[2:3, -3]
##   x       y
## 2 b 0.06894
## 3 c 0.74217
a_data_frame[c(FALSE, TRUE, TRUE, FALSE, FALSE), c("x", "y")]
##   x       y
## 2 b 0.06894
## 3 c 0.74217

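Since more than one column was selected, the resultant subset is also a data frame. If only one column had been selected, the result would have been simplified to be a vector. In the output here that vector is a factor, because data.frame converted strings to factors by default in the version of R used; since R 4.0.0 the default is stringsAsFactors = FALSE, and the class would be character: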

class(a_data_frame[2:3, -3])
## [1] "data.frame"
class(a_data_frame[2:3, 1])
## [1] "factor"
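
If the simplification to a vector is not wanted, matrix-style indexing accepts drop = FALSE, just as it does for matrices. A minimal sketch, using a small made-up data frame df rather than a_data_frame so that the result does not depend on random numbers:

```r
# A small deterministic stand-in for a_data_frame
df <- data.frame(x = letters[1:5], y = 1:5)

class(df[2:3, "y"])                # simplified to a vector
## [1] "integer"
class(df[2:3, "y", drop = FALSE])  # stays a data frame
## [1] "data.frame"
```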

If we only want to select one column, then list-style indexing (double square brackets with a positive integer or name, or the dollar sign operator with a name) can also be used. These commands all select the second and third elements of the first column:

a_data_frame$x[2:3]
## [1] b c
## Levels: a b c d e
a_data_frame[[1]][2:3]
## [1] b c
## Levels: a b c d e
a_data_frame[["x"]][2:3]
## [1] b c
## Levels: a b c d e

If we are trying to subset a data frame by placing conditions on columns, the syntax can get a bit clunky, and the subset function provides a cleaner alternative. subset takes up to three arguments: a data frame to subset, a logical vector of conditions for rows to include, and a vector of column names to keep (if this last argument is omitted, then all the columns are kept). The genius of subset is that it uses special evaluation techniques to let you avoid doing some typing: instead of you having to type a_data_frame$y to access the y column of a_data_frame, it already knows which data frame to look in, so you can just type y. Likewise, when selecting columns, you don’t need to enclose the names of the columns in quotes; you can just type the names directly. In this next example, recall that | is the operator for logical or:

a_data_frame[a_data_frame$y > 0 | a_data_frame$z, "x"]
## [1] a b c d e
## Levels: a b c d e
subset(a_data_frame, y > 0 | z, x)
##   x
## 1 a
## 2 b
## 3 c
## 4 d
## 5 e
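
The two approaches also differ when the condition contains missing values. In this sketch (the data frame df is made up for the purpose), logical indexing with square brackets returns an extra row of NAs for the missing condition, whereas subset leaves that row out:

```r
# Hypothetical data with a missing value in the condition column
df <- data.frame(x = c("a", "b", "c"), y = c(1, NA, -1))

nrow(df[df$y > 0, ])     # the NA condition yields an extra all-NA row
## [1] 2
nrow(subset(df, y > 0))  # subset drops rows where the condition is NA
## [1] 1
```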

Basic Data Frame Manipulation

Like matrices, data frames can be transposed using the t function, but in the process all the columns (which become rows) are converted to the same type, and the whole thing becomes a matrix:

t(a_data_frame)
##   [,1]       [,2]       [,3]       [,4]       [,5]
## x "a"        "b"        "c"        "d"        "e"
## y " 0.17581" " 0.06894" " 0.74217" " 0.72816" "-0.28940"
## z "TRUE"     "TRUE"     "TRUE"     "TRUE"     "TRUE"

Data frames can also be joined together using cbind and rbind, assuming that they have the appropriate sizes. rbind is smart enough to reorder the columns to match. cbind doesn’t check column names for duplicates, though, so be careful with it:

another_data_frame <- data.frame(  #same cols as a_data_frame, different order
  z = rlnorm(5),                   #lognormally distributed numbers
  y = sample(5),                   #the numbers 1 to 5, in some order
  x = letters[3:7]
)
rbind(a_data_frame, another_data_frame)
##    x        y      z
## 1  a  0.17581 1.0000
## 2  b  0.06894 1.0000
## 3  c  0.74217 1.0000
## 4  d  0.72816 1.0000
## 5  e -0.28940 1.0000
## 6  c  1.00000 0.8714
## 7  d  3.00000 0.2432
## 8  e  5.00000 2.3498
## 9  f  4.00000 2.0263
## 10 g  2.00000 1.7145
cbind(a_data_frame, another_data_frame)
##   x        y    z      z y x
## 1 a  0.17581 TRUE 0.8714 1 c
## 2 b  0.06894 TRUE 0.2432 3 d
## 3 c  0.74217 TRUE 2.3498 5 e
## 4 d  0.72816 TRUE 2.0263 4 f
## 5 e -0.28940 TRUE 1.7145 2 g
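
Because the output above depends on random numbers, the same two behaviors are sketched here with small fixed data frames (df1 and df2 are made-up names). The last line shows one possible way to detect duplicated column names after a cbind:

```r
df1 <- data.frame(x = 1:2, y = c("a", "b"))
df2 <- data.frame(y = "c", x = 3L)  # same columns, opposite order

rbind(df1, df2)$x          # columns were matched by name
## [1] 1 2 3

combined <- cbind(df1, df2)
names(combined)            # duplicates are kept as they are
## [1] "x" "y" "y" "x"
combined$x                 # $ silently returns the first match
## [1] 1 2
anyDuplicated(names(combined)) > 0  # a quick check for duplicates
## [1] TRUE
```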

Where two data frames share columns, they can be merged together using the merge function. merge provides a variety of options for doing database-style joins. To join two data frames, you need to specify which columns contain the key values to match up. By default, the merge function uses all the common columns from the two data frames, but more commonly you will just want to use a single shared ID column. In the following examples, we specify that the x column contains our IDs using the by argument:

merge(a_data_frame, another_data_frame, by = "x")
##   x     y.x  z.x    z.y y.y
## 1 c  0.7422 TRUE 0.8714   1
## 2 d  0.7282 TRUE 0.2432   3
## 3 e -0.2894 TRUE 2.3498   5
merge(a_data_frame, another_data_frame, by = "x", all = TRUE)
##   x      y.x  z.x    z.y y.y
## 1 a  0.17581 TRUE     NA  NA
## 2 b  0.06894 TRUE     NA  NA
## 3 c  0.74217 TRUE 0.8714   1
## 4 d  0.72816 TRUE 0.2432   3
## 5 e -0.28940 TRUE 2.3498   5
## 6 f       NA   NA 2.0263   4
## 7 g       NA   NA 1.7145   2
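
Two other merge options are worth a quick sketch: all.x = TRUE keeps every row of the first data frame only (a left join), and by.x and by.y name the key columns when they differ between the two data frames. The tables people and scores are invented for this example:

```r
# Hypothetical tables whose key columns have different names
people <- data.frame(id = 1:3, name = c("Ann", "Bob", "Cy"))
scores <- data.frame(person_id = c(1, 3), score = c(90, 75))

# all.x = TRUE keeps every row of the first data frame (a left join)
merged <- merge(people, scores, by.x = "id", by.y = "person_id", all.x = TRUE)
merged$score
## [1] 90 NA 75
```

The person with no matching score is kept, with NA filling the gap.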

Where a data frame has all numeric values, the functions colSums and colMeans can be used to calculate the sums and means of each column, respectively. Similarly, rowSums and rowMeans calculate the sums and means of each row:

colSums(a_data_frame[, 2:3])
##     y     z
## 1.426 5.000
colMeans(a_data_frame[, 2:3])
##      y      z
## 0.2851 1.0000
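
The row versions work the same way. Here is a minimal sketch on a small made-up data frame, scores; unname is used only so that the printed values carry no labels:

```r
# A small all-numeric data frame, made up for this example
scores <- data.frame(test1 = c(1, 2, 3), test2 = c(4, 5, 6))

unname(rowSums(scores))
## [1] 5 7 9
unname(rowMeans(scores))
## [1] 2.5 3.5 4.5
```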

Manipulating data frames is a huge topic, and is covered in more depth in Chapter 13.

Summary

  • Lists can contain different sizes and types of variables in each element.
  • Lists are recursive variables, since they can contain other lists.
  • You can index lists using [], [[]], or $.
  • NULL is a special value that can be used to create “empty” list elements.
  • Data frames store spreadsheet-like data.
  • Data frames have some properties of matrices (they are rectangular), and some of lists (different columns can contain different sorts of variables).
  • Data frames can be indexed like matrices or like lists.
  • merge lets you do database-style joins on data frames.

Test Your Knowledge: Quiz

Question 5-1
What is the length of this list?
list(alpha = 1, list(beta = 2, gamma = 3, delta = 4), eta = NULL)
## $alpha
## [1] 1
##
## [[2]]
## [[2]]$beta
## [1] 2
##
## [[2]]$gamma
## [1] 3
##
## [[2]]$delta
## [1] 4
##
##
## $eta
## NULL
Question 5-2
Where might you find a pairlist being used?
Question 5-3
Name as many ways as you can think of to create a subset of a data frame.
Question 5-4
How would you create a data frame where the column names weren’t unique, valid variable names?
Question 5-5
Which function would you use to append one data frame to another?

Test Your Knowledge: Exercises

Exercise 5-1
Create a list variable that contains all the square numbers in the range 0 to 9 in the first element, in the range 10 to 19 in the second element, and so on, up to a final element with square numbers in the range 90 to 99. Elements with no square numbers should be included! [10]
Exercise 5-2
R ships with several built-in datasets, including the famous[19] iris (flowers, not eyes) data collected by Anderson and analyzed by Fisher in the 1930s. Type iris to see the dataset. Create a new data frame that consists of the numeric columns of the iris dataset, and calculate the means of its columns. [5]
Exercise 5-3
The beaver1 and beaver2 datasets contain body temperatures of two beavers. Add a column named id to the beaver1 dataset, where the value is always 1. Similarly, add an id column to beaver2, with value 2. Vertically concatenate the two data frames and find the subset where either beaver is active. [10]


[17] Bank holidays are public holidays.

[18] R also stores some global settings in a pairlist variable called .Options in the base environment. You shouldn’t access this variable directly, but instead use the function options, which returns a list.

[19] By some definitions of fame.
