Chapter 5. Advanced Data Structures

Sometimes data requires more complex storage than simple vectors and thankfully R provides a host of data structures. The most common are the data.frame, matrix and list followed by the array. Of these, the data.frame will be most familiar to anyone who has used a spreadsheet, the matrix to people familiar with matrix math and the list to programmers.

5.1. data.frames

Perhaps one of the most useful features of R is the data.frame. It is one of the most often cited reasons for R’s ease of use.

On the surface a data.frame is just like an Excel spreadsheet in that it has columns and rows. In statistical terms, each column is a variable and each row is an observation.

In terms of how R organizes data.frames, each column is actually a vector, each of which has the same length. That is very important because it lets each column hold a different type of data (see Section 4.3). This also implies that within a column each element must be of the same type, just like with vectors.

There are numerous ways to construct a data.frame, the simplest being to use the data.frame function. Let’s create a basic data.frame using some of the vectors we have already introduced, namely x, y and q.

> x <- 10:1
> y <- -4:5
> q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
+        "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
> theDF <- data.frame(x, y, q)
> theDF

    x  y          q
1  10 -4     Hockey
2   9 -3   Football
3   8 -2   Baseball
4   7 -1    Curling
5   6  0      Rugby
6   5  1   Lacrosse
7   4  2 Basketball
8   3  3     Tennis
9   2  4    Cricket
10  1  5     Soccer

This creates a 10x3 data.frame consisting of those three vectors. Notice that the column names of theDF are simply the names of the variables. We could have assigned names during the creation process, which is generally a good idea.

> theDF <- data.frame(First = x, Second = y, Sport = q)
> theDF

   First Second      Sport
1     10     -4     Hockey
2      9     -3   Football
3      8     -2   Baseball
4      7     -1    Curling
5      6      0      Rugby
6      5      1   Lacrosse
7      4      2 Basketball
8      3      3     Tennis
9      2      4    Cricket
10     1      5     Soccer

data.frames are complex objects with many attributes. The most frequently checked attributes are the number of rows and columns. Of course there are functions to do this for us: nrow and ncol. And in case both are wanted at the same time there is the dim function.

> nrow(theDF)

[1] 10

> ncol(theDF)

[1] 3

> dim(theDF)

[1] 10  3
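
The earlier point that each column is its own vector, all of equal length, can be checked directly. The following is a minimal sketch that rebuilds theDF so it runs on its own. Passing stringsAsFactors = TRUE is an assumption added here so that Sport is a factor on any version of R, and colClasses and colLengths are names introduced only for this illustration.

```r
# rebuild the example data.frame
# stringsAsFactors = TRUE is an assumption so Sport is a factor in any R version
x <- 10:1
y <- -4:5
q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
       "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
theDF <- data.frame(First = x, Second = y, Sport = q, stringsAsFactors = TRUE)

# each column is a vector with its own class
colClasses <- sapply(theDF, class)
colClasses

# every column has the same length, which equals the number of rows
colLengths <- sapply(theDF, length)
all(colLengths == nrow(theDF))
```

The two integer columns and the factor column sit side by side, which is exactly what a matrix cannot do.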

Checking the column names of a data.frame is as simple as using the names function. This returns a character vector listing the columns. Since it is a vector we can access individual elements of it just like any other vector.

> names(theDF)

[1] "First" "Second" "Sport"

> names(theDF)[3]

[1] "Sport"

We can also check and assign the row names of a data.frame.

> rownames(theDF)

 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

> rownames(theDF) <- c("One", "Two", "Three", "Four", "Five", "Six",
+                      "Seven", "Eight", "Nine", "Ten")
> rownames(theDF)

 [1] "One"   "Two"   "Three" "Four"  "Five"  "Six"   "Seven" "Eight"
 [9] "Nine"  "Ten"

> # set them back to the generic index
> rownames(theDF) <- NULL
> rownames(theDF)

 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

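Usually a data.frame has far too many rows to print them all to the screen, so thankfully the head function prints out only the first few rows. Its counterpart, tail, prints the last few rows, and both accept an n argument to control how many rows are shown.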

> head(theDF)

  First Second    Sport
1    10     -4   Hockey
2     9     -3 Football
3     8     -2 Baseball
4     7     -1  Curling
5     6      0    Rugby
6     5      1 Lacrosse

> head(theDF, n = 7)

  First Second      Sport
1    10     -4     Hockey
2     9     -3   Football
3     8     -2   Baseball
4     7     -1    Curling
5     6      0      Rugby
6     5      1   Lacrosse
7     4      2 Basketball

> tail(theDF)

  First Second      Sport
5     6      0      Rugby
6     5      1   Lacrosse
7     4      2 Basketball
8     3      3     Tennis
9     2      4    Cricket
10    1      5     Soccer
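
As a small sketch of how n controls both functions, the code below uses a two-column stand-in for theDF (the Sport column is left out for brevity). The names firstThree and lastThree are introduced only for this illustration.

```r
# a two-column stand-in for theDF
theDF <- data.frame(First = 10:1, Second = -4:5)

# the first three rows
firstThree <- head(theDF, n = 3)
firstThree

# the last three rows
lastThree <- tail(theDF, n = 3)
lastThree
```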

As we can with other variables, we can check the class of a data.frame using the class function.

> class(theDF)

[1] "data.frame"

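Since each column of the data.frame is an individual vector, it can be accessed individually and each has its own class. As with many other aspects of R, there are multiple ways to access an individual column: the $ operator and the square brackets. Running theDF$Sport returns the third column of theDF, which lets us specify one particular column by name. The output below shows Sport as a factor, which is how data.frame stored character columns by default before R 4.0.0; in later versions the column remains a character vector unless stringsAsFactors = TRUE is passed to data.frame.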

> theDF$Sport

 [1] Hockey     Football   Baseball   Curling    Rugby      Lacrosse
 [7] Basketball Tennis     Cricket    Soccer
10 Levels: Baseball Basketball Cricket Curling Football ... Tennis

Similar to vectors, data.frames allow us to access individual elements by their position using square brackets, but two positions are specified instead of one. The first is the row number and the second is the column number. So to get the element in the third row of the second column we use theDF[3, 2].

> theDF[3, 2]

[1] -2

To specify more than one row or column use a vector of indices.

> # row 3, columns 2 through 3
> theDF[3, 2:3]

  Second    Sport
3     -2 Baseball

>
> # rows 3 and 5, column 2
> # since only one column was selected it was returned as a vector
> # hence the column names will not be printed
> theDF[c(3, 5), 2]

[1] -2  0

>
> # rows 3 and 5, columns 2 through 3
> theDF[c(3, 5), 2:3]

  Second    Sport
3     -2 Baseball
5      0    Rugby

To access an entire row, specify that row while not specifying any column. Likewise, to access an entire column, specify that column while not specifying any row.

> # all of column 3
> # since it is only one column a vector is returned
> theDF[, 3]

 [1] Hockey     Football   Baseball   Curling    Rugby      Lacrosse
 [7] Basketball Tennis     Cricket    Soccer
10 Levels: Baseball Basketball Cricket Curling Football ... Tennis
>
> # all of columns 2 through 3
> theDF[, 2:3]

  Second      Sport
1     -4     Hockey
2     -3   Football
3     -2   Baseball
4     -1    Curling
5      0      Rugby
6      1   Lacrosse
7      2 Basketball
8      3     Tennis
9      4    Cricket
10     5     Soccer

>
> # all of row 2
> theDF[2, ]

  First Second    Sport
2     9     -3 Football

>
> # all of rows 2 through 4
> theDF[2:4, ]

  First Second    Sport
2     9     -3 Football
3     8     -2 Baseball
4     7     -1  Curling
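
The row and column rules above can be checked with a short sketch. It uses a two-column stand-in for theDF, and the name rowSubset is introduced only for this illustration.

```r
# a two-column stand-in for theDF
theDF <- data.frame(First = 10:1, Second = -4:5)

# row 3, column 2 is the same element as the third entry of the Second column
theDF[3, 2]
identical(theDF[3, 2], theDF$Second[3])

# leaving the column position empty keeps every column
rowSubset <- theDF[2:4, ]
dim(rowSubset)

# the original row names travel with the selected rows
rownames(rowSubset)
```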

To access multiple columns by name, make the column argument a character vector of the names.

> theDF[, c("First", "Sport")]

   First      Sport
1     10     Hockey
2      9   Football
3      8   Baseball
4      7    Curling
5      6      Rugby
6      5   Lacrosse
7      4 Basketball
8      3     Tennis
9      2    Cricket
10     1     Soccer

Yet another way to access a specific column is to use its column name (or its number) either as the second argument to the square brackets or as the only argument to either single or double square brackets.

> # just the "Sport" column
> # since it is one column it returns as a (factor) vector
> theDF[, "Sport"]

 [1] Hockey     Football   Baseball   Curling    Rugby      Lacrosse
 [7] Basketball Tennis     Cricket    Soccer
10 Levels: Baseball Basketball Cricket Curling Football ... Tennis

> class(theDF[, "Sport"])

[1] "factor"

>
> # just the "Sport" column
> # this returns a one column data.frame
> theDF["Sport"]

        Sport
1      Hockey
2    Football
3    Baseball
4     Curling
5       Rugby
6    Lacrosse
7  Basketball
8      Tennis
9     Cricket
10     Soccer

> class(theDF["Sport"])

[1] "data.frame"

>
> # just the "Sport" column
> # this also returns a (factor) vector
> theDF[["Sport"]]

 [1] Hockey     Football   Baseball   Curling    Rugby      Lacrosse
 [7] Basketball Tennis     Cricket    Soccer
10 Levels: Baseball Basketball Cricket Curling Football ... Tennis

> class(theDF[["Sport"]])

[1] "factor"

All of these methods have differing outputs. Some return a vector, some return a single-column data.frame. To ensure a single-column data.frame while using single-square brackets, there is a third argument: drop=FALSE. This also works when specifying a single column by number.

> theDF[, "Sport", drop = FALSE]

        Sport
1      Hockey
2    Football
3    Baseball
4     Curling
5       Rugby
6    Lacrosse
7  Basketball
8      Tennis
9     Cricket
10     Soccer

> class(theDF[, "Sport", drop = FALSE])

[1] "data.frame"

>
> theDF[, 3, drop = FALSE]

        Sport
1      Hockey
2    Football
3    Baseball
4     Curling
5       Rugby
6    Lacrosse
7  Basketball
8      Tennis
9     Cricket
10     Soccer

> class(theDF[, 3, drop = FALSE])

[1] "data.frame"
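
The different return types can be compared side by side. This is a minimal sketch using a two-column stand-in for theDF, and returnsDF is a name introduced only for this illustration.

```r
# a two-column stand-in for theDF
theDF <- data.frame(First = 10:1, Second = -4:5)

# does each way of selecting one column return a data.frame?
returnsDF <- c(
    dollar       = is.data.frame(theDF$Second),
    rowColumn    = is.data.frame(theDF[, "Second"]),
    doubleSquare = is.data.frame(theDF[["Second"]]),
    singleSquare = is.data.frame(theDF["Second"]),
    dropFalse    = is.data.frame(theDF[, "Second", drop = FALSE])
)
returnsDF
```

Only the last two forms keep the data.frame structure. That matters when later code expects columns and row names rather than a bare vector.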

In Section 4.4.2 we saw that factors are stored specially. To see how they would be represented in data.frame form, use model.matrix to create a set of indicator (or dummy) variables. That is, there is one column for each level of a factor, holding a 1 if a row contains that level and a 0 otherwise.

> newFactor <- factor(c("Pennsylvania", "New York", "New Jersey", "New York",
+     "Tennessee", "Massachusetts", "Pennsylvania", "New York"))
> model.matrix(~newFactor - 1)

  newFactorMassachusetts newFactorNew Jersey newFactorNew York
1                     0                   0                 0
2                     0                   0                 1
3                     0                   1                 0
4                     0                   0                 1
5                     0                   0                 0
6                     1                   0                 0
7                     0                   0                 0
8                     0                   0                 1
  newFactorPennsylvania newFactorTennessee
1                     1                  0
2                     0                  0
3                     0                  0
4                     0                  0
5                     0                  1
6                     0                  0
7                     1                  0
8                     0                  0
attr(,"assign")
[1] 1 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$newFactor
[1] "contr.treatment"

We learn more about formulas (the argument to model.matrix) in Sections 11.2 and 12.3.2 and Chapters 15 and 16.
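
Two properties of the indicator encoding above can be verified with a short sketch: there is one column per level, and each row contains exactly one 1. The names indicators and recovered are introduced only for this illustration.

```r
newFactor <- factor(c("Pennsylvania", "New York", "New Jersey", "New York",
                      "Tennessee", "Massachusetts", "Pennsylvania", "New York"))
indicators <- model.matrix(~ newFactor - 1)

# one column for each level of the factor
ncol(indicators) == nlevels(newFactor)

# each row has exactly one 1
all(rowSums(indicators) == 1)

# the position of the 1 in each row identifies the original level
recovered <- levels(newFactor)[apply(indicators, 1, which.max)]
all(recovered == as.character(newFactor))
```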

5.2. Lists

Often a container is needed to hold arbitrary objects of either the same type or varying types. R accomplishes this through lists. They store any number of items of any type. A list can contain all numerics or characters or a mix of the two or data.frames or, recursively, other lists.

Lists are created with the list function where each argument to the function becomes an element of the list.

> # creates a three element list
> list(1, 2, 3)

[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

>
> # creates a single element list where the only element is a vector
> # that has three elements
> list(c(1, 2, 3))

[[1]]
[1] 1 2 3

>
> # creates a two element list
> # the first element is a three element vector
> # the second element is a five element vector
> (list3 <- list(c(1, 2, 3), 3:7))

[[1]]
[1] 1 2 3

[[2]]
[1] 3 4 5 6 7
>
> # two element list
> # first element is a data.frame
> # second element is a 10 element vector
> list(theDF, 1:10)

[[1]]
   First Second      Sport
1     10     -4     Hockey
2      9     -3   Football
3      8     -2   Baseball
4      7     -1    Curling
5      6      0      Rugby
6      5      1   Lacrosse
7      4      2 Basketball
8      3      3     Tennis
9      2      4    Cricket
10     1      5     Soccer

[[2]]
 [1]  1  2  3  4  5  6  7  8  9 10

>
> # three element list
> # first is a data.frame
> # second is a vector
> # third is list3, which holds two vectors
> list5 <- list(theDF, 1:10, list3)
> list5

[[1]]
   First Second      Sport
1     10     -4     Hockey
2      9     -3   Football
3      8     -2   Baseball
4      7     -1    Curling
5      6      0      Rugby
6      5      1   Lacrosse
7      4      2 Basketball
8      3      3     Tennis
9      2      4    Cricket
10     1      5     Soccer

[[2]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[3]]
[[3]][[1]]
[1] 1 2 3

[[3]][[2]]
[1] 3 4 5 6 7

Notice in the previous block of code (where list3 was created) that enclosing an expression in parentheses displays the results after execution.

Like data.frames, lists can have names. Each element can have its own name, and the names can be either viewed or assigned using names.


> names(list5)

NULL

> names(list5) <- c("data.frame", "vector", "list")
> names(list5)

[1] "data.frame" "vector"     "list"

> list5

$data.frame
   First Second      Sport
1     10     -4     Hockey
2      9     -3   Football
3      8     -2   Baseball
4      7     -1    Curling
5      6      0      Rugby
6      5      1   Lacrosse
7      4      2 Basketball
8      3      3     Tennis
9      2      4    Cricket
10     1      5     Soccer

$vector
 [1]  1  2  3  4  5  6  7  8  9 10

$list
$list[[1]]
[1] 1 2 3

$list[[2]]
[1] 3 4 5 6 7

Names can also be assigned to list elements during creation using name-value pairs.

> list6 <- list(TheDataFrame = theDF, TheVector = 1:10, TheList = list3)
> names(list6)

[1] "TheDataFrame" "TheVector"      "TheList"

> list6

$TheDataFrame
   First Second      Sport
1     10     -4     Hockey
2      9     -3   Football
3      8     -2   Baseball
4      7     -1    Curling
5      6      0      Rugby
6      5      1   Lacrosse
7      4      2 Basketball
8      3      3     Tennis
9      2      4    Cricket
10     1      5     Soccer

$TheVector
 [1]  1  2  3  4  5  6  7  8  9 10

$TheList
$TheList[[1]]
[1] 1 2 3

$TheList[[2]]
[1] 3 4 5 6 7

Creating an empty list of a certain size is, perhaps confusingly, done with vector.

> (emptyList <- vector(mode = "list", length = 4))

[[1]]
NULL

[[2]]
NULL

[[3]]
NULL

[[4]]
NULL

To access an individual element of a list, use double square brackets, specifying either the element number or name. Note that this allows access to only one element at a time.

> list5[[1]]

   First Second      Sport
1     10     -4     Hockey
2      9     -3   Football
3      8     -2   Baseball
4      7     -1    Curling
5      6      0      Rugby
6      5      1   Lacrosse
7      4      2 Basketball
8      3      3     Tennis
9      2      4    Cricket
10     1      5     Soccer

> list5[["data.frame"]]

   First Second      Sport
1     10     -4     Hockey
2      9     -3   Football
3      8     -2   Baseball
4      7     -1    Curling
5      6      0      Rugby
6      5      1   Lacrosse
7      4      2 Basketball
8      3      3     Tennis
9      2      4    Cricket
10     1      5     Soccer

Once an element is accessed it can be treated as if that actual element is being used, allowing nested indexing of elements.

> list5[[1]]$Sport

 [1] Hockey     Football   Baseball   Curling    Rugby      Lacrosse
 [7] Basketball Tennis     Cricket    Soccer
10 Levels: Baseball Basketball Cricket Curling Football ... Tennis

> list5[[1]][, "Second"]

 [1] -4 -3 -2 -1  0  1  2  3  4  5

> list5[[1]][, "Second", drop = FALSE]

    Second
1       -4
2       -3
3       -2
4       -1
5        0
6        1
7        2
8        3
9        4
10       5
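
For contrast, single square brackets on a list return a smaller list rather than the element itself. The following minimal sketch, which reuses list3 from earlier, shows the difference along with one level of nested indexing.

```r
list3 <- list(c(1, 2, 3), 3:7)

# double brackets return the element itself
class(list3[[2]])

# single brackets return a list holding that element
class(list3[2])
length(list3[2])

# nested indexing: the third value of the second element
list3[[2]][3]
```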

It is possible to append elements to a list simply by using an index (either numeric or named) that does not exist.

> # see how long it currently is
> length(list5)

[1] 3

>
> # add a fourth element, unnamed
> list5[[4]] <- 2
> length(list5)

[1] 4

>
> # add a fifth element, named
> list5[["NewElement"]] <- 3:6
> length(list5)

[1] 5

>
> names(list5)

[1] "data.frame" "vector"     "list"       ""         "NewElement"

> list5

$data.frame
   First Second      Sport
1     10     -4     Hockey
2      9     -3   Football
3      8     -2   Baseball
4      7     -1    Curling
5      6      0      Rugby
6      5      1   Lacrosse
7      4      2 Basketball
8      3      3     Tennis
9      2      4    Cricket
10     1      5     Soccer

$vector
 [1]  1  2  3  4  5  6  7  8  9 10

$list
$list[[1]]
[1] 1 2 3

$list[[2]]
[1] 3 4 5 6 7

[[4]]
[1] 2

$NewElement
[1] 3 4 5 6

Occasionally appending to a list—or vector or data.frame for that matter—is fine, but doing so repeatedly is computationally expensive. So it is best to create a list as long as its final desired size and then fill it in using the appropriate indices.
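
The preallocate-and-fill pattern can be sketched as follows. The size n and the name squares are hypothetical, chosen only for this illustration.

```r
# hypothetical final size of the list
n <- 5

# create the list at its final size up front
squares <- vector(mode = "list", length = n)

# fill each slot by index rather than appending
for (i in seq_len(n)) {
    squares[[i]] <- i^2
}

length(squares)
squares[[3]]
```

Because the list never changes size inside the loop, R does not have to copy it on every iteration.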

5.3. Matrices

A very common mathematical structure that is essential to statistics is a matrix. This is similar to a data.frame in that it is rectangular with rows and columns except that every single element, regardless of column, must be the same type, most commonly all numerics. They also act similarly to vectors with element-by-element addition, multiplication, subtraction, division and equality. The nrow, ncol and dim functions work just like they do for data.frames.


> # create a 5x2 matrix
> A <- matrix(1:10, nrow = 5)
> # create another 5x2 matrix
> B <- matrix(21:30, nrow = 5)
> # create a 2x10 matrix
> C <- matrix(21:40, nrow = 2)
> A

     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10

> B

     [,1] [,2]
[1,]   21   26
[2,]   22   27
[3,]   23   28
[4,]   24   29
[5,]   25   30

> C

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]   21   23   25   27   29   31   33   35   37    39
[2,]   22   24   26   28   30   32   34   36   38    40

> nrow(A)

[1] 5

> ncol(A)

[1] 2

> dim(A)

[1] 5 2

> # add them
> A + B

     [,1] [,2]
[1,]   22   32
[2,]   24   34
[3,]   26   36
[4,]   28   38
[5,]   30   40

> # multiply them
> A * B

     [,1] [,2]
[1,]   21  156
[2,]   44  189
[3,]   69  224
[4,]   96  261
[5,]  125  300

> # see if the elements are equal
> A == B

      [,1]  [,2]
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
[5,] FALSE FALSE

Matrix multiplication is a commonly used operation in mathematics, requiring the number of columns of the left-hand matrix to be the same as the number of rows of the right-hand matrix. Both A and B are 5x2, so we will transpose B so it can be used on the right-hand side, giving a 5x5 result.

> A %*% t(B)

     [,1] [,2] [,3] [,4] [,5]
[1,]  177  184  191  198  205
[2,]  224  233  242  251  260
[3,]  271  282  293  304  315
[4,]  318  331  344  357  370
[5,]  365  380  395  410  425
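
The dimension requirement can be seen by comparing the shapes of a few products. In this minimal sketch the name attempt and the "non-conformable" label returned by the error handler are introduced only for this illustration.

```r
A <- matrix(1:10, nrow = 5)
B <- matrix(21:30, nrow = 5)

# 5x2 times 2x5 gives 5x5
dim(A %*% t(B))

# 2x5 times 5x2 gives 2x2
dim(t(A) %*% B)

# 5x2 times 5x2 is not allowed, so R raises an error
attempt <- tryCatch(A %*% B, error = function(e) "non-conformable")
attempt
```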

Another similarity with data.frames is that matrices can also have row and column names.


> colnames(A)

NULL

> rownames(A)

NULL

> colnames(A) <- c("Left", "Right")
> rownames(A) <- c("1st", "2nd", "3rd", "4th", "5th")
>
> colnames(B)

NULL

> rownames(B)

NULL

> colnames(B) <- c("First", "Second")
> rownames(B) <- c("One", "Two", "Three", "Four", "Five")
>
> colnames(C)

NULL

> rownames(C)

NULL

> colnames(C) <- LETTERS[1:10]
> rownames(C) <- c("Top", "Bottom")

There are two special vectors, letters and LETTERS, that contain the lower-case and upper-case letters, respectively.

Notice the effect when transposing a matrix and multiplying matrices. Transposing naturally flips the row and column names. Matrix multiplication keeps the row names from the left matrix and the column names from the right matrix.

> t(A)
      1st 2nd 3rd 4th 5th
Left    1   2   3   4   5
Right   6   7   8   9  10

> A %*% C

      A   B   C   D   E   F   G   H   I   J
1st 153 167 181 195 209 223 237 251 265 279
2nd 196 214 232 250 268 286 304 322 340 358
3rd 239 261 283 305 327 349 371 393 415 437
4th 282 308 334 360 386 412 438 464 490 516
5th 325 355 385 415 445 475 505 535 565 595
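
How names carry through these operations can be confirmed with a short sketch. Setting the names through the dimnames argument of matrix, rather than with colnames and rownames as in the text, is a shortcut assumed here to keep the example compact. AC is a name introduced only for this illustration.

```r
A <- matrix(1:10, nrow = 5,
            dimnames = list(c("1st", "2nd", "3rd", "4th", "5th"),
                            c("Left", "Right")))
C <- matrix(21:40, nrow = 2,
            dimnames = list(c("Top", "Bottom"), LETTERS[1:10]))

AC <- A %*% C

# row names come from the left matrix
identical(rownames(AC), rownames(A))

# column names come from the right matrix
identical(colnames(AC), colnames(C))

# transposing swaps row and column names
identical(rownames(t(A)), colnames(A))
```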

5.4. Arrays

An array is essentially a multidimensional vector. All of its elements must be of the same type, and individual elements are accessed in a similar fashion using square brackets. The first index is the row, the second is the column and the remaining indices are for the outer dimensions.

> theArray <- array(1:12, dim = c(2, 3, 2))
> theArray

, , 1

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

, , 2

     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12

> theArray[1, , ]

     [,1] [,2]
[1,]    1    7
[2,]    3    9
[3,]    5   11

> theArray[1, , 1]

[1] 1 3 5

> theArray[, , 1]

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

The main difference between an array and a matrix is that matrices are restricted to two dimensions while arrays can have an arbitrary number.
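
The relationship between the two structures can be illustrated with a minimal sketch. The names twoD and fourD are introduced only for this illustration.

```r
theArray <- array(1:12, dim = c(2, 3, 2))

# a single element: row 2, column 3, second outer slice
theArray[2, 3, 2]

# an array with exactly two dimensions is a matrix
twoD <- array(1:6, dim = c(2, 3))
is.matrix(twoD)

# a three-dimensional array is not
is.matrix(theArray)

# arrays are not limited to three dimensions
fourD <- array(1:24, dim = c(2, 3, 2, 2))
length(dim(fourD))
```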

5.5. Conclusion

Data come in many types and structures, which can pose a problem for some analysis environments but R handles them with aplomb. The most common data structure is the one-dimensional vector, which forms the basis of everything in R. The most powerful structure is the data.frame—something special in R that most other languages do not have—which handles mixed data types in a spreadsheet-like format. Lists are useful for storing collections of items like a hash in Perl.
