Hour 3. Single-Mode Data Structures


What You’ll Learn in This Hour:

• The common R data types

• What a vector object is

• What a matrix object is

• What an array object is


R is commonly used to gain insight from data, using graphical or analytic methods. To use R effectively, you must have a good working knowledge of the basic data structures. In this hour, we describe the standard types of data found in R and introduce three key structures that can be used to store these data types: vectors, matrices, and arrays. We will look at the ways in which these structures can be created and managed, with a focus on how to extract data from them.

The R Data Types

Four standard types of data can be used in R. These data types, or “modes,” as they are formally known, are as follows:

• Numeric values (integers or continuous values)

• Character strings

• Logical values (TRUE and FALSE values)

• Complex numbers (with real and imaginary parts)

The following code shows examples of each of these data types:

> 4 + 5     # numeric
[1] 9
> "Hello"   # character
[1] "Hello"
> 4 > 5     # logical (is 4 greater than 5)
[1] FALSE
> 3 + 4i    # complex
[1] 3+4i


Note: Quotation Marks

Note the use of double quotation marks to specify character data. You may use either double or single quotation marks, but the opening and closing marks must match.


The mode Function

In the last section you saw examples of the four “modes” of data within R. You can use the mode function directly to discover the mode of data held in any object, as illustrated in the following example:

> X <- 4 + 5      # Assign a (numeric) value to X
> X               # Print the value of X
[1] 9
> mode(X)         # The mode of X
[1] "numeric"

> X < 10          # Logical statement: is X less than 10?
[1] TRUE
> mode(X < 10)    # The mode of this data
[1] "logical"


Note: Missing Values

In R, any missing value is represented with an “NA” symbol. This can be a “missing” numeric, character, logical, or complex value.
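
As a minimal sketch of how missing values interact with modes, the following lines create a few small objects (naMode, numWithNA, and chrWithNA are names invented for this illustration) and query them. A missing value inside a vector takes on the mode of the values around it:

```r
# A bare NA on its own is stored as a logical value
naMode <- mode(NA)
naMode             # "logical"

# Inside a vector, NA adopts the mode of the surrounding values
numWithNA <- c(1, 2, NA)
chrWithNA <- c("a", NA)
mode(numWithNA)    # "numeric"
mode(chrWithNA)    # "character"

# is.na flags which elements are missing
is.na(numWithNA)   # FALSE FALSE  TRUE
```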


Vectors, Matrices, and Arrays

In R, there are three data structures designed to store a single type of data. These structures are known as “single-mode” data structures:

• Vectors: series of values

• Matrices: rectangular structures with rows and columns

• Arrays: higher-dimension structures (for example, 3D and 4D arrays)

Given that these are single-mode structures, they may only hold a single type of data. Therefore, you may have a numeric vector or a character matrix, for example, but you cannot create an array that contains both numeric and logical data.

Vectors

A vector is a series of values of the same mode. It is the most basic data structure in R, and most functions in R are ultimately designed to operate on vectors. In this section, we look at the following:

• Some ways to create vectors

• The attributes of a vector

• The ways in which you can extract information from a vector

Creating Vectors

There are many ways to create vectors in the R language, and many functions will return vectors as an output (such as the set of functions that create random samples from statistical distributions, which you’ll see later in Hour 6, “Common R Utility Functions”). In this section, we focus on four ways to create simple vectors.

Combining Elements with the c Function

The c function allows you to create simple vectors by combining elements of the same mode. (Note that c is lowercase!) You specify as many elements as you want, separated by commas, optionally saving the results as objects for reuse later. Here’s an example:

> numericVector <- c(2, 6, 8, 4, 2, 9, 4, 0)  # Vector of numerics
> numericVector              # Print the numeric vector
[1] 2 6 8 4 2 9 4 0
> mode(numericVector)        # What is the mode of "numericVector"?
[1] "numeric"

> c("Hello", "There")        # Vector of characters
[1] "Hello" "There"
> c(F, T, T, F, F, T, F, F)  # Vector of logicals
[1] FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
> c(3+4i, 5+9i, 3+7i)        # Vector of complex numbers
[1] 3+4i 5+9i 3+7i


Note: Logical Values

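You specify logical values without quotation marks, using either T and F or TRUE and FALSE. TRUE and FALSE are reserved words, whereas T and F are ordinary variables that could be reassigned, so the full forms are the safer choice. Both forms are shown here: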

> c(T, F, TRUE, FALSE)
[1]  TRUE FALSE  TRUE FALSE


You can use the c function to combine single values, or even vectors of values (because a single value is actually a vector of length 1). In this way, you can combine vectors, as illustrated here:

> X <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)   # Create a simple vector of numerics
> X                                       # Print the vector
 [1]  1  2  3  4  5  6  7  8  9 10
> c(X, X, X, X, X)                        # Combine vectors
 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4
[25]  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8
[49]  9 10


Note: Indexed Printing

When you print vectors in R, you see that the values are prefixed with [1]. This specifies the index of the first value printed on that line. The indexing behavior is easier to see when you print a vector with many elements. In the preceding example, the first 5 on the second “line” of printing is the 25th value in the vector, as noted by the [25] that precedes it.

Although the horizontal printing of the vector may encourage you to think of a vector as a “row” of data, this is just a printing convention. In fact, a vector has no structure: It is simply a series of values.



Tip: Multi-Mode Structures

Earlier, we stated that vectors are strictly single-mode structures—that is, they contain only values of a single data type. If you try to create vectors containing more than one mode of data, R coerces the vector to a single mode, as shown here:

> c(1, 2, 3, "Hello")                 # Multiple modes
[1] "1"     "2"     "3"     "Hello"
> c(1, 2, 3, TRUE, FALSE)             # Multiple modes
[1] 1 2 3 1 0
> c(1, 2, 3, TRUE, FALSE, "Hello")    # Multiple modes
[1] "1"     "2"     "3"     "TRUE"  "FALSE" "Hello"
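
The coercion in these examples follows a fixed order: logical values give way to numeric, numeric to complex, and everything gives way to character. The sketch below (the object names are made up for this illustration) mixes two modes at a time and uses mode to report the result:

```r
# Each vector mixes two modes; mode() reports what R coerced it to
logicalAndNumeric   <- c(TRUE, 1)
numericAndComplex   <- c(1, 3 + 4i)
complexAndCharacter <- c(3 + 4i, "Hello")

mode(logicalAndNumeric)    # "numeric"
mode(numericAndComplex)    # "complex"
mode(complexAndCharacter)  # "character"
```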


Creating a Sequence of “Integers”

In the previous section, we looked at the use of the c function to create vectors. In one of the examples, we created a sequence of integers:

> X <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)   # Create a simple vector of numerics
> X                                       # Print the vector
 [1]  1  2  3  4  5  6  7  8  9 10

This is a simple line of code that creates a sequence of values from 1 to 10, “by 1.” However, if you wanted to create a sequence of integer values from 1 to 100, this would require significantly more typing! If you do wish to create a series of integers, you can use the : symbol, specifying the start and end values, as follows:

> 1:100   # Series of values from 1 to 100
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100

In fact, the : notation can be used to create any sequence of numeric values from one number to another number, “by 1,” as shown in the following examples:

> 1:5
[1] 1 2 3 4 5
> 5:1
[1] 5 4 3 2 1
> -1:1
[1] -1  0  1
> 1.3:5.3
[1] 1.3 2.3 3.3 4.3 5.3

You can combine R statements, such as those in the last two sections, to create more complex vectors. Here is an example of the c function and the : notation used together to create a symmetric pattern of values:

> c(0:4, 5, 4:0)
 [1] 0 1 2 3 4 5 4 3 2 1 0

You can operate on vectors to create sequences where the “gap” in the sequence is not one. For example, this line of code would create a series of values from 2 to 20, “by 2”:

> 2*1:10
 [1]  2  4  6  8 10 12 14 16 18 20

This works well for simple sequences, such as the one illustrated here. However, for more complex sequences of numeric values (for example, 1.3 to 8.4, by 0.3), you need a more general approach.
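
The expression 2*1:10 works because the : operator is evaluated before multiplication and before binary addition or subtraction. The following sketch (object names are illustrative only) shows how round brackets change the result:

```r
# The sequence 1:10 is built first, then every value is doubled
doubled <- 2 * 1:10       # 2 4 6 8 10 12 14 16 18 20

# Brackets force the multiplication first, giving the sequence 2:10
bracketed <- (2 * 1):10   # 2 3 4 5 6 7 8 9 10

# The same rule applies to subtraction
shifted <- 1:10 - 1       # 0 1 2 3 4 5 6 7 8 9
shorter <- 1:(10 - 1)     # 1 2 3 4 5 6 7 8 9
```

When in doubt, adding brackets around the intended sequence makes the order of evaluation explicit.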


Note: Letter Sequences

In this section we have looked at regular series of numeric (primarily integer) values. This approach works only for numeric values. For example, you cannot create a series of letters using syntax such as A:Z. You will, however, see how to achieve letter sequences in the “Subscripting Vectors” section, later in this hour.


Creating a Sequence of Numeric Values with the seq Function

In the preceding section, we used the : notation to create a series of numeric values, where the “gap” in the sequence is one. A more general way of performing the same operation is with the seq function. The first two arguments to seq are the starting and ending values, and the default gap is one. Therefore, the following lines are equivalent:

> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> seq(1, 10)
 [1]  1  2  3  4  5  6  7  8  9 10

The advantage of using the seq function is that it has an additional argument, by, that allows you to specify the gap between consecutive sequence values, as shown in the following examples:

> seq(1, 10, by = 0.5)   # Sequence from 1 to 10 by 0.5
 [1]  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5
[13]  7.0  7.5  8.0  8.5  9.0  9.5 10.0

> seq(2, 20, by = 2)     # Sequence from 2 to 20 by 2
 [1]  2  4  6  8 10 12 14 16 18 20

> seq(5, -5, by = -2)    # Sequence from 5 to -5 by -2
[1]  5  3  1 -1 -3 -5

These examples illustrate some simple sequences of values. However, let’s consider the following examples, where we create a sequence of values from 1.3 to 8.4 by 0.3:

> seq(1.3, 8.4, by = 0.3)  # Sequence from 1.3 to 8.4 by 0.3
 [1] 1.3 1.6 1.9 2.2 2.5 2.8 3.1 3.4 3.7 4.0 4.3 4.6 4.9 5.2 5.5
[16] 5.8 6.1 6.4 6.7 7.0 7.3 7.6 7.9 8.2

In this example, note that the last value in the vector is 8.2, even though we requested a sequence from 1.3 to 8.4. The sequence stops short of 8.4 because the difference between the start and end values (7.1) is not a whole multiple of the specified “gap” (0.3).

If instead we want the sequence to finish exactly at a particular end point, we can specify the length of the output vector rather than the gap between consecutive values. The full name of this argument is length.out; R’s partial argument matching allows it to be abbreviated to length, as in this example:

> seq(1.3, 8.4, length = 10)  # Sequence of 10 values from 1.3 to 8.4
 [1] 1.300000 2.088889 2.877778 3.666667 4.455556 5.244444
 [7] 6.033333 6.822222 7.611111 8.400000
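
When an output length is requested, the gap between values is implied: it is the distance from start to end divided by one less than the number of values. The sketch below (variable names are illustrative) checks this for the sequence above, using the full argument name length.out:

```r
from <- 1.3
to   <- 8.4
n    <- 10

# Implied gap: (8.4 - 1.3) / 9, approximately 0.7888889
gap <- (to - from) / (n - 1)

# Same sequence as before, written with the full argument name
bySeq <- seq(from, to, length.out = n)

# diff returns the gaps between consecutive values; all match the implied gap
diff(bySeq)
```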

Creating a Sequence of Repeated Values

In the earlier section “Combining Elements with the c Function,” we created a repeated sequence of values by combining a created vector a number of times:

> X <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)   # Create a simple vector of numerics
> X                                       # Print the vector
 [1]  1  2  3  4  5  6  7  8  9 10

> c(X, X, X, X, X)                        # Combine vectors
 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4
[25]  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8
[49]  9 10

We can use the rep function in R to create a vector containing repeated values. The first two arguments to the rep function are the value(s) to repeat and the number of times to repeat the value(s), as shown here:

> rep("Hello", 5)  # Repeat "Hello" 5 times
[1] "Hello" "Hello" "Hello" "Hello" "Hello"

In the last example, we are repeating a single value, but the first argument to rep could be a vector of values. In this way, we could re-create the earlier vector of repeated sequences (where we used the c function to combine multiple instances of a vector) using rep, as follows:

> X <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
> rep(X, 5)        # Repeat the X vector 5 times
 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4
[25]  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8
[49]  9 10

You saw in the earlier section “Creating a Sequence of Integers” that you can create a series of integers with the : notation. Therefore, we can further simplify this example as follows:

> X <- 1:10
> rep(X, 5)        # Repeat the X vector 5 times
 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4
[25]  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8
[49]  9 10

Or even:

> rep(1:10, 5)      # Repeat 1:10 5 times
 [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4
[25]  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8
[49]  9 10

In these examples, we repeat a whole series of values a specific number of times. Alternatively, we can repeat each individual value a specified number of times by supplying, as the second argument, a vector that is the same length as the first argument:

> rep( c("A", "B", "C"), c(4, 1, 3))
[1] "A" "A" "A" "A" "B" "C" "C" "C"

In this example, we repeat “A” four times, “B” once, and “C” three times. Using this same approach, we can repeat each value of a vector the same number of times, as shown here:

> rep( c("A", "B", "C"), c(3, 3, 3))
[1] "A" "A" "A" "B" "B" "B" "C" "C" "C"

Alternatively, because the second input is a repeated set of values, this could be written as follows:

> rep( c("A", "B", "C"), rep(3, 3))
[1] "A" "A" "A" "B" "B" "B" "C" "C" "C"

However, an argument to rep called each provides an easy way to achieve the same result:

> rep( c("A", "B", "C"), each = 3)
[1] "A" "A" "A" "B" "B" "B" "C" "C" "C"

As you can see, the rep function can be used to create a variety of vectors with repeated sequences. Let’s quickly recap the three ways of using rep, as illustrated in this section:

> rep( c("A", "B", "C"), 3)           # Repeat the vector 3 times
[1] "A" "B" "C" "A" "B" "C" "A" "B" "C"

> rep( c("A", "B", "C"), c(4, 1, 3))  # Repeat each value a specific number of times
[1] "A" "A" "A" "A" "B" "C" "C" "C"

> rep( c("A", "B", "C"), each = 3)    # Repeat each value 3 times
[1] "A" "A" "A" "B" "B" "B" "C" "C" "C"
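
These styles can also be combined. As a brief sketch (the object names are invented here), supplying both each and times repeats every element first and then repeats the whole result, while the length.out argument cuts the repetition off at a fixed length:

```r
# Each element twice, then the whole pattern three times
eachThenTimes <- rep(c("A", "B"), each = 2, times = 3)
eachThenTimes   # "A" "A" "B" "B" "A" "A" "B" "B" "A" "A" "B" "B"

# Keep repeating the vector until the result has 7 elements
fixedLength <- rep(c("A", "B", "C"), length.out = 7)
fixedLength     # "A" "B" "C" "A" "B" "C" "A"
```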


Caution: Nested Calls

The last section included the following line of code:

> rep( c("A", "B", "C"), rep(3, 3))
[1] "A" "A" "A" "B" "B" "B" "C" "C" "C"

This is possibly the most complex line of code you’ve seen so far, and includes nested calls: The inputs to rep are, themselves, derived from calls to functions (c and rep, respectively). This sort of syntax is common in R, but care must be taken not to create overly complex nested calls because this may make your code hard to read and understand later. Where appropriate, consider breaking the code into smaller, commented fragments, as shown here:

> theVector <- c("A", "B", "C")  # Vector to repeat
> repTimes <- rep(3, 3)          # Number of times to repeat each element
> rep(theVector, repTimes)       # Repeat each element of the vector
[1] "A" "A" "A" "B" "B" "B" "C" "C" "C"


Vector Attributes

A vector has a number of attributes that you can query using a set of simple functions. Specifically, you can query a vector’s length, mode, and element names.

The mode function you saw earlier in this hour takes a vector input and returns the mode of the data it contains. Here’s an example:

> X <- c(6, 8, 3, 1, 7)  # Create a simple vector
> X                      # Print the vector
[1] 6 8 3 1 7
> mode(X)                # The mode of the vector
[1] "numeric"

If you want to see the number of elements in a vector, you can use the length function:

> length(X)              # Number of elements
[1] 5


Note: Missing Values

If we have a vector that contains one or more missing values, these values will still contribute to the vector’s length:

> Y <- c(4, 5, NA, 1, NA, 0)
> Y
[1]  4  5 NA  1 NA  0
> length(Y)
[1] 6

The third and fifth elements of the preceding vector exist—we just don’t know their values.
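
Because missing values count toward the length, it is often useful to count them separately. One possible approach, sketched here with the is.na function (nMissing and nKnown are illustrative names), is to sum the logical vector that is.na returns:

```r
Y <- c(4, 5, NA, 1, NA, 0)

# is.na returns TRUE for each missing element
is.na(Y)                       # FALSE FALSE  TRUE FALSE  TRUE FALSE

# TRUE counts as 1 when summed, so this is the number of missing values
nMissing <- sum(is.na(Y))      # 2

# Number of elements whose values are known
nKnown <- length(Y) - nMissing # 4
```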


A vector can also have element names, which you can query using the names function. (Note that we did not specify names for the vector created earlier.) Here’s an example:

> X <- c(6, 8, 3, 1, 7)  # Create a simple vector
> X                      # Print the vector
[1] 6 8 3 1 7
> names(X)               # Element names of X
NULL

In R, NULL signifies an empty structure. So here, the result of the call to the names function tells us that this vector has no element names. We come across vectors with element names in one of two ways: either as the result of a call to a function or when we assign names directly.

Consider an example where we have created a frequency count of men and women in a set of data. These numbers could be returned as a vector, as shown next:

> genderFreq   # Frequency by gender
[1] 165 147

Here, we see that the vector contains two values (165 and 147) that relate to the frequency count by gender. However, without labels, we do not know which value refers to which gender. As such, R may return a named vector, as shown here:

> genderFreq
Female   Male
   165    147

If we want to create a vector with named elements, we can specify names for the elements as we create the vector or assign names using the names function itself:

> genderFreq <- c(Female = 165, Male = 147)  # Create a vector with element names
> genderFreq
Female   Male
   165    147

> genderFreq <- c(165, 147)                  # Create a vector with no element names
> genderFreq
[1] 165 147
> names(genderFreq) <- c("Female", "Male")   # Assign element names
> genderFreq
Female   Male
   165    147

When we encounter a “named” vector, we can query it with the names function to return the (character) vector of element names:

> genderFreq           # Print the vector
Female   Male
   165    147
> names(genderFreq)    # Return the element names
[1] "Female" "Male"

To summarize, the three primary functions used to query vector attributes are listed in Table 3.1.

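Function   Purpose
mode       Returns the mode of the data held in the vector
length     Returns the number of elements in the vector
names      Returns the element names of the vector (NULL if there are none)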

TABLE 3.1 Functions to Query Vector Attributes

Subscripting Vectors

In this section, we look at the ways in which to extract subsets of data from a vector. We can achieve this using square brackets ([ ]) following the name of the vector, as follows:

VECTOR [ Input specifying the subset of data to return ]

The input itself can be one of five possible inputs, as shown in Table 3.2.

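Input               Effect
Blank               All values are returned
Positive integers   Positions of the values to return
Negative integers   Positions of the values to omit
Logical values      Values corresponding to TRUE are returned
Character values    Named elements to return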

TABLE 3.2 Possible Vector Subscripting Inputs


Caution: Square versus Round Brackets

When we call a function, we use round brackets, as shown in our examples of the functions c, seq, and rep. We use square brackets to reference data from an object. If we mistakenly use round brackets after an object’s name, R will assume we are trying to call a function rather than reference data:

> X     # A vector called X
[1] 6 8 3 1 7
> X[]   # Using square brackets
[1] 6 8 3 1 7
> X()   # Error when using round brackets
Error: could not find function "X"


Subscripting Vectors: Blank Inputs

The first (and simplest) input is “blank,” which has the result of returning the entire vector of values:

> X <- c(6, 8, 3, 1, 7)  # Create a simple vector

> X                      # Print the values
[1] 6 8 3 1 7

> X [ ]                  # Blank input
[1] 6 8 3 1 7


Tip: White Space

White space between the elements of a command is generally ignored by R (unless it falls within quotation marks as part of a string). Therefore, in this example, the command X [ ] is equivalent to X[], and adding further spaces makes no difference. As a convention, we will use spaces to improve readability where appropriate.


Subscripting Vectors: Positive Integers

If you specify a vector of integers as the input, they are used as an index of values to return:

> X                      # Print the values
[1] 6 8 3 1 7
> X [ c(1, 3, 5) ]       # 1st, 3rd and 5th elements
[1] 6 3 7

In the preceding example, we used a vector of positive integers within the square brackets as the index. However, we could alternatively create a separate vector with which to reference the data:

> index <- c(1, 3, 5)   # Create index vector
> X [ index ]           # 1st, 3rd and 5th elements
[1] 6 3 7

Using this approach, we could also specify values to omit from our vector. For example, if we wanted to return all values except the third value, we could achieve that as follows:

> X [ c(1:2, 4:5) ]    # Return the 1st, 2nd, 4th and 5th elements
[1] 6 8 1 7

Subscripting Vectors: Negative Integers

In the last example, we used a vector of positive integers to remove a value from a vector (that is, to omit one value in the return). However, for larger vectors this is not a scalable solution.

If we provide a vector of negative integers as the input, this refers to an index of values to omit from the vector, as illustrated in this example:

> X                    # Original vector of values
[1] 6 8 3 1 7
> X [ c(1:2, 4:5) ]    # Omit 3rd value using positive integers
[1] 6 8 1 7
> X [ -3 ]             # Omit 3rd value using negative integers
[1] 6 8 1 7

If we want to omit more than one position, we could either provide a vector of negative integers or place a minus symbol in front of a vector of positive integers. Consequently, the following two lines are equivalent:

> X [ c(-2, -4) ]   # Omit 2nd and 4th values
[1] 6 3 7
> X [ -c(2, 4) ]    # Omit 2nd and 4th values
[1] 6 3 7

Among other uses, this syntax allows us to exclude values from a vector based on another vector, as shown here:

> Y                 # Vector of values to subset
 [1] 6 9 4 3 6 8 1 9 0 3 4 8 7 4 5
> outliers          # Index of values to omit
[1]  4  7  9 11 15
> Y [ -outliers ]   # Omit the values specified in outliers
 [1] 6 9 4 6 8 9 3 8 7 4
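
Two further points are worth a short sketch (object names are illustrative). Negative subscripts can be computed, which is handy for dropping the last element, and positive and negative values cannot be mixed in one subscript. The second point matters when combining the minus sign with the : notation, because -1:2 means the sequence from -1 to 2:

```r
X <- c(6, 8, 3, 1, 7)

# Drop the last element, whatever the length of the vector
dropLast <- X[-length(X)]      # 6 8 3 1

# Drop the first two elements: note the brackets around 1:2
dropFirstTwo <- X[-(1:2)]      # 3 1 7

# Without brackets, -1:2 is c(-1, 0, 1, 2), which mixes signs and fails;
# try() captures the error so that the script can continue
mixed <- try(X[-1:2], silent = TRUE)
inherits(mixed, "try-error")   # TRUE
```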

Subscripting Vectors: Logical Values

Our fourth possible input is a vector of logical values the same length as the original vector. When we reference a vector in this way, only the values corresponding to TRUE are returned, as illustrated here:

> X                        # Original vector
[1] 6 8 3 1 7
> c(T, T, F, F, T)         # Vector of logical values
[1]  TRUE  TRUE FALSE FALSE  TRUE
> X [ c(T, T, F, F, T) ]   # Return corresponding TRUE values only
[1] 6 8 7

The logical vector has TRUE values in the first, second, and fifth positions, so that is the index of values returned (6, 8, and 7).

Although this example illustrates the “mechanics” of how R returns values when given a logical vector input, in practice this is not useful (in other words, we will not commonly manually enter TRUE and FALSE values into a vector to subscript in this way).

More commonly, we use simple logical statements to create vectors of logical values to use as the input, as shown here:

> X            # Original vector
[1] 6 8 3 1 7
> X > 5        # Logical statement: where is X > 5?
[1]  TRUE  TRUE FALSE FALSE  TRUE
> X [ X > 5 ]  # Subset where values of X are greater than 5
[1] 6 8 7

This mirrors the previous example, although here the logical vector is generated by the statement X > 5 rather than typed by hand. Some other styles of logical statements we can use are listed here:

> X > 6            # Greater than 6
[1] FALSE  TRUE FALSE FALSE  TRUE
> X >= 6           # Greater than or equal to 6
[1]  TRUE  TRUE FALSE FALSE  TRUE
> X < 6            # Less than 6
[1] FALSE FALSE  TRUE  TRUE FALSE
> X <= 6           # Less than or equal to 6
[1]  TRUE FALSE  TRUE  TRUE FALSE
> X == 6           # X is equal to 6
[1]  TRUE FALSE FALSE FALSE FALSE
> X != 6           # X is not equal to 6
[1] FALSE  TRUE  TRUE  TRUE  TRUE
> X > 2 & X <= 6   # Between 2 (exclusive) and 6 (inclusive)
[1]  TRUE FALSE  TRUE FALSE FALSE
> X < 2 | X > 6    # Less than 2 or greater than 6
[1] FALSE  TRUE FALSE  TRUE  TRUE

Because these statements produce a logical vector that (by definition) is the same length as the input vector, they can all be used to subset the original vector:

> X                      # Original vector
[1] 6 8 3 1 7
> X [ X <= 6 ]           # Values less than or equal to 6
[1] 6 3 1
> X [ X != 6 ]           # Values that are not equal to 6
[1] 8 3 1 7
> X [ X >= 3 & X <= 7 ]  # Values between 3 and 7 (inclusive)
[1] 6 3 7

It is important to consider that, for these examples, R performs a two-step process: The input is evaluated, returning the logical vector, which is then used to reference the original vector.

This allows us to reference values of one vector based on a second or third vector, as shown here:

> ID         # Vector of ID values
[1] 1001 1002 1003 1004 1005
> AGE        # Vector of ages
[1] 18 35 26 42 22
> GENDER     # Vector of genders
[1] "M" "F" "M" "F" "F"

> AGE [ AGE > 25 ]                 # Values of AGE that are greater than 25
[1] 35 26 42
> ID [ AGE > 25 ]                  # ID where AGE is greater than 25
[1] 1002 1003 1004
> ID [ AGE > 25 & GENDER == "F" ]  # ID where AGE > 25 and GENDER is "F"
[1] 1002 1004
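
Missing values need a little care with logical subscripts, because a comparison involving NA returns NA rather than TRUE or FALSE. The following sketch uses a hypothetical variant of the AGE vector with one missing value, and shows one common way around the issue using the which function, which returns the positions of the TRUE values only:

```r
# Hypothetical ages with one missing value
AGE <- c(18, 35, NA, 42, 22)

AGE > 25                             # FALSE  TRUE    NA  TRUE FALSE

# An NA in the logical subscript produces an NA in the result
withNA <- AGE[AGE > 25]              # 35 NA 42

# which() keeps only the positions that are TRUE (here 2 and 4)
withoutNA <- AGE[which(AGE > 25)]    # 35 42
```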

Subscripting Vectors: Character Values

When a vector has element names, we can use a vector of characters to refer to the elements to return. First, let’s add element names to our vector example:

> names(X) <- c("A", "B", "C", "D", "E")      # Add element names

> X                                           # Original vector
A B C D E
6 8 3 1 7

> X[c("A", "C", "E")]                         # Reference based on names
A C E
6 3 7
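
Character subscripts can also be used to update a value by name, and asking for a name that does not exist returns a missing value rather than an error. A minimal sketch, in which the lookup object and the name "Z" are chosen purely for illustration:

```r
X <- c(A = 6, B = 8, C = 3, D = 1, E = 7)

# Replace the element named "B"
X["B"] <- 10
X["B"]                     # B is now 10

# "Z" is not an element name, so the second result is NA
lookup <- X[c("A", "Z")]
lookup
```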

Subscripting Vectors: Summary

At this point, we have looked at referencing data from a vector by specifying one of five possible inputs, as shown earlier in Table 3.2, examples of which are shown here:

> X [ ]                     # Blank: all values returned
A B C D E
6 8 3 1 7
> X [ c(1, 3, 5) ]          # Positives: Positions to return
A C E
6 3 7
> X [ -c(1, 3, 5) ]         # Negatives: Positions to omit
B D
8 1
> X [ X > 5 ]               # Logical: TRUE values returned
A B E
6 8 7
> X [ c("A", "C", "E") ]    # Character: Named elements returned
A C E
6 3 7


Tip: Sequence of Letters

As discussed earlier, you cannot use the : notation to directly create a sequence of letters (for example, A:E). However, there are two built-in R vectors (called letters and LETTERS) that contain the lowercase and uppercase letters of the alphabet, respectively:

> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m"
[14] "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"

> LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M"
[14] "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

Because these are vectors, we can reference them using square brackets with one of the five input types we just discussed. In this way, we can create sequences of lowercase or uppercase letters:

> letters [ 1:5 ]  # First 5 (lower case) letters
[1] "a" "b" "c" "d" "e"
> LETTERS [ 1:5 ]  # First 5 (upper case) letters
[1] "A" "B" "C" "D" "E"


Matrices

A matrix is a two-dimensional structure containing values of the same mode. Similar to the section “Vectors” earlier in this hour, in this section we look at the following topics:

• Some ways to create matrices

• The attributes of a matrix

• The ways in which we can extract information from a matrix

Creating Matrices

You typically create matrices in two fundamental ways:

• By combining a series of vectors to form rows or columns

• By reading a single vector into a matrix structure

Combining Vectors to Create a Matrix

You can use the cbind function to combine a series of vectors, thus forming the columns of a matrix. An example, creating a three-row-by-four-column matrix, is shown here:

> cbind(1:3, 3:1, c(2, 4, 6), rep(1, 3))
     [,1] [,2] [,3] [,4]
[1,]    1    3    2    1
[2,]    2    2    4    1
[3,]    3    1    6    1


Note: Recycling

Note here that we’ve created a matrix by supplying four vectors of the same length. However, if we supply vectors that are not of the same length, R will repeat (“recycle”) the shorter vectors to the length of the longest vector to create the matrix. That means we can re-create the preceding matrix by specifying a single 1 for the fourth column instead of repeating that value:

> cbind(1:3, 3:1, c(2, 4, 6), 1)
     [,1] [,2] [,3] [,4]
[1,]    1    3    2    1
[2,]    2    2    4    1
[3,]    3    1    6    1

In this example, the shorter-length vector is of length 1, which can easily be repeated to create a vector of length 3. If the shorter-length vectors cannot be recycled to exactly create the required length, a warning is provided. Consider the third column in this example:

> cbind(1:3, 3:1, c(2, 4), 1)
     [,1] [,2] [,3] [,4]
[1,]    1    3    2    1
[2,]    2    2    4    1
[3,]    3    1    2    1
Warning message:
In cbind(1:3, 3:1, c(2, 4), 1) :
  number of rows of result is not a multiple of vector length (arg 3)

As shown, the two values are recycled, but a warning message is produced because the number of rows in the result (3) is not a whole multiple of the length of the shorter vector (2).
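
Recycling is not specific to cbind. The same rule applies when vectors of different lengths are combined arithmetically, as this short sketch (with illustrative object names) suggests:

```r
# c(10, 20) is recycled three times to match the six values of 1:6
recycled <- 1:6 + c(10, 20)   # 11 22 13 24 15 26

# A single value is recycled across the whole vector
scaled <- 1:6 * 2             # 2 4 6 8 10 12
```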


Instead of using cbind, we can use the rbind function to specify the rows of a matrix. This time, we will use the same vectors to create a four-row-by-three-column matrix:

> rbind(1:3, 3:1, c(2, 4, 6), rep(1, 3))
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    3    2    1
[3,]    2    4    6
[4,]    1    1    1


Tip: Transposing Matrices

The t function can be used to transpose a matrix; therefore, the following commands are equivalent:

> cbind(1:3, 3:1, c(2, 4, 6), rep(1, 3))
     [,1] [,2] [,3] [,4]
[1,]    1    3    2    1
[2,]    2    2    4    1
[3,]    3    1    6    1
> t(rbind(1:3, 3:1, c(2, 4, 6), rep(1, 3)))
     [,1] [,2] [,3] [,4]
[1,]    1    3    2    1
[2,]    2    2    4    1
[3,]    3    1    6    1


Creating a Matrix with a Single Vector

As you just saw, the rbind and cbind functions can be used to create a matrix by combining vectors as rows or columns. An alternative way is to take a single vector of data and “read” the data into rows and columns of a matrix. You can achieve this using the matrix function, which accepts, as a first argument, the vector of data to be used:

> matrix(1:12)
      [,1]
 [1,]    1
 [2,]    2
 [3,]    3
 [4,]    4
 [5,]    5
 [6,]    6
 [7,]    7
 [8,]    8
 [9,]    9
[10,]   10
[11,]   11
[12,]   12

The matrix function has two arguments, nrow and ncol, that you can specify to create a matrix with specific “dimensions,” as shown here:

> matrix(1:12, nrow = 3, ncol = 4)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

In this example, we used both nrow and ncol to specify the dimensions of the matrix. In fact, we need only specify one dimension (nrow or ncol), because R works out the other from the length of the data, as shown here:

> matrix(1:12, nrow = 3)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

By default, the values are read in to the matrix in a column-wise manner, resulting in the first column containing the numbers 1 to 3 in this example. This is controlled by an argument to matrix called byrow, which, by default, is set to FALSE:

> matrix(1:12, nrow = 3, byrow = F)  # Default behavior – byrow = FALSE
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

We can change this argument to instead read in the values by row, as shown here:

> matrix(1:12, nrow = 3, byrow = TRUE)
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
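
One way to think about byrow is in terms of the t function seen earlier: filling a three-row matrix by row gives the same values as filling a four-row matrix by column and then transposing it. The sketch below (byRow and viaTranspose are names invented for this check) compares the two:

```r
# Fill three rows, reading the values across each row
byRow <- matrix(1:12, nrow = 3, byrow = TRUE)

# Fill four rows column-wise (the default), then transpose
viaTranspose <- t(matrix(1:12, nrow = 4))

# Both are 3 x 4 and hold the same value in every cell
dim(viaTranspose)             # 3 4
all(byRow == viaTranspose)    # TRUE
```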

Matrix Attributes

When we have created a matrix, we can query a number of matrix attributes using a set of utility functions. This includes functions to query the following:

• The mode of the matrix

• The dimensions of the matrix

• The row/column names of the matrix

As before, we can query the mode of the matrix using the mode function:

> aVector <- c(4, 5, 2, 7, 6, 1, 5, 5, 0, 4, 6, 9)  # Create a vector
> X <- matrix(aVector, nrow = 3)                    # Create a matrix
> X                                                 # Print the matrix
     [,1] [,2] [,3] [,4]
[1,]    4    7    5    4
[2,]    5    6    5    6
[3,]    2    1    0    9
> mode(X)                                           # The mode of the matrix
[1] "numeric"

Similarly, we can use the length function to return the number of elements in the matrix:

> length(X)   # Number of elements
[1] 12

Although the length function returns the total number of elements in the matrix, it does not allow us to directly see the structure (that is, the number of rows and columns) of the matrix. For this, we can use the dim function, which returns a vector of length 2, specifying the rows (first) and columns of the matrix:

> dim(X)     # Dimension of the matrix
[1] 3 4
> dim(X)[1]  # Number of rows
[1] 3
> dim(X)[2]  # Number of columns
[1] 4

Here, we use a positive integer to pick out one element of the vector returned by dim (1 for the number of rows, 2 for the number of columns). Alternatively, we can use the functions nrow and ncol to return the number of rows and columns directly:

> nrow(X)  # Number of rows
[1] 3
> ncol(X)  # Number of columns
[1] 4
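The link between length and the dimensions can be checked directly. The following sketch rebuilds the same matrix so that it runs on its own:

```r
aVector <- c(4, 5, 2, 7, 6, 1, 5, 5, 0, 4, 6, 9)
X <- matrix(aVector, nrow = 3)

# The total number of elements is the product of the dimensions
length(X) == nrow(X) * ncol(X)   # TRUE
prod(dim(X))                     # 12
```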

Earlier you saw that vectors can be associated with element names. With matrices, it is not practical to assign a name for each element (cell) of the matrix. However, you might see matrices that have row and column names.

You’ll either create matrices with row and column names (or “dimension names”) or, more commonly, come across matrices with dimension names as the result of an operation.

Consider an example where we have created a frequency count of age group versus gender from a set of data. These numbers could be returned as a matrix, as shown next:

> freqMatrix  # Frequency by Age Group and Gender
     [,1] [,2]
[1,]   75   68
[2,]   52   49
[3,]   38   30

Here, we can see that the matrix contains six values, which relate to the frequency count by age group and gender. However, without labels, we do not know what the values refer to. As such, R may return a matrix with dimension names, as shown here:

> freqMatrix
      Female Male
18-25     75   68
26-35     52   49
36+       38   30

If we want to create a matrix with dimension names, we can assign names using the dimnames function. It accepts a “list” structure with row and column names. (Note that we will cover lists in Hour 4, “Multi-Mode Data Structures.”) Here’s an example:

> freqMatrix                 # Original matrix – no row/column names
     [,1] [,2]
[1,]   75   68
[2,]   52   49
[3,]   38   30

> dimnames(freqMatrix) <- list(c("18-25", "26-35", "36+"),
+   c("Female", "Male"))     # Assign dimension names

> freqMatrix                 # Resulting matrix
      Female Male
18-25     75   68
26-35     52   49
36+       38   30

When we see a matrix that has dimension names, we can query those names using the dimnames function, which returns a “list” containing two character vectors:

> dimnames(freqMatrix)        # Dimension names of freqMatrix
[[1]]
[1] "18-25" "26-35" "36+"

[[2]]
[1] "Female" "Male"
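Dimension names can also be supplied when the matrix is created, and the rownames and colnames functions give access to one set of names at a time. The sketch below uses made-up counts and labels purely for illustration:

```r
# Hypothetical counts, used only for illustration
counts <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3,
                 dimnames = list(c("low", "mid", "high"), c("yes", "no")))

rownames(counts)   # "low"  "mid"  "high"
colnames(counts)   # "yes" "no"

# Replace just the column names, leaving the row names alone
colnames(counts) <- c("Yes", "No")
dimnames(counts)[[2]]   # "Yes" "No"
```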

Subscripting Matrices

When we covered vectors, you saw that we can use square brackets with one of five input types to extract data. This included examples such as the following:

• Select the first five elements.
• Select all but the sixth element.
• Select all values greater than 5.
• Select the "A", "C", and "E" elements.

With a matrix, which has rows and columns, these selections no longer seem particularly relevant. However, we may wish to select specific rows and columns, which we specify using two separate inputs within the square brackets, separated by a comma:

MATRIX [ Input specifying rows to return, Input specifying columns to return ]

Subscripting Matrices: Blanks, Positives, and Negatives

First, let’s look at using blank subscripts for both rows and columns. The following returns all rows and all columns:

> X [ , ]  # Blank for rows, blank for columns
     [,1] [,2] [,3] [,4]
[1,]    4    7    5    4
[2,]    5    6    5    6
[3,]    2    1    0    9

Next, we’ll use vectors of positive integers for both the rows and columns:

> X [ 1:2 , c(1, 3, 4) ]  # +ives for rows, +ives for columns
     [,1] [,2] [,3]
[1,]    4    5    4
[2,]    5    5    6

In this example, we returned the first two rows and the first, third, and fourth columns.


Note: Column Index

In this example, note that we selected rows 1 and 2 with columns 1, 3, and 4, and R returned the correct subset. The column index of the new matrix, however, is [,1] [,2] [,3].

This is because the subset is a completely new matrix with its own column index, and it has no “memory” of the manner in which it was created (in other words, the index is not “1, 3, 4”). If, however, the matrix we were subsetting had dimension names, the row/column names would be retained in the sub-matrix.


So far, we have used blanks on the rows and columns, then vectors of positive integers for both rows and columns. However, we can also specify different input types for the rows and columns, as shown in this example:

> X [  , -2 ]  # Blank for rows, -ives for columns
     [,1] [,2] [,3]
[1,]    4    5    4
[2,]    5    5    6
[3,]    2    0    9

Here, we use a blank for the rows (so all rows are returned) and a negative integer for the columns (so all but the second column are returned).
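The input types can be combined in other ways too. As a small sketch (the name sub is an assumption for illustration), this selects rows with positive integers while excluding columns with negative integers:

```r
aVector <- c(4, 5, 2, 7, 6, 1, 5, 5, 0, 4, 6, 9)
X <- matrix(aVector, nrow = 3)

# Positive integers for rows, negative integers for columns:
# rows 2 and 3, every column except the first and last
sub <- X[2:3, -c(1, 4)]
sub        # a 2 x 2 matrix holding 6, 1 (column 2) and 5, 0 (column 3)
dim(sub)   # 2 2
```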

Dropping Dimensions

In the preceding example, we referenced data from a 3×4 matrix, but always returned at least two rows/columns. If we instead reference a single row or column, the dimensions of the output matrix are dropped, so a simpler structure (in fact, a vector) is returned:

> X [ , 1:2 ]   # First 2 columns - returns a matrix
     [,1] [,2]
[1,]    4    7
[2,]    5    6
[3,]    2    1
> X [ , 1 ]     # First column - returns a vector
[1] 4 5 2

Because most R functions work with vectors, the “dropping” of dimensions in this way is often what we want. However, if we want to reference the data but ensure the dimensions are not dropped, we can use an argument called drop within the square brackets, as shown here:

> X [ , 1 ]                # Returns a vector
[1] 4 5 2
> X [ , 1, drop = FALSE ]  # Use drop to maintain dimensions
     [,1]
[1,]    4
[2,]    5
[3,]    2
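To confirm what is returned in each case, we can inspect the dimensions of the result. This sketch rebuilds the matrix so that it is self-contained:

```r
aVector <- c(4, 5, 2, 7, 6, 1, 5, 5, 0, 4, 6, 9)
X <- matrix(aVector, nrow = 3)

dim(X[, 1])                  # NULL: a plain vector has no dimensions
dim(X[, 1, drop = FALSE])    # 3 1: still a matrix with one column
dim(X[2, , drop = FALSE])    # 1 4: a single row kept as a matrix
is.matrix(X[2, ])            # FALSE
```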

Subscripting Matrices: Logical Values

We can use logical values to reference rows and/or columns of a matrix. To achieve this, we provide a logical vector the same length as the number of rows/columns to subscript. A simple example is shown here:

> X                     # Original Matrix
     [,1] [,2] [,3] [,4]
[1,]    4    7    5    4
[2,]    5    6    5    6
[3,]    2    1    0    9

> X [ c(TRUE, FALSE, TRUE), ]     # Logical for rows, blank for columns
     [,1] [,2] [,3] [,4]
[1,]    4    7    5    4
[2,]    2    1    0    9

In this example, a logical vector is used to subscript the matrix. We provide a logical vector of length 3, and only the rows corresponding to the TRUE values are returned (the first and third rows).

Instead of specifying a vector manually, we could use a logical statement based on one of the other columns to subscript the data. For example, let’s consider referencing only rows where the first column is not 5:

> X [ , 1 ]               # 1st column
[1] 4 5 2

> X [ , 1 ] != 5          # Where is the 1st column not 5
[1]  TRUE FALSE  TRUE

> X [ X [ , 1 ] != 5 , ]  # Use to subscript the data
     [,1] [,2] [,3] [,4]
[1,]    4    7    5    4
[2,]    2    1    0    9

This last line looks particularly complex, but it relates to syntax that is rarely used. Because a matrix is a single-mode structure, it is not a good choice for storing standard rectangular data; there is a more appropriate structure for this sort of data (the data.frame structure, covered in Hour 4) that has a simpler syntax for referencing subsets of data.
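One way to make such a subscript easier to read is to build the logical vector in a separate step. The following is a sketch only; the name keepRows and the second condition are assumptions added for illustration:

```r
aVector <- c(4, 5, 2, 7, 6, 1, 5, 5, 0, 4, 6, 9)
X <- matrix(aVector, nrow = 3)

# Build the logical row selector in its own step to keep the subscript readable
keepRows <- X[, 1] != 5 & X[, 4] > 4   # FALSE FALSE TRUE
X[keepRows, , drop = FALSE]            # only the third row: 2 1 0 9
```

Here drop = FALSE is used because only one row meets both conditions, and without it the result would be a vector.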

Subscripting Matrices: Character Values

So far, we have discussed how matrices can be referenced using blank, positive, negative, and logical inputs. If we have a matrix with row and column names, we can also use vectors of characters to refer directly to the rows and columns we wish to return. First, let’s add dimension names to our matrix example:

> dimnames(X) <- list( letters[1:3], LETTERS[1:4] )
> X
  A B C D
a 4 7 5 4
b 5 6 5 6
c 2 1 0 9

Now we can use character vectors to reference the rows and/or columns. For example, let’s reference rows “a” and “c” with all the columns:

> X [ c("a", "c"), ]   # Characters for rows, blank for columns
  A B C D
a 4 7 5 4
c 2 1 0 9

In this next example, we use a character vector to reference the columns we want to return and all the rows:

> X [ , c("A", "C", "D") ]   # Blank for rows, Characters for columns
  A C D
a 4 5 4
b 5 5 6
c 2 0 9
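Character subscripts can be used for rows and columns at the same time. As a brief sketch, the matrix is rebuilt here with its dimension names so that the code runs on its own:

```r
aVector <- c(4, 5, 2, 7, 6, 1, 5, 5, 0, 4, 6, 9)
X <- matrix(aVector, nrow = 3,
            dimnames = list(letters[1:3], LETTERS[1:4]))

X["b", "C"]                          # single value: 5
X[c("a", "c"), "D"]                  # named vector: a = 4, c = 9
X[c("a", "c"), "D", drop = FALSE]    # 2 x 1 matrix that keeps its names
```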

Arrays

At the start of this hour we introduced vectors as a structure that contains a series of values of the same mode. Next, we looked at matrices as a single-mode structure with rows and columns.

An array is a single-mode structure that can have any number of dimensions (so, in fact, a matrix in R is simply a two-dimensional array).

Similar to the previous sections in this hour on vectors and matrices, in this section we look at the following:

• Some ways to create an array
• The attributes of an array
• The ways in which we can extract information from an array

For the purposes of this hour, we will focus on three-dimensional arrays, but the code works in a similar way for arrays with any number of dimensions.

Creating Arrays

You create an array by providing a single vector input to the array function along with the dimension of the array you wish to create (as a vector of integers). The following example creates a two-dimensional array (that is, a matrix):

> aVector <- c(4, 5, 2, 7, 6, 1, 5, 5, 0, 4, 6, 9)  # Create a vector
> X <- array(aVector, dim = c(3, 4))                # Create a 2D array (matrix)
> X                                                 # Print the matrix
     [,1] [,2] [,3] [,4]
[1,]    4    7    5    4
[2,]    5    6    5    6
[3,]    2    1    0    9

If you want to create a three-dimensional array, you specify a vector of length 3 for the dim argument, as shown here:

> aVector <- c(4, 5, 2, 7, 6, 1, 5, 5, 0, 4, 6, 9)  # Create a vector
> X <- array(rep(aVector, 3), dim = c(3, 4, 3))     # Create a 3D array
> X                                                 # Print the array
, , 1

     [,1] [,2] [,3] [,4]
[1,]    4    7    5    4
[2,]    5    6    5    6
[3,]    2    1    0    9

, , 2

     [,1] [,2] [,3] [,4]
[1,]    4    7    5    4
[2,]    5    6    5    6
[3,]    2    1    0    9

, , 3

     [,1] [,2] [,3] [,4]
[1,]    4    7    5    4
[2,]    5    6    5    6
[3,]    2    1    0    9
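Because the example above repeats the same 12 values three times, it does not show the order in which an array is filled. The following sketch uses the distinct values 1 to 24 (the name arr and the 2 x 3 x 4 shape are assumptions chosen for illustration) to make the fill order visible:

```r
arr <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 "layers"

arr[1, 1, 1]   # 1: the first value fills the first cell
arr[2, 1, 1]   # 2: the row index varies fastest
arr[1, 2, 1]   # 3: then the column index
arr[1, 1, 2]   # 7: then the third dimension, after 2 x 3 = 6 values
```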

Array Attributes

Attributes for arrays can be referenced in exactly the same way as you saw for matrices. Some examples of extracting array attributes can be seen here:

> mode(X)       # Mode of array
[1] "numeric"
> length(X)     # Number of elements in array
[1] 36
> dim(X)        # Dimension of array
[1] 3 4 3

As with matrices, you specify dimension names using the dimnames function:

> dimnames(X) <- list(letters[1:3], LETTERS[1:4], c("X1", "X2", "X3"))
> X
, , X1

  A B C D
a 4 7 5 4
b 5 6 5 6
c 2 1 0 9

, , X2

  A B C D
a 4 7 5 4
b 5 6 5 6
c 2 1 0 9

, , X3

  A B C D
a 4 7 5 4
b 5 6 5 6
c 2 1 0 9

Subscripting Arrays

To extract data from an array, you provide one input per dimension. Therefore, for a three-dimensional array, you need to provide three inputs, each of which can be one of the five types of input (blank, positives, negatives, logicals, or characters).

Some examples of array subscripting with our sample (three-dimensional) array are shown here:

> X [ , , 1 ]         # Blank / Blank / Positive
  A B C D
a 4 7 5 4
b 5 6 5 6
c 2 1 0 9
> X [ -1, 1:2, 1:2 ]  # Negative / Positive / Positive
, , X1

  A B
b 5 6
c 2 1

, , X2

  A B
b 5 6
c 2 1
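Dimensions are dropped for arrays in the same way as for matrices, and character subscripts work when dimension names are present. This is a sketch under assumed names (arr and its labels are hypothetical):

```r
arr <- array(1:24, dim = c(2, 3, 4),
             dimnames = list(c("r1", "r2"), c("A", "B", "C"),
                             c("L1", "L2", "L3", "L4")))

dim(arr[1, , ])                   # 3 4: one row selected, so a matrix remains
dim(arr[, , "L1", drop = FALSE])  # 2 3 1: still a three-dimensional array
arr["r2", "C", ]                  # named vector: 6 12 18 24
```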

Relationship Between Single-Mode Data Objects

So far in this hour we have looked at the three “single-mode” data structures in R: vectors, matrices, and arrays. You have seen how to create these structures, how to query attributes of the structures, and how to extract data from them.

Table 3.3 describes the key aspects of each of these structures.

TABLE 3.3 Comparison of Single-Mode Data Structures (table image not reproduced here)

During this hour you may have noticed a pattern emerging with the three structures, which is also reflected in Table 3.3. In fact, these three structures are very closely related because they are all, fundamentally, vectors. The only thing that distinguishes vectors from matrices and arrays is the dimension of the structure, which allows you to print, manage, and reference the data in a particular manner.

This allows you to very easily convert from one structure to another by (re)specifying the dimension with the dim function. Consider the following code, which converts a vector first to a matrix and then to a three-dimensional array:

> X <- c(2, 6, 5, 1, 2, 8, 9, 4, 3, 1, 9, 4)   # Create a vector
> X                                            # Print the vector
 [1] 2 6 5 1 2 8 9 4 3 1 9 4
> length(X)                                    # Vector has 12 elements
[1] 12
> dim(X)                                       # Vectors have no "dimension"
NULL

> dim(X) <- c(3, 4)                            # Assign a dimension (3 x 4)
> X                                            # Print X - it is now a matrix
     [,1] [,2] [,3] [,4]
[1,]    2    1    9    1
[2,]    6    2    4    9
[3,]    5    8    3    4

> dim(X) <- c(2, 3, 2)                         # Assign a new dimension (2 x 3 x 2)
> X                                            # Print X - it is now a 3D array
, , 1

     [,1] [,2] [,3]
[1,]    2    5    2
[2,]    6    1    8

, , 2

     [,1] [,2] [,3]
[1,]    9    3    9
[2,]    4    1    4

This also allows you to treat matrices and arrays as vectors when applying simple functions, for example:

> dim(X)     # X is an array
[1] 2 3 2
> median(X)  # Median of X
[1] 4
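The conversion also works in the other direction. As a minimal sketch, assigning NULL to the dimension turns a matrix back into a plain vector, and summary functions see the same 12 values either way:

```r
X <- c(2, 6, 5, 1, 2, 8, 9, 4, 3, 1, 9, 4)
dim(X) <- c(3, 4)    # X is now a 3 x 4 matrix

sum(X)               # 54, computed over all 12 values
range(X)             # 1 9

dim(X) <- NULL       # remove the dimension: X is a plain vector again
is.matrix(X)         # FALSE
length(X)            # 12
```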

Summary

In this hour, we have looked at how the four different “modes” of data in R (numeric, character, logical, and complex) can be stored in the three single-mode structures: vectors, matrices, and arrays. We have looked at the ways in which we can create each structure, the attributes each structure has, and how to reference subsets of data from each structure.

Although we have covered matrices and arrays in this hour, the majority of the time was spent looking at vectors in some detail. This reflects the fact that we typically work with vectors as a primary data structure, so familiarity with how to manage these objects is essential.

Of course, in this hour we have looked only at “single mode” structures (that is, those structures that only hold a single mode of data). In the next hour, we will look at two data structures that allow us to store data with more than one mode: lists and data frames.

Q&A

Q. Can I mix the five types of subscript input?

A. Not really, because one of two things will happen: either R will convert all elements in the subscript input to a single type or, if you use positives and negatives together, R will return an error.
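Both outcomes can be seen in a short sketch. The vector x and the name mixed are assumptions for illustration, and tryCatch is used only so that the error does not stop the script:

```r
x <- c(A = 1, B = 2, C = 3)

# A logical mixed with a number is converted to numeric: c(TRUE, 3) becomes c(1, 3)
x[c(TRUE, 3)]    # elements "A" and "C"

# Mixing positive and negative integers is an error
mixed <- tryCatch(x[c(-1, 2)], error = function(e) "error")
mixed            # "error"
```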

Q. Why is a matrix not a suitable structure to hold standard rectangular datasets?

A. Because it is a single-mode structure, it isn’t capable of storing (say) a numeric column and a character column from a dataset together. In the next hour, you will see a more natural structure for storing this sort of data.

Q. What if I try to reference data outside of the dimensions?

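A. For a matrix or array, referencing a row or column that does not exist produces a “subscript out of bounds” error. For a vector, missing values will be returned, as shown in this example: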

> X <- c(A = 1, B = 2, C = 3)
> X
A B C
1 2 3
> X[2:5]
   B    C <NA> <NA>
   2    3   NA   NA
> X[c("A", "C", "E")]
   A    C <NA>
   1    3   NA
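The example above uses a vector. As a sketch of how a matrix behaves when a row that does not exist is requested (the names m and msg are illustrative, and tryCatch simply captures the error message):

```r
m <- matrix(1:12, nrow = 3)

# Row 4 does not exist, so R signals an error rather than returning NA
msg <- tryCatch(m[4, 1], error = function(e) conditionMessage(e))
msg      # "subscript out of bounds" (in an English locale)

# Treating the matrix as a vector behaves like the vector case
m[13]    # NA
```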

Q. How do missing values impact referencing with logical values?

A. If you use a vector containing missing values in a logical statement, the result will be NA wherever the input is missing (because you don’t know whether the missing value would have met the condition). When you use this result to subscript, a missing value is returned for each NA. Consider the following example:

> ID
[1] 1 2 3 4 5
> AGE
[1] 18 35 25 NA 23
> AGE >= 25
[1] FALSE  TRUE  TRUE    NA FALSE
> ID [ AGE >= 25 ]
[1]  2  3 NA
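If the missing values are not wanted in the result, two common approaches are to wrap the condition in which or to exclude missing values explicitly with is.na. This sketch recreates ID and AGE with the values shown above:

```r
ID  <- 1:5
AGE <- c(18, 35, 25, NA, 23)

ID[which(AGE >= 25)]           # 2 3: which() returns only the TRUE positions
ID[!is.na(AGE) & AGE >= 25]    # 2 3: the NA is turned into FALSE first
```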

Workshop

The workshop contains quiz questions and exercises to help you solidify your understanding of the material covered. Try to answer all questions before looking at the “Answers” section that follows.

Quiz

1. What are the four different “modes” of data in R?

2. Why do we refer to vectors, matrices, and arrays as “single-mode” structures?

3. What function can you use to create a vector of repeated sequences?

4. What are the five different “subscript” inputs you can use to reference a subset of data from a vector?

5. What is the difference between the cbind and rbind functions?

6. Why do we use a comma within the square brackets when subscripting a matrix (for example, mat[1:2, -1])?

7. What is the difference between a matrix and an array?

Answers

1. The four modes of data are numeric, character, logical, and complex.

2. “Single mode” refers to the fact that these structures can only store data of a single mode (for example, a “numeric” vector or a “character” matrix). Vectors, matrices, and arrays cannot hold data of more than one “mode.”

3. You can use the rep function to create a vector of repeated sequences.

4. The five “subscript” input types are blanks, vectors of positive integers, vectors of negative integers, vectors of logical values, and vectors of characters.

5. Both functions create a matrix based on a number of vector inputs. The cbind function specifies that provided vectors are to be used as the columns of the matrix, whereas rbind specifies that the provided vectors should be used to define the rows of the matrix.

6. We use a comma to separate the “row” subscripts from the “column” subscripts. Therefore, the line mat[1:2, -1] specifies that we want to return the first two rows, and all but the first column of mat.

7. A matrix is strictly a two-dimensional structure (it has rows and columns). An array is a structure with any number of dimensions (that is, we could create a three-, four-, 10-, or 100-dimensional array). A two-dimensional array is exactly equal to a matrix.

Activities

1. There is an object in R called pi. What are the length and mode of pi?

2. Create the following vectors in R:

[1] 6 3 4 8 5 2 7 9 4 5
[1] TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
[1] -1 0 1 2 3
[1] 5 4 3 2 1
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
[1] 1 2 3 1 2 3 1 2 3
[1] "A" "A" "A" "A"
[1] "A" "A" "A" "A" "B" "B" "B" "C" "C" "D"

3. Using the LETTERS vector, print the following:

• The first four letters
• All but the first four letters
• The “even” letters (that is, B, D, F, H, ...)

4. Create a numeric vector of length 10 using a selection of integers between 1 and 9. Assign the first 10 elements of the letters vector as the element names of your vector. Using this vector, do the following:

• Select the first and last values of the vector.
• Select all values of the vector greater than 3.
• Select all values of the vector between 2 and 7.
• Select all values of the vector that are not 5.
• Select the "d", "e", and "g" elements of your vector.

5. Create a 3×4 matrix containing numeric values. Print the first two rows and all but the last column of this matrix.
