In this chapter, we are going to cover the basic data structure in R—a vector. Understanding vectors is the foundation for all the subsequent chapters. You will learn how to perform efficient operations on numeric and logical vectors and how to create subsets. After this, you will learn how to write custom functions in order to expand and customize R's capabilities. Working with dates and time series and the use of graphical functions are introduced at the end of this chapter.
A vector is an ordered collection of values of the same type (or mode, in R terminology). As mentioned in the previous chapter, the three types of values that are useful for most purposes (including the topics of this book) are numeric, character, and logical. In this section, you are going to learn about several methods to create vectors, check the properties of interest for the given vectors, and perform operations involving pairs of vectors. You are also going to learn how to save the objects we create in the temporary computer memory via assignment.
Vectors are the most basic data structures in R since single elements (such as the number 10) are also represented in R by vectors (of length 1). As we have previously seen, when we enter a numeric value on the command line, it is printed on the screen. The number in square brackets to the left of the value is, in fact, the position of the leftmost element in the respective printed line. For example, the [1] part in the following output means that the first (and only) printed element, 10, of the particular vector is at position 1:
> 10
[1] 10
Vectors can be created from individual elements with the c function, which stands for combine. Let's take a look at the following examples:
> c(1,5,10,4)
[1] 1 5 10 4
> c("cat","dog","mouse","apple")
[1] "cat" "dog" "mouse" "apple"
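Since all elements of a vector must share one type, c converts mixed input to a common type. The following is a minimal sketch of this behavior; the object names mixed and nums are only illustrative, and the class function (which reports the class of an object) is used here merely as a convenient check:

```r
# Combining a number with a character string gives a character vector:
# mixed holds the strings "1" and "cat"
mixed = c(1, "cat")
class(mixed)
# [1] "character"

# Combining numeric and logical values gives a numeric vector
# (TRUE becomes 1 and FALSE becomes 0)
nums = c(1, TRUE, FALSE)
nums
# [1] 1 1 0
class(nums)
# [1] "numeric"
```

In other words, a single vector never holds values of different types; one of them is always converted.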
Sequential numeric vectors can be easily created with the : operator. Such vectors have many uses in R. The : operator creates an ordered vector starting at the value to the left of the : symbol and ending at the value to the right of the : symbol, as follows:
> 7:20
[1] 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Or, when the first argument is larger than the second one:
> 33:24
[1] 33 32 31 30 29 28 27 26 25 24
A logical vector can also be created with the c function. Remember that TRUE and FALSE are special values and not characters. Therefore, these values should be typed without quotes:
> c(TRUE,FALSE,TRUE,TRUE)
[1] TRUE FALSE TRUE TRUE
However, in practice, the creation of logical vectors is usually associated with employing a conditional operator on a vector rather than manually typing a sequence of logical values. We will elaborate on this later.
There are several functions that we can use to convert between vector types. The two most useful ones are as.numeric and as.character, which are used to convert a vector to a numeric or character vector, respectively. There are other functions to convert objects of a particular class into another, which we'll see in subsequent chapters. Let's take a look at the following examples:
> 33:24
[1] 33 32 31 30 29 28 27 26 25 24
> as.character(33:24)
[1] "33" "32" "31" "30" "29" "28" "27" "26" "25" "24"
> as.numeric(as.character(33:24))
[1] 33 32 31 30 29 28 27 26 25 24
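Conversion to numeric only succeeds for character values that look like numbers. The following sketch, with the illustrative object name x, shows what happens otherwise; the missing value NA that appears in the result is not discussed in this section and is mentioned here only to explain the outcome:

```r
# "1" and "2.5" are converted to numbers; "abc" cannot be converted,
# so R returns NA for it and issues a warning about the coercion
x = as.numeric(c("1", "2.5", "abc"))

# x now holds 1, 2.5, and NA; is.na flags the failed conversion
is.na(x[3])
# [1] TRUE
```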
A factor is a special type of encoding for a vector, where the vector has a defined set of acceptable values or levels. Such an encoding is most common in statistical uses of R, for example, when defining categorical variables to identify treatments in an experiment. Using factors is not essential for the purposes of this book. However, encountering factors is inevitable when working with R (for example, in R versions prior to 4.0.0, character columns were encoded as factors by default when reading a table from a file), so at the very least, we need to be aware of this data structure.
The factor function can be used to convert a vector into a factor:
> factor(c("cat","dog","dog"))
[1] cat dog dog
Levels: cat dog
> factor(c("cat","dog","dog","mouse"))
[1] cat dog dog mouse
Levels: cat dog mouse
As you can see, the acceptable levels of the resulting factor object (which are, by default, defined as the set of unique values that the vector has) are printed along with its values.
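As a small sketch of how a factor relates to the conversion functions introduced earlier, the following example stores a factor in an object named f (an illustrative name) and inspects it. The levels function, which returns the levels of a factor, is not covered in the text above and is used here only for demonstration:

```r
f = factor(c("cat", "dog", "dog", "mouse"))

# The levels are the unique values, sorted alphabetically:
# "cat", "dog", and "mouse"
levels(f)

# as.character recovers the original character values
as.character(f)

# as.numeric returns the position of each value among the levels,
# not the values themselves
as.numeric(f)
# [1] 1 2 2 3
```

The last expression is the reason a factor should be converted with as.character first when its labels are needed.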
So far, we have used R by entering standalone expressions on the command line. As mentioned in the previous chapter, the returned objects are not saved anywhere this way. Therefore, we cannot perform sequential operations in which each created object serves as an input for the next step(s). However, saving intermediate results is essential to automate processes.
Saving objects to the temporary memory is called assignment. By temporary, we mean that the objects are deleted when we shut down the computer (or quit R), as opposed to writing to a file on the hard drive, where the information will permanently remain unless it's deleted. Assignment is performed by an assignment expression, which is composed of the name we would like to give the new object, the assignment operator =, and the object we would like to save. For example, we can save the 1:10 sequential vector to an object named v as follows:
> v = 1:10
We can then access our newly created object using its name the same way we accessed predefined objects (such as pi) in the previous chapter:
> v
[1] 1 2 3 4 5 6 7 8 9 10
There is another assignment operator in R, namely <-:
> v <- 1:10
Throughout this book, the = operator is used since it is easier to type.
Also, note the difference between the assignment operator = and the equality conditional operator == (see the previous chapter). The = operator is used for assignment:
> one = 1
> two = 2
> one = two
> one
[1] 2
> two
[1] 2
The == operator is used to compare:
> one = 1
> two = 2
> one == two
[1] FALSE
When assigning an object with a name that already exists in memory, the older object is deleted and replaced by the new one:
> x = 55
> x
[1] 55
> x = "Hello"
> x
[1] "Hello"
The ls function returns a character vector with the names of all the user-defined objects (in a given environment, with the default one being the global R environment). For example, so far we have assigned four objects in memory:
> ls()
[1] "one" "two" "v" "x"
We can remove objects from memory by using the rm function. Let's take a look at the following example:
> rm("v")
> ls()
[1] "one" "two" "x"
It is sometimes useful to remove all objects from memory, for example, when we want to run a given code section without worrying that previously defined objects will interfere. This can be done by passing the whole list of objects currently in memory to the rm function as follows:
> rm(list = ls())
> ls()
character(0)
The character(0) output indicates an empty character vector.
Removing all objects can also be achieved by navigating to Misc | Remove all objects (RGui) or Session | Clear workspace… (RStudio). The reason for writing the list=ls() part will become evident after reading the Using functions with several parameters section in this chapter.
Many functions in R are intended to work with vectors. This section reviews some commonly used functions for examining the properties of vectors.
For example, we may be interested in the mean, minimal, and maximal values of a given vector. To get these, we can use the mean, min, and max functions, respectively:
> v = 1:10
> mean(v)
[1] 5.5
> min(v)
[1] 1
> max(v)
[1] 10
We can also get both the minimal and maximal values at once with the range function:
> range(v)
[1] 1 10
The length function returns the number of elements a given vector has:
> v = c(4,2,3,9,1)
> length(v)
[1] 5
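These summary functions can be combined in ordinary arithmetic expressions. As a brief sketch, the mean of a vector can be reproduced from its sum and its length (the sum function adds up all the elements of a vector), and the spread of the values can be obtained from max and min:

```r
v = c(4, 2, 3, 9, 1)

# The mean is the sum of the elements divided by their number
sum(v) / length(v)
# [1] 3.8
mean(v)
# [1] 3.8

# Difference between the largest and the smallest value
max(v) - min(v)
# [1] 8
```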
With logical vectors, we sometimes would like to know whether they contain at least one TRUE value or whether all of their values are TRUE. This can be achieved with the any and all functions, respectively:
> l = c(TRUE, FALSE, FALSE, TRUE)
> any(l)
[1] TRUE
> all(l)
[1] FALSE
If we would like to know how many TRUE values a vector contains, we can utilize the default conversion from logical to numeric values, which takes place when arithmetic functions are applied to a logical vector:
> l = c(TRUE, FALSE, FALSE, TRUE)
> sum(l)
[1] 2
In this example, each TRUE value was first converted to 1 and each FALSE value to 0. Therefore, the vector c(TRUE,FALSE,FALSE,TRUE) became the vector c(1,0,0,1) and the sum of this vector's elements is 2.
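The same conversion makes a few other summaries straightforward. The following sketch shows two of them: the mean of a logical vector gives the proportion of TRUE values, and sum applied to a condition counts the elements that satisfy it. The vector v and the threshold 3 are chosen only for illustration:

```r
l = c(TRUE, FALSE, FALSE, TRUE)

# Proportion of TRUE values: (1 + 0 + 0 + 1) / 4
mean(l)
# [1] 0.5

# Number of elements in v that are larger than 3 (these are 4 and 9)
v = c(4, 2, 3, 9, 1)
sum(v > 3)
# [1] 2
```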
The which function returns the positions of all the TRUE elements within a logical vector:
> which(l)
[1] 1 4
Here, a vector of length 2 was returned since there are two TRUE values in the vector l. The two values of this vector are 1 and 4 since the first TRUE value occupies the first position in the vector l, while the second TRUE value occupies the fourth position.
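In practice, which is usually applied to the result of a condition rather than to a manually typed logical vector. A minimal sketch, using an illustrative vector v:

```r
v = c(4, 2, 3, 9, 1)

# Positions of the elements that are larger than 3
which(v > 3)
# [1] 1 4

# Position of the largest element
which(v == max(v))
# [1] 4
```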
The last useful function we will mention is the unique function, which returns the unique elements of a vector; that is, it returns the set of elements the vector consists of, without repetitions. Let's take a look at the following example:
> v = c(5,6,2,2,3,0,-1,2,5,6)
> unique(v)
[1] 5 6 2 3 0 -1
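Combining unique with length gives the number of distinct values, which also offers a simple way to check whether a vector contains any repetitions. The following is a short sketch of that idea:

```r
v = c(5, 6, 2, 2, 3, 0, -1, 2, 5, 6)

# Number of distinct values in v
length(unique(v))
# [1] 6

# If no value were repeated, both lengths would be equal
length(unique(v)) == length(v)
# [1] FALSE
```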
As opposed to functions that treat the vector as a single entity (as seen in the previous section), some functions work on each element of the vector as if it were a separate entity and return a vector of the results (which, therefore, has the same number of elements as the input vector). In fact, all arithmetic and logical operators work this way; we did not have a chance to witness this earlier since we always used vectors of length 1. Let's take a look at the following examples:
> 1:10 * 2
[1] 2 4 6 8 10 12 14 16 18 20
> 1:10 - 10
[1] -9 -8 -7 -6 -5 -4 -3 -2 -1 0
> sqrt(c(4,16,64))
[1] 2 4 8
In the first expression, we multiplied the vector (1, 2, ..., 10) by 2, which resulted in a vector of 10 elements where the first element is equal to 1*2, the second is equal to 2*2, the third is equal to 3*2, and so on, up to 10*2.
Logical operators function in the same way, as follows:
> x = 1:5
> x
[1] 1 2 3 4 5
> x >= 3
[1] FALSE FALSE TRUE TRUE TRUE
Here, each of the values 1, 2, 3, 4, and 5 was evaluated for being greater than or equal to 3, giving FALSE for 1 and 2 and TRUE for 3, 4, and 5.
If we want to check whether a given value from one vector is present in another, we can use the %in% operator. With %in%, we basically ask whether the value(s) of a vector on the left match any of the values of a vector on the right:
> 1 %in% 1:10
[1] TRUE
> 11 %in% 1:10
[1] FALSE
For these simple examples, we can do without the %in% operator (see the following examples). Its utility will become apparent towards the end of this chapter, when we want, for instance, to look for each element of a long vector A and check whether it has a match in a long vector B. Here are the alternatives to the preceding expressions:
> any(1:10 == 1)
[1] TRUE
> any(1:10 == 11)
[1] FALSE
In these two examples, we encompass the logical operation within the any function to check whether the resulting logical vector has at least one TRUE value.
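When the vector on the left has more than one element, %in% returns one logical value per element. The following sketch uses two short illustrative vectors, a and b, to show this:

```r
a = c(1, 5, 11)
b = 1:10

# One result per element of a:
# TRUE, TRUE, FALSE (1 and 5 occur in b, while 11 does not)
a %in% b

# Number of elements of a that have a match in b
sum(a %in% b)
# [1] 2
```

Note that the result always has the length of the vector on the left, regardless of the length of the vector on the right.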
Now, let's move on to character vectors. When working with character values, the paste function can be useful to combine separate elements into a single character string. The sep parameter of this function determines which characters will be used to separate the values (a single space is the default). Let's take a look at the following example:
> paste("There are", "5", "books.")
[1] "There are 5 books."
The paste function also works with vectors that have more than one element:
> paste("Image", 1:5)
[1] "Image 1" "Image 2" "Image 3" "Image 4" "Image 5"
Note that the paste function automatically converts numeric values into characters, so numbers and character strings can be combined freely:
> x = 80
> paste("There are", x, "books.")
[1] "There are 80 books."
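The effect of the sep parameter mentioned earlier can be sketched as follows; the separators used here (an underscore and an empty string) are arbitrary choices for illustration:

```r
# Use an underscore instead of the default single space
paste("Image", 1:3, sep = "_")
# [1] "Image_1" "Image_2" "Image_3"

# An empty string joins the values with nothing in between
paste("Image", 1:3, sep = "")
# [1] "Image1" "Image2" "Image3"
```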
In the previous chapter, we only used operators on two vectors of length 1. In this chapter, so far, we have used operations involving one vector of length 1 and another of length greater than 1. What happens when we perform an operation involving two vectors that both have a length greater than 1?
If we have two vectors of exactly the same length, the operation is performed on each consecutive pair of elements taken from the two vectors, as follows:
> c(1,2,3) * c(10,20,30)
[1] 10 40 90
In this example, 1 is multiplied by 10, 2 is multiplied by 20, and 3 is multiplied by 30, and the three results are combined into a single vector of length 3.
When the lengths of the two vectors are unequal, the shorter vector is recycled before the operation is performed. In other words, values at the beginning of the shorter vector are attached to its end, sequentially and as many times as necessary, until the lengths of both vectors match. The simplest case, which we witnessed in the previous section, is the one that involves one vector of length 1 and another vector of length greater than 1. We can describe what happens in such a case as the recycling of the vector that has one element until it matches the length of the longer vector. For example, when executing the first of these two expressions, it is as if we are performing the second:
> 1:4 * 3
[1] 3 6 9 12
> 1:4 * c(3,3,3,3)
[1] 3 6 9 12
In the same way, in the following example, the vector c(3,5) is recycled until it is of length 4, to c(3,5,3,5). The result is c(1,2,3,4) multiplied by c(3,5,3,5):
> c(1,2,3,4) * c(3,5)
[1] 3 10 9 20
When the length of the longer vector is not a multiple of the length of the shorter vector, recycling is incomplete and we receive a warning message. Nevertheless, the operation is carried out. In the next example, the vector c(1,10,100) is of length 3, while the vector 1:5 is of length 5. The vector c(1,10,100) is recycled to c(1,10,100,1,10), which is the same length as the vector c(1,2,3,4,5), as follows:
> 1:5 * c(1,10,100)
[1] 1 20 300 4 50
Warning message:
In 1:5 * c(1, 10, 100) :
  longer object length is not a multiple of shorter object length
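To make the recycled vectors visible, we can construct them explicitly. The following sketch uses the rep function with its length.out parameter, which repeats a vector until the requested length is reached; rep is not covered in the text above and serves here only as an illustration, and the object names short and recycled are arbitrary:

```r
short = c(3, 5)

# Repeat the elements of short until the result has 4 elements
recycled = rep(short, length.out = 4)
recycled
# [1] 3 5 3 5

# Multiplying by the short vector or by its recycled version
# gives exactly the same result
identical(c(1, 2, 3, 4) * short, c(1, 2, 3, 4) * recycled)
# [1] TRUE

# Incomplete recycling: the result holds 1, 10, 100, 1, and 10,
# so the last repetition of the shorter vector is cut off
rep(c(1, 10, 100), length.out = 5)
```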