The vectors, matrices, and arrays that we have seen so far contain elements that are all of the same type. Lists and data frames are two types that let us combine different types of data in a single variable.
After reading this chapter, you should:
Be able to use length, names, and other functions to inspect and manipulate these types
Understand what NULL is and when to use it
A list is, loosely speaking, a vector where each element can be of a different type. This section concerns how to create, index, and manipulate lists.
Lists are created with the list function, and specifying the contents works much like the c function that we’ve seen already. You simply list the contents, with each argument separated by a comma. List elements can be of any variable type (vectors, matrices, even functions):
(a_list <- list(
  c(1, 1, 2, 5, 14, 42), #See http://oeis.org/A000108
  month.abb,
  matrix(c(3, -8, 1, -3), nrow = 2),
  asin
))
## [[1]]
## [1]  1  1  2  5 14 42
##
## [[2]]
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
##
## [[3]]
##      [,1] [,2]
## [1,]    3    1
## [2,]   -8   -3
##
## [[4]]
## function (x)  .Primitive("asin")
As with vectors, you can name elements during construction, or afterward using the names function:
names(a_list) <- c("catalan", "months", "involutary", "arcsin")
a_list
## $catalan
## [1]  1  1  2  5 14 42
##
## $months
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
##
## $involutary
##      [,1] [,2]
## [1,]    3    1
## [2,]   -8   -3
##
## $arcsin
## function (x)  .Primitive("asin")
(the_same_list <- list(
  catalan    = c(1, 1, 2, 5, 14, 42),
  months     = month.abb,
  involutary = matrix(c(3, -8, 1, -3), nrow = 2),
  arcsin     = asin
))
## $catalan
## [1]  1  1  2  5 14 42
##
## $months
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
## [12] "Dec"
##
## $involutary
##      [,1] [,2]
## [1,]    3    1
## [2,]   -8   -3
##
## $arcsin
## function (x)  .Primitive("asin")
It isn’t compulsory, but it helps if the names that you give elements are valid variable names.
It is even possible for elements of lists to be lists themselves:
(main_list <- list(
  middle_list = list(
    element_in_middle_list = diag(3),
    inner_list = list(
      element_in_inner_list = pi^1:4,
      another_element_in_inner_list = "a"
    )
  ),
  element_in_main_list = log10(1:10)
))
## $middle_list
## $middle_list$element_in_middle_list
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1
##
## $middle_list$inner_list
## $middle_list$inner_list$element_in_inner_list
## [1] 3.142
##
## $middle_list$inner_list$another_element_in_inner_list
## [1] "a"
##
##
##
## $element_in_main_list
##  [1] 0.0000 0.3010 0.4771 0.6021 0.6990 0.7782 0.8451 0.9031 0.9542 1.0000
In theory, you can keep nesting lists forever. In practice, current versions of R will throw an error once you start nesting your lists tens of thousands of levels deep (the exact number is machine specific). Luckily, this shouldn’t be a problem for you, since real-world code where nesting is deeper than three or four levels is extremely rare.
Due to this ability to contain other lists within themselves, lists are considered to be recursive variables. Vectors, matrices, and arrays, by contrast, are atomic. (Variables can either be recursive or atomic, never both; Appendix A contains a table explaining which variable types are atomic, and which are recursive.) The functions is.recursive and is.atomic let us test variables to see what type they are:
is.atomic(list())
## [1] FALSE
is.recursive(list())
## [1] TRUE
is.atomic(numeric())
## [1] TRUE
is.recursive(numeric())
## [1] FALSE
Like vectors, lists have a length. A list’s length is the number of top-level elements that it contains:
length(a_list)
## [1] 4
length(main_list) #doesn't include the lengths of nested lists
## [1] 2
Again, like vectors, but unlike matrices, lists don’t have dimensions. The dim function correspondingly returns NULL:
dim(a_list)
## NULL
nrow, NROW, and the corresponding column functions work on lists in the same way as on vectors:
nrow(a_list)
## NULL
ncol(a_list)
## NULL
NROW(a_list)
## [1] 4
NCOL(a_list)
## [1] 1
Unlike with vectors, arithmetic doesn’t work on lists. Since each element can be of a different type, it doesn’t make sense to be able to add or multiply two lists together. It is possible to do arithmetic on list elements, however, assuming that they are of an appropriate type. In that case, the usual rules for the element contents apply. For example:
l1 <- list(1:5)
l2 <- list(6:10)
l1[[1]] + l2[[1]]
## [1]  7  9 11 13 15
More commonly, you might want to perform arithmetic (or some other operation) on every element of a list. This requires looping, and will be discussed in Chapter 8.
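As a brief preview of that idea, the sketch below uses lapply, a base R function that applies a function to each element of a list and returns a list of the results. The list name numbers and its contents are made up for illustration.

```r
# Hypothetical list of numeric vectors, for illustration only
numbers <- list(a = 1:3, b = c(10, 20))

# lapply calls the function once per element and returns a list
doubled <- lapply(numbers, function(x) x * 2)

doubled$a
doubled$b
```

The result has the same length and names as the input, which makes lapply a natural fit when every element is of a type that supports the operation.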
l <- list(
  first  = 1,
  second = 2,
  third  = list(
    alpha = 3.1,
    beta  = 3.2
  )
)
As with vectors, we can access elements of the list using square brackets, [], and positive or negative numeric indices, element names, or a logical index. The following four lines of code all give the same result:
l[1:2]
## $first
## [1] 1
##
## $second
## [1] 2
l[-3]
## $first
## [1] 1
##
## $second
## [1] 2
l[c("first", "second")]
## $first
## [1] 1
##
## $second
## [1] 2
l[c(TRUE, TRUE, FALSE)]
## $first
## [1] 1
##
## $second
## [1] 2
The result of these indexing operations is another list. Sometimes we want to access the contents of the list elements instead. There are two operators to help us do this. Double square brackets ([[]]) can be given a single positive integer denoting the index to return, or a single string naming that element:
l[[1]]
## [1] 1
l[["first"]]
## [1] 1
The is.list function returns TRUE if the input is a list, and FALSE otherwise. For comparison, take a look at the two indexing operators:
is.list(l[1])
## [1] TRUE
is.list(l[[1]])
## [1] FALSE
For named elements of lists, we can also use the dollar sign operator, $. This works almost the same way as passing the element's name as a string to the double square brackets, but has two advantages. Firstly, many IDEs will autocomplete the name for you. (In R GUI, press Tab for this feature.) Secondly, R accepts partial matches of element names:
l$first
## [1] 1
l$f #partial matching interprets "f" as "first"
## [1] 1
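Partial matching is where the two operators differ. The following sketch, using a made-up list called demo, shows that double square brackets match names exactly unless you pass exact = FALSE, which is a documented argument of [[ for lists.

```r
# Hypothetical list, for illustration only
demo <- list(first = 1, second = 2)

partial_dollar   <- demo$f                       # partial match on "first"
exact_brackets   <- demo[["f"]]                  # exact matching by default
partial_brackets <- demo[["f", exact = FALSE]]   # opt in to partial matching

partial_dollar
exact_brackets
partial_brackets
```

Because a partial match can silently pick up the wrong element if names are added later, spelling names out in full is usually the safer habit in scripts.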
To access nested elements, we can stack up the square brackets or pass in a vector, though the latter method is less common and usually harder to read:
l[["third"]]["beta"]
## $beta
## [1] 3.2
l[["third"]][["beta"]]
## [1] 3.2
l[[c("third", "beta")]]
## [1] 3.2
The behavior when you try to access a nonexistent element of a list varies depending upon the type of indexing that you have used. For the next example, recall that our list, l, has only three elements.
If we use single square-bracket indexing, then the resulting list has an element with the value NULL (and name NA, if the original list has names). Compare this to bad indexing of a vector where the return value is NA:
l[c(4, 2, 5)]
## $<NA>
## NULL
##
## $second
## [1] 2
##
## $<NA>
## NULL
l[c("fourth", "second", "fifth")]
## $<NA>
## NULL
##
## $second
## [1] 2
##
## $<NA>
## NULL
Trying to access the contents of an element with an incorrect name, either with double square brackets or a dollar sign, returns NULL:
l[["fourth"]]
## NULL
l$fourth
## NULL
Finally, trying to access the contents of an element with an incorrect numerical index throws an error, stating that the subscript is out of bounds. This inconsistency in behavior is something that you just need to accept, though the best defense is to make sure that you check your indices before you use them:
l[[4]] #this throws an error
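One way to check an index before using it is sketched below. The helper get_or_null and the list demo are hypothetical names introduced here for illustration; the helper simply compares a single numeric index against the list's length.

```r
# Hypothetical helper: returns NULL instead of throwing an error
# when a single numeric index falls outside the list
get_or_null <- function(x, i) {
  if (i >= 1 && i <= length(x)) x[[i]] else NULL
}

demo <- list(first = 1, second = 2, third = 3)

get_or_null(demo, 2)   # a valid index returns the contents
get_or_null(demo, 4)   # an invalid index returns NULL, with no error
```

This makes numeric indexing behave like name-based indexing with double square brackets, at the cost of hiding mistakes that an error would have exposed.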
Vectors can be converted to lists using the function as.list. This creates a list with each element of the vector mapping to a list element containing one value:
busy_beaver <- c(1, 6, 21, 107) #See http://oeis.org/A060843
as.list(busy_beaver)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 6
##
## [[3]]
## [1] 21
##
## [[4]]
## [1] 107
If each element of the list contains a scalar value, then it is also possible to convert that list to a vector using the functions that we have already seen (as.numeric, as.character, and so on):
as.numeric(list(1, 6, 21, 107))
## [1]   1   6  21 107
This technique won’t work in cases where the list contains nonscalar elements. This is a real issue, because as well as storing different types of data, lists are very useful for storing data of the same type, but with a nonrectangular shape:
(prime_factors <- list(
  two   = 2,
  three = 3,
  four  = c(2, 2),
  five  = 5,
  six   = c(2, 3),
  seven = 7,
  eight = c(2, 2, 2),
  nine  = c(3, 3),
  ten   = c(2, 5)
))
## $two
## [1] 2
##
## $three
## [1] 3
##
## $four
## [1] 2 2
##
## $five
## [1] 5
##
## $six
## [1] 2 3
##
## $seven
## [1] 7
##
## $eight
## [1] 2 2 2
##
## $nine
## [1] 3 3
##
## $ten
## [1] 2 5
This sort of list can be converted to a vector using the function unlist (it is sometimes technically possible to do this with mixed-type lists, but rarely useful):
unlist(prime_factors)
##    two  three  four1  four2   five   six1   six2  seven eight1 eight2
##      2      3      2      2      5      2      3      7      2      2
## eight3  nine1  nine2   ten1   ten2
##      2      3      3      2      5
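To see why flattening a mixed-type list is rarely useful, consider the small sketch below. The variable names are invented for illustration; use.names is a documented argument of unlist.

```r
# A mixed-type list: unlist coerces everything to a common type
mixed <- list(1, "a", TRUE)
flattened <- unlist(mixed)
class(flattened)

# use.names = FALSE drops generated names such as four1 and four2
factors <- list(four = c(2, 2), six = c(2, 3))
unnamed <- unlist(factors, use.names = FALSE)
unnamed
```

In the mixed case the number and the logical value both become strings, so the original types are lost.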
The c function that we have used for concatenating vectors also works for concatenating lists:
c(list(a = 1, b = 2), list(3))
## $a
## [1] 1
##
## $b
## [1] 2
##
## [[3]]
## [1] 3
If we use it to concatenate lists and vectors, the vectors are converted to lists (as though as.list had been called on them) before the concatenation occurs:
c(list(a = 1, b = 2), 3)
## $a
## [1] 1
##
## $b
## [1] 2
##
## [[3]]
## [1] 3
It is also possible to use the cbind and rbind functions on lists, but the resulting objects are very strange indeed. They are matrices with possibly nonscalar elements, or lists with dimensions, depending upon which way you want to look at them:
(matrix_list_hybrid <- cbind(
  list(a = 1, b = 2),
  list(c = 3, list(d = 4))
))
##   [,1] [,2]
## a 1    3
## b 2    List,1
str(matrix_list_hybrid)
## List of 4
##  $ : num 1
##  $ : num 2
##  $ : num 3
##  $ :List of 1
##   ..$ d: num 4
##  - attr(*, "dim")= int [1:2] 2 2
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:2] "a" "b"
##   ..$ : NULL
Using cbind and rbind in this way is something you shouldn’t do often, and probably not at all. It’s another case of R being a little too flexible and accommodating, instead of telling you that you’ve done something silly by throwing an error.
NULL is a special value that represents an empty variable. Its most common use is in lists, but it also crops up with data frames and function arguments. These other uses will be discussed later.
When you create a list, you may wish to specify that an element should exist, but should have no contents. For example, the following list contains UK bank holidays[17] for 2013 by month. Some months have no bank holidays, so we use NULL to represent this absence:
(uk_bank_holidays_2013 <- list(
  Jan = "New Year's Day",
  Feb = NULL,
  Mar = "Good Friday",
  Apr = "Easter Monday",
  May = c("Early May Bank Holiday", "Spring Bank Holiday"),
  Jun = NULL,
  Jul = NULL,
  Aug = "Summer Bank Holiday",
  Sep = NULL,
  Oct = NULL,
  Nov = NULL,
  Dec = c("Christmas Day", "Boxing Day")
))
## $Jan
## [1] "New Year's Day"
##
## $Feb
## NULL
##
## $Mar
## [1] "Good Friday"
##
## $Apr
## [1] "Easter Monday"
##
## $May
## [1] "Early May Bank Holiday" "Spring Bank Holiday"
##
## $Jun
## NULL
##
## $Jul
## NULL
##
## $Aug
## [1] "Summer Bank Holiday"
##
## $Sep
## NULL
##
## $Oct
## NULL
##
## $Nov
## NULL
##
## $Dec
## [1] "Christmas Day" "Boxing Day"
It is important to understand the difference between NULL and the special missing value NA. The biggest difference is that NA is a scalar value, whereas NULL takes up no space at all, since it has length zero:
length(NULL)
## [1] 0
length(NA)
## [1] 1
You can test for NULL using the function is.null. Missing values are not null:
is.null(NULL)
## [1] TRUE
is.null(NA)
## [1] FALSE
The converse test doesn’t really make much sense. Since NULL has length zero, we have nothing to test to see if it is missing:
is.na(NULL)
## Warning: is.na() applied to non-(list or vector) of type 'NULL'
## logical(0)
NULL can also be used to remove elements of a list. Setting an element to NULL (even if it already contains NULL) will remove it. Suppose that for some reason we want to switch to an old-style Roman 10-month calendar, removing January and February:
uk_bank_holidays_2013$Jan <- NULL
uk_bank_holidays_2013$Feb <- NULL
uk_bank_holidays_2013
## $Mar
## [1] "Good Friday"
##
## $Apr
## [1] "Easter Monday"
##
## $May
## [1] "Early May Bank Holiday" "Spring Bank Holiday"
##
## $Jun
## NULL
##
## $Jul
## NULL
##
## $Aug
## [1] "Summer Bank Holiday"
##
## $Sep
## NULL
##
## $Oct
## NULL
##
## $Nov
## NULL
##
## $Dec
## [1] "Christmas Day" "Boxing Day"
To set an existing element to be NULL, we cannot simply assign the value of NULL, since that will remove the element. Instead, it must be set to list(NULL). Now suppose that the UK government becomes mean and cancels the summer bank holiday:
uk_bank_holidays_2013["Aug"] <- list(NULL)
uk_bank_holidays_2013
## $Mar
## [1] "Good Friday"
##
## $Apr
## [1] "Easter Monday"
##
## $May
## [1] "Early May Bank Holiday" "Spring Bank Holiday"
##
## $Jun
## NULL
##
## $Jul
## NULL
##
## $Aug
## NULL
##
## $Sep
## NULL
##
## $Oct
## NULL
##
## $Nov
## NULL
##
## $Dec
## [1] "Christmas Day" "Boxing Day"
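The two kinds of assignment are easy to confuse, so here is a compact side-by-side sketch. The list holidays is a made-up, shortened stand-in for the bank holiday data.

```r
# Hypothetical, shortened list for illustration only
holidays <- list(Jan = "New Year's Day", Feb = NULL, Mar = "Good Friday")
length(holidays)                # the NULL element still counts

holidays["Mar"] <- list(NULL)   # keep the element, but empty its contents
length(holidays)                # unchanged

holidays[["Jan"]] <- NULL       # remove the element entirely
names(holidays)
```

Checking length before and after each assignment is a quick way to confirm whether an element was emptied or removed.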
R has another sort of list, the pairlist. Pairlists are used internally to pass arguments into functions, but you should almost never have to actively use them. Possibly the only time[18] that you are likely to explicitly see a pairlist is when using formals. That function returns a pairlist of the arguments of a function.
Looking at the help page for the standard deviation function, ?sd, we see that it takes two arguments, a vector x and a logical value na.rm, which has a default value of FALSE:
(arguments_of_sd <- formals(sd))
## $x
##
##
## $na.rm
## [1] FALSE
class(arguments_of_sd)
## [1] "pairlist"
For most practical purposes, pairlists behave like lists. The only difference is that a pairlist of length zero is NULL, but a list of length zero is just an empty list:
pairlist()
## NULL
list()
## list()
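Since pairlists mostly behave like lists, the usual inspection tools apply to them. The sketch below assumes a made-up function called power; formals, is.pairlist, and pairlist are all base R functions.

```r
# Hypothetical function, for illustration only
power <- function(x, n = 2) x ^ n

args_of_power <- formals(power)

names(args_of_power)         # the argument names
args_of_power$n              # the default value of n
is.pairlist(args_of_power)   # formals returns a pairlist
is.null(pairlist())          # an empty pairlist is NULL
```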
Data frames are used to store spreadsheet-like data. They can either be thought of as matrices where each column can store a different type of data, or nonnested lists where each element is of the same length.
We create data frames with the data.frame function:
(a_data_frame <- data.frame(
  x = letters[1:5],
  y = rnorm(5),
  z = runif(5) > 0.5
))
##   x        y    z
## 1 a  0.17581 TRUE
## 2 b  0.06894 TRUE
## 3 c  0.74217 TRUE
## 4 d  0.72816 TRUE
## 5 e -0.28940 TRUE
class(a_data_frame)
## [1] "data.frame"
Notice that each column can have a different type than the other columns, but that all the elements within a column are the same type. Also notice that the class of the object is data.frame, with a dot rather than a space.
In this example, the rows have been automatically numbered from one to five. If any of the input vectors had names, then the row names would have been taken from the first such vector. For example, if y had names, then those would be given to the data frame:
y <- rnorm(5)
names(y) <- month.name[1:5]
data.frame(
  x = letters[1:5],
  y = y,
  z = runif(5) > 0.5
)
##          x       y     z
## January  a -0.9373 FALSE
## February b  0.7314  TRUE
## March    c -0.3030  TRUE
## April    d -1.3307 FALSE
## May      e -0.6857 FALSE
This behavior can be overridden by passing the argument row.names = NULL to the data.frame function:
data.frame(
  x = letters[1:5],
  y = y,
  z = runif(5) > 0.5,
  row.names = NULL
)
##   x       y     z
## 1 a -0.9373 FALSE
## 2 b  0.7314 FALSE
## 3 c -0.3030  TRUE
## 4 d -1.3307  TRUE
## 5 e -0.6857 FALSE
It is also possible to provide your own row names by passing a vector to row.names. This vector will be converted to character, if it isn’t already that type:
data.frame(
  x = letters[1:5],
  y = y,
  z = runif(5) > 0.5,
  row.names = c("Jackie", "Tito", "Jermaine", "Marlon", "Michael")
)
##          x       y     z
## Jackie   a -0.9373  TRUE
## Tito     b  0.7314 FALSE
## Jermaine c -0.3030  TRUE
## Marlon   d -1.3307 FALSE
## Michael  e -0.6857 FALSE
The row names can be retrieved or changed at a later date, in the same manner as with matrices, using rownames (or row.names). Likewise, colnames and dimnames can be used to get or set the column and dimension names, respectively. In fact, more or less all the functions that can be used to inspect matrices can also be used with data frames. nrow, ncol, and dim also work in exactly the same way as they do in matrices:
rownames(a_data_frame)
## [1] "1" "2" "3" "4" "5"
colnames(a_data_frame)
## [1] "x" "y" "z"
dimnames(a_data_frame)
## [[1]]
## [1] "1" "2" "3" "4" "5"
##
## [[2]]
## [1] "x" "y" "z"
nrow(a_data_frame)
## [1] 5
ncol(a_data_frame)
## [1] 3
dim(a_data_frame)
## [1] 5 3
There are two quirks that you need to be aware of. First, length returns the same value as ncol, not the total number of elements in the data frame. Second, names returns the same value as colnames. For clarity of code, I recommend that you avoid these two functions, and use ncol and colnames instead:
length(a_data_frame)
## [1] 3
names(a_data_frame)
## [1] "x" "y" "z"
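If you do want the total number of elements, one option is to multiply the dimensions together. This is a minimal sketch using a made-up data frame named df.

```r
# Hypothetical data frame with 5 rows and 3 columns
df <- data.frame(x = 1:5, y = letters[1:5], z = rnorm(5))

length(df)      # the number of columns, the same as ncol(df)
prod(dim(df))   # the number of cells: rows times columns
```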
It is possible to create a data frame by passing different lengths of vectors, as long as the lengths allow the shorter ones to be recycled an exact number of times. More technically, the lowest common multiple of all the lengths must be equal to the length of the longest vector:
data.frame( #lengths 1, 2, and 4 are OK
  x = 1,    #recycled 4 times
  y = 2:3,  #recycled twice
  z = 4:7   #the longest input; no recycling
)
##   x y z
## 1 1 2 4
## 2 1 3 5
## 3 1 2 6
## 4 1 3 7
If the lengths are not compatible, then an error will be thrown:
data.frame( #lengths 1, 2, and 3 cause an error
  x = 1,    #lowest common multiple is 6, which is more than 3
  y = 2:3,
  z = 4:6
)
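The recycling rule can be checked before calling data.frame. The helper lengths_are_compatible below is a hypothetical name introduced for illustration, and the sketch assumes that every input has a nonzero length. It tests whether each length divides the longest length exactly.

```r
# Hypothetical helper: TRUE if every length divides the longest length.
# Assumes all inputs have nonzero length.
lengths_are_compatible <- function(...) {
  n <- lengths(list(...))
  all(max(n) %% n == 0)
}

lengths_are_compatible(1, 2:3, 4:7)   # lengths 1, 2, and 4
lengths_are_compatible(1, 2:3, 4:6)   # lengths 1, 2, and 3
```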
One other consideration when creating data frames is that by default the column names are checked to be unique, valid variable names. This feature can be turned off by passing check.names = FALSE to data.frame:
data.frame(
  "A column"   = letters[1:5],
  "!@#$%^&*()" = rnorm(5),
  "..."        = runif(5) > 0.5,
  check.names  = FALSE
)
##   A column !@#$%^&*()   ...
## 1        a    0.32940  TRUE
## 2        b   -1.81969  TRUE
## 3        c    0.22951 FALSE
## 4        d   -0.06705  TRUE
## 5        e   -1.58005  TRUE
In general, having nonstandard column names is a bad idea. Duplicating column names is even worse, since it can lead to hard-to-find bugs once you start taking subsets. Turn off the column name checking at your own peril.
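For comparison, this sketch shows what the default checking does to a nonstandard name. The base R function make.names performs the same kind of repair and can be called directly; the data frame checked is made up for illustration.

```r
# With the default check.names = TRUE, the space is replaced by a dot
checked <- data.frame("A column" = 1:2)
names(checked)

# make.names applies the same kind of repair to arbitrary strings
make.names("A column")

# unique = TRUE also resolves duplicates
make.names(c("id", "id"), unique = TRUE)
```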
There are lots of different ways of indexing a data frame. To start with, pairs of the four different vector indices (positive integers, negative integers, logical values, and characters) can be used in exactly the same way as with matrices. These commands both select the second and third elements of the first two columns:
a_data_frame[2:3, -3]
##   x       y
## 2 b 0.06894
## 3 c 0.74217
a_data_frame[c(FALSE, TRUE, TRUE, FALSE, FALSE), c("x", "y")]
##   x       y
## 2 b 0.06894
## 3 c 0.74217
Since more than one column was selected, the resultant subset is also a data frame. If only one column had been selected, the result would have been simplified to be a vector:
class(a_data_frame[2:3, -3])
## [1] "data.frame"
class(a_data_frame[2:3, 1])
## [1] "factor"
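If you would rather keep a data frame even when a single column is selected, the drop argument of the [ operator switches off the simplification. This sketch uses a made-up numeric data frame named df.

```r
# Hypothetical data frame, for illustration only
df <- data.frame(x = 1:5, y = (1:5) ^ 2)

class(df[2:3, 1])                 # simplified to a vector
class(df[2:3, 1, drop = FALSE])   # stays a data frame
```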
If we only want to select one column, then list-style indexing (double square brackets with a positive integer or name, or the dollar sign operator with a name) can also be used. These commands all select the second and third elements of the first column:
a_data_frame$x[2:3]
## [1] b c
## Levels: a b c d e
a_data_frame[[1]][2:3]
## [1] b c
## Levels: a b c d e
a_data_frame[["x"]][2:3]
## [1] b c
## Levels: a b c d e
If we are trying to subset a data frame by placing conditions on columns, the syntax can get a bit clunky, and the subset function provides a cleaner alternative. subset takes up to three arguments: a data frame to subset, a logical vector of conditions for rows to include, and a vector of column names to keep (if this last argument is omitted, then all the columns are kept). The genius of subset is that it uses special evaluation techniques to let you avoid doing some typing: instead of you having to type a_data_frame$y to access the y column of a_data_frame, it already knows which data frame to look in, so you can just type y. Likewise, when selecting columns, you don’t need to enclose the names of the columns in quotes; you can just type the names directly. In this next example, recall that | is the operator for logical or:
a_data_frame[a_data_frame$y > 0 | a_data_frame$z, "x"]
## [1] a b c d e
## Levels: a b c d e
subset(a_data_frame, y > 0 | z, x)
##   x
## 1 a
## 2 b
## 3 c
## 4 d
## 5 e
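Because a_data_frame contains random numbers, here is the same idea on fixed data, so that the result is predictable. The data frame df is made up for illustration, and the column argument is named explicitly as select, which is how subset documents it.

```r
# Hypothetical data frame with fixed values
df <- data.frame(
  x = letters[1:4],
  y = c(-1, 2, -3, 4),
  z = c(TRUE, FALSE, FALSE, TRUE)
)

# Keep rows where y is positive or z is TRUE; keep only columns x and y
kept <- subset(df, y > 0 | z, select = c(x, y))
kept
```

Only the third row fails both conditions, so it is the one that is dropped. The help page for subset describes it as a convenience function intended for interactive use, so standard bracket indexing may be the safer choice inside your own functions.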
Like matrices, data frames can be transposed using the t function, but in the process all the columns (which become rows) are converted to the same type, and the whole thing becomes a matrix:
t(a_data_frame)
##   [,1]       [,2]       [,3]       [,4]       [,5]
## x "a"        "b"        "c"        "d"        "e"
## y " 0.17581" " 0.06894" " 0.74217" " 0.72816" "-0.28940"
## z "TRUE"     "TRUE"     "TRUE"     "TRUE"     "TRUE"
Data frames can also be joined together using cbind and rbind, assuming that they have the appropriate sizes. rbind is smart enough to reorder the columns to match. cbind doesn’t check column names for duplicates, though, so be careful with it:
another_data_frame <- data.frame( #same cols as a_data_frame, different order
  z = rlnorm(5),     #lognormally distributed numbers
  y = sample(5),     #the numbers 1 to 5, in some order
  x = letters[3:7]
)
rbind(a_data_frame, another_data_frame)
##    x        y      z
## 1  a  0.17581 1.0000
## 2  b  0.06894 1.0000
## 3  c  0.74217 1.0000
## 4  d  0.72816 1.0000
## 5  e -0.28940 1.0000
## 6  c  1.00000 0.8714
## 7  d  3.00000 0.2432
## 8  e  5.00000 2.3498
## 9  f  4.00000 2.0263
## 10 g  2.00000 1.7145
cbind(a_data_frame, another_data_frame)
##   x        y    z      z y x
## 1 a  0.17581 TRUE 0.8714 1 c
## 2 b  0.06894 TRUE 0.2432 3 d
## 3 c  0.74217 TRUE 2.3498 5 e
## 4 d  0.72816 TRUE 2.0263 4 f
## 5 e -0.28940 TRUE 1.7145 2 g
Where two data frames share columns, they can be merged together using the merge function. merge provides a variety of options for doing database-style joins. To join two data frames, you need to specify which columns contain the key values to match up. By default, the merge function uses all the common columns from the two data frames, but more commonly you will just want to use a single shared ID column. In the following examples, we specify that the x column contains our IDs using the by argument:
merge(a_data_frame, another_data_frame, by = "x")
##   x     y.x  z.x    z.y y.y
## 1 c  0.7422 TRUE 0.8714   1
## 2 d  0.7282 TRUE 0.2432   3
## 3 e -0.2894 TRUE 2.3498   5
merge(a_data_frame, another_data_frame, by = "x", all = TRUE)
##   x      y.x  z.x    z.y y.y
## 1 a  0.17581 TRUE     NA  NA
## 2 b  0.06894 TRUE     NA  NA
## 3 c  0.74217 TRUE 0.8714   1
## 4 d  0.72816 TRUE 0.2432   3
## 5 e -0.28940 TRUE 2.3498   5
## 6 f       NA   NA 2.0263   4
## 7 g       NA   NA 1.7145   2
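Between keeping only the matching rows and keeping everything, merge also supports one-sided joins through its all.x and all.y arguments. The sketch below uses two small made-up data frames, left and right, with fixed values so that the row counts are predictable.

```r
# Hypothetical data frames sharing an id column
left  <- data.frame(id = c("a", "b", "c"), score = c(1, 2, 3))
right <- data.frame(id = c("b", "c", "d"), group = c("x", "y", "z"))

inner     <- merge(left, right, by = "id")                 # ids in both
left_join <- merge(left, right, by = "id", all.x = TRUE)   # every row of left
full_join <- merge(left, right, by = "id", all = TRUE)     # every row of either

nrow(inner)
nrow(left_join)
nrow(full_join)
```

Rows without a match on the other side are filled with NA, as in the all = TRUE example above.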
Where a data frame has all numeric values, the functions colSums and colMeans can be used to calculate the sums and means of each column, respectively. Similarly, rowSums and rowMeans calculate the sums and means of each row:
colSums(a_data_frame[, 2:3])
##     y     z
## 1.426 5.000
colMeans(a_data_frame[, 2:3])
##      y      z
## 0.2851 1.0000
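The row-wise functions work the same way. Here is a minimal sketch on a made-up, all-numeric data frame named df.

```r
# Hypothetical all-numeric data frame
df <- data.frame(a = 1:3, b = c(10, 20, 30))

row_totals   <- rowSums(df)    # one sum per row
row_averages <- rowMeans(df)   # one mean per row

row_totals
row_averages
```

The results are named by the row names of the data frame, just as the column versions are named by column.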
Manipulating data frames is a huge topic, and is covered in more depth in Chapter 13.
Lists and data frames can be indexed using [], [[]], or $.
NULL is a special value that can be used to create “empty” list elements.
merge lets you do database-style joins on data frames.
list(
  alpha = 1,
  list(
    beta  = 2,
    gamma = 3,
    delta = 4
  ),
  eta = NULL
)
## $alpha
## [1] 1
##
## [[2]]
## [[2]]$beta
## [1] 2
##
## [[2]]$gamma
## [1] 3
##
## [[2]]$delta
## [1] 4
##
##
## $eta
## NULL
Type iris to see the dataset.
Create a new data frame that consists of the numeric columns of the iris dataset, and calculate the means of its columns. [5]
The beaver1 and beaver2 datasets contain body temperatures of two beavers. Add a column named id to the beaver1 dataset, where the value is always 1. Similarly, add an id column to beaver2, with value 2. Vertically concatenate the two data frames and find the subset where either beaver is active. [10]