A data frame represents a set of data with a number of rows and columns. It looks like a matrix but its columns are not necessarily of the same type. This is consistent with the most commonly seen formats of datasets: each row, or data record, is described by multiple columns of various types.
The following table is an example that can be fully characterized by a data frame.
Name     | Gender | Age | Major
Ken      | Male   | 24  | Finance
Ashley   | Female | 25  | Statistics
Jennifer | Female | 23  | Computer Science
To create a data frame, we can call data.frame() and supply the data of each column by a vector of the corresponding type:
persons <- data.frame(Name = c("Ken", "Ashley", "Jennifer"),
  Gender = c("Male", "Female", "Female"),
  Age = c(24, 25, 23),
  Major = c("Finance", "Statistics", "Computer Science"))
persons
##       Name Gender Age            Major
## 1      Ken   Male  24          Finance
## 2   Ashley Female  25       Statistics
## 3 Jennifer Female  23 Computer Science
Note that creating a data frame looks much like creating a list. This is because, in essence, a data frame is a list in which each element is a vector that represents a table column, and all elements have the same length.
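The list nature of a data frame can be checked directly. The following is a minimal sketch; the data frame students is a hypothetical example introduced only for this illustration:

```r
# Hypothetical data frame used only for this illustration
students <- data.frame(
  name = c("Ken", "Ashley", "Jennifer"),
  age = c(24, 25, 23)
)

is.list(students)         # TRUE: a data frame is a list
typeof(students)          # "list": the underlying storage type
class(students)           # "data.frame": the class built on top of the list
length(students)          # number of list elements, that is, columns
sapply(students, length)  # every column has the same number of elements
```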
Other than creating a data frame from raw data, we can also create it from a list by calling either data.frame() directly or as.data.frame():
l1 <- list(x = c(1, 2, 3), y = c("a", "b", "c"))
data.frame(l1)
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
as.data.frame(l1)
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
We can also create a data frame from a matrix with the same method:
m1 <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3, byrow = FALSE)
data.frame(m1)
##   X1 X2 X3
## 1  1  4  7
## 2  2  5  8
## 3  3  6  9
as.data.frame(m1)
##   V1 V2 V3
## 1  1  4  7
## 2  2  5  8
## 3  3  6  9
Note that the conversion also automatically assigns column names to the new data frame. In fact, as you may verify, if the matrix already has column names or row names, they will be preserved in the conversion.
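To verify that existing names are preserved, we can convert a matrix that already carries them. This is a small sketch; the matrix m2 and its row and column names are made up for the illustration:

```r
# Hypothetical matrix with explicit row and column names
m2 <- matrix(1:6, nrow = 2,
  dimnames = list(c("r1", "r2"), c("a", "b", "c")))

d2 <- as.data.frame(m2)
colnames(d2)  # "a" "b" "c"
rownames(d2)  # "r1" "r2"
```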
Since a data frame is a list but also looks like a matrix, the ways we access these two types of objects both apply to a data frame:
df1 <- data.frame(id = 1:5, x = c(0, 2, 1, -1, -3), y = c(0.5, 0.2, 0.1, 0.5, 0.9))
df1
##   id  x   y
## 1  1  0 0.5
## 2  2  2 0.2
## 3  3  1 0.1
## 4  4 -1 0.5
## 5  5 -3 0.9
We can rename the columns and rows just like we do with a matrix:
colnames(df1) <- c("id", "level", "score")
rownames(df1) <- letters[1:5]
df1
##   id level score
## a  1     0   0.5
## b  2     2   0.2
## c  3     1   0.1
## d  4    -1   0.5
## e  5    -3   0.9
Since a data frame is a matrix-like list of column vectors, we can use both sets of notations to access the elements and subsets in a data frame.
If we would like to regard a data frame as a list of vectors, we can use list notations to extract a value or perform subsetting.
For example, we can use $ to extract the values of one column by name, or use [[ to do so by position:
df1$id
## [1] 1 2 3 4 5
df1[[1]]
## [1] 1 2 3 4 5
List subsetting perfectly applies to a data frame and also yields a new data frame. The subsetting operator ([) allows us to use a numeric vector to extract columns by position, a character vector to extract columns by name, or a logical vector to extract columns by TRUE and FALSE selection:
df1[1]
##   id
## a  1
## b  2
## c  3
## d  4
## e  5
df1[1:2]
##   id level
## a  1     0
## b  2     2
## c  3     1
## d  4    -1
## e  5    -3
df1["level"]
##   level
## a     0
## b     2
## c     1
## d    -1
## e    -3
df1[c("id", "score")]
##   id score
## a  1   0.5
## b  2   0.2
## c  3   0.1
## d  4   0.5
## e  5   0.9
df1[c(TRUE, FALSE, TRUE)]
##   id score
## a  1   0.5
## b  2   0.2
## c  3   0.1
## d  4   0.5
## e  5   0.9
However, the list notation does not support row selection. In contrast, the matrix notation provides more flexibility. If we view a data frame as a matrix, the two-dimensional accessor enables us to easily access an entry or a subset, and it supports both column selection and row selection.
In other words, we can use the [row, column] notation to subset a data frame by specifying the row selector and column selector, which can be numeric vectors, character vectors, and/or logical vectors.
For example, we can specify the column selector:
df1[, "level"]
## [1]  0  2  1 -1 -3
df1[, c("id", "level")]
##   id level
## a  1     0
## b  2     2
## c  3     1
## d  4    -1
## e  5    -3
df1[, 1:2]
##   id level
## a  1     0
## b  2     2
## c  3     1
## d  4    -1
## e  5    -3
Alternatively, we can specify the row selector:
df1[1:4,]
##   id level score
## a  1     0   0.5
## b  2     2   0.2
## c  3     1   0.1
## d  4    -1   0.5
df1[c("c", "e"),]
##   id level score
## c  3     1   0.1
## e  5    -3   0.9
We can even specify both selectors at the same time:
df1[1:4, "id"]
## [1] 1 2 3 4
df1[1:3, c("id", "score")]
##   id score
## a  1   0.5
## b  2   0.2
## c  3   0.1
Note that the matrix notation automatically simplifies the output. That is, if only one column is selected, the result won't be a data frame but the values of that column. To always keep the result as a data frame, even if it only has a single column, we can use both notations together:
df1[1:4,]["id"]
##   id
## a  1
## b  2
## c  3
## d  4
Here, the first group of brackets subsets the data frame as a matrix with the first four rows and all columns selected. The second group of brackets subsets the resultant data frame as a list with only the id column selected, which results in a data frame.
Another way is to specify drop = FALSE to avoid simplifying the results:
df1[1:4, "id", drop = FALSE]
##   id
## a  1
## b  2
## c  3
## d  4
If you expect the output of data frame subsetting to always be a data frame, you should always set drop = FALSE; otherwise, some edge cases (like a user input selecting only one column) may lead to unexpected behavior, because the code assumes a data frame but actually gets a vector.
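The edge case is easiest to see when the column selection is only known at run time. The sketch below uses a hypothetical helper, select_columns(), and a made-up data frame, scores; neither comes from the text above:

```r
# Made-up data for the illustration
scores <- data.frame(id = 1:3, level = c(0, 2, 1), score = c(0.5, 0.2, 0.1))

# Hypothetical helper: the columns argument may name one or many columns
select_columns <- function(data, columns) {
  data[, columns, drop = FALSE]  # never simplify to a vector
}

one <- select_columns(scores, "id")
two <- select_columns(scores, c("id", "score"))

is.data.frame(one)             # TRUE, even with a single column
is.data.frame(two)             # TRUE
is.data.frame(scores[, "id"])  # FALSE: without drop = FALSE we get a vector
```

With drop = FALSE in one place, any code that later calls ncol() or colnames() on the result behaves the same for one column as for several.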
The following code filters the rows of df1 by the criterion score >= 0.5 and selects the id and level columns:
df1$score >= 0.5
## [1]  TRUE FALSE FALSE  TRUE  TRUE
df1[df1$score >= 0.5, c("id", "level")]
##   id level
## a  1     0
## d  4    -1
## e  5    -3
The following code filters the rows of df1 by a criterion that the row name must be among a, d, or e, and selects the id and score columns:
rownames(df1) %in% c("a", "d", "e")
## [1]  TRUE FALSE FALSE  TRUE  TRUE
df1[rownames(df1) %in% c("a", "d", "e"), c("id", "score")]
##   id score
## a  1   0.5
## d  4   0.5
## e  5   0.9
Both of these examples basically use matrix notation to select rows by a logical vector and select columns by a character vector.
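Because the row selector is just a logical vector, several criteria can be combined with & or | before subsetting. The following sketch rebuilds a data frame with the same values under the hypothetical name records so that it runs on its own:

```r
# Same values as the renamed df1 above, under a hypothetical name
records <- data.frame(
  id = 1:5,
  level = c(0, 2, 1, -1, -3),
  score = c(0.5, 0.2, 0.1, 0.5, 0.9),
  row.names = letters[1:5]
)

# Keep rows with a score of at least 0.5 AND a negative level
keep <- records$score >= 0.5 & records$level < 0
result <- records[keep, c("id", "score")]
result  # rows d and e
```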
Both the list notation and the matrix notation can also be used to set the values of a subset of a data frame.
We can assign new values to a list member using $ and <- together:
df1$score <- c(0.6, 0.3, 0.2, 0.4, 0.8)
df1
##   id level score
## a  1     0   0.6
## b  2     2   0.3
## c  3     1   0.2
## d  4    -1   0.4
## e  5    -3   0.8
Alternatively, [ works too, and it also allows multiple changes in one expression in contrast to [[, which only allows modifying one column at a time:
df1["score"] <- c(0.8, 0.5, 0.2, 0.4, 0.8)
df1
##   id level score
## a  1     0   0.8
## b  2     2   0.5
## c  3     1   0.2
## d  4    -1   0.4
## e  5    -3   0.8
df1[["score"]] <- c(0.4, 0.5, 0.2, 0.8, 0.4)
df1
##   id level score
## a  1     0   0.4
## b  2     2   0.5
## c  3     1   0.2
## d  4    -1   0.8
## e  5    -3   0.4
df1[c("level", "score")] <- list(level = c(1, 2, 1, 0, 0), score = c(0.1, 0.2, 0.3, 0.4, 0.5))
df1
##   id level score
## a  1     1   0.1
## b  2     2   0.2
## c  3     1   0.3
## d  4     0   0.4
## e  5     0   0.5
Using list notations to set values of a data frame has the same limitation as subsetting: we can only access the columns. If we need to set values with more flexibility, we can use matrix notations:
df1[1:3, "level"] <- c(-1, 0, 1)
df1
##   id level score
## a  1    -1   0.1
## b  2     0   0.2
## c  3     1   0.3
## d  4     0   0.4
## e  5     0   0.5
df1[1:2, c("level", "score")] <- list(level = c(0, 0), score = c(0.9, 1.0))
df1
##   id level score
## a  1     0   0.9
## b  2     0   1.0
## c  3     1   0.3
## d  4     0   0.4
## e  5     0   0.5
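The row selector on the left-hand side of an assignment can also be a logical vector, which gives a conditional update. This is a minimal sketch on a made-up data frame named records, not a continuation of the df1 session above:

```r
# Made-up data for the illustration
records <- data.frame(
  id = 1:5,
  level = c(1, 2, 1, 0, 0),
  score = c(0.1, 0.2, 0.3, 0.4, 0.5)
)

# Reset level to 0 for every row whose score is below 0.3
records[records$score < 0.3, "level"] <- 0
records$level  # 0 0 1 0 0
```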
One thing to notice is that the default behavior of a data frame tries to use memory more efficiently. Sometimes, this behavior might silently lead to unexpected problems.
For example, in versions of R before 4.0.0, when we create a data frame by supplying a character vector as a column, data.frame() by default converts the character vector to a factor, which stores each distinct value only once so that repetitions do not cost much memory. (Since R 4.0.0, character columns are kept as they are unless the conversion is requested.) In fact, a factor is essentially an integer vector with a pre-specified set of possible values, called levels, to represent values of limited possibilities.
We can verify this by calling str() on the data frame persons we created in the beginning:
str(persons)
## 'data.frame':    3 obs. of  4 variables:
##  $ Name  : Factor w/ 3 levels "Ashley","Jennifer",..: 3 1 2
##  $ Gender: Factor w/ 2 levels "Female","Male": 2 1 1
##  $ Age   : num  24 25 23
##  $ Major : Factor w/ 3 levels "Computer Science",..: 2 3 1
As we can see, Name, Gender, and Major are not character vectors but factor objects. It is reasonable that Gender is represented by a factor because it may only be either Female or Male, so using two integers to represent these two values is more efficient than using a character vector to store all the values regardless of the repetition.
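The integer representation behind a factor can be inspected directly. The following sketch builds a standalone factor with the same values as the Gender column; the variable name gender is chosen only for this illustration:

```r
gender <- factor(c("Male", "Female", "Female"))

levels(gender)        # "Female" "Male": the sorted unique values
as.integer(gender)    # 2 1 1: positions in the levels, as shown by str()
typeof(gender)        # "integer": the underlying storage type
as.character(gender)  # back to the original strings
```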
However, the conversion may cause problems for columns that are not limited to a few possible values. For example, if we want to set a name in persons:
persons[1, "Name"] <- "John"
## Warning in `[<-.factor`(`*tmp*`, iseq, value = "John"): invalid factor
## level, NA generated
persons
##       Name Gender Age            Major
## 1     <NA>   Male  24          Finance
## 2   Ashley Female  25       Statistics
## 3 Jennifer Female  23 Computer Science
A warning message appears. This happens because the initial Name dictionary contains no entry called John, so we cannot set the name of the first person to such a non-existing value. The same thing happens when we set any Gender to be Unknown. The reason is exactly the same: when a column is converted from a character vector to a factor at the time the data frame is defined, its values must be taken from the dictionary created from the unique values in that character vector.
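If a column is already a factor, there are two common ways to store a value outside its dictionary: extend the levels first, or convert the column to a character vector. The sketch below works on a standalone factor rather than on persons, and the variable names are assumptions made for the illustration:

```r
# Reproduce the problem on a standalone factor
bad <- factor(c("Ken", "Ashley", "Jennifer"))
suppressWarnings(bad[1] <- "John")  # same warning as above, silenced here
is.na(bad[1])                       # TRUE: the value was replaced by NA

# Option 1: add the new level before assigning
name <- factor(c("Ken", "Ashley", "Jennifer"))
levels(name) <- c(levels(name), "John")
name[1] <- "John"

# Option 2: drop the factor and work with plain character values
name_chr <- as.character(factor(c("Ken", "Ashley", "Jennifer")))
name_chr[1] <- "John"
```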
This behavior is sometimes very annoying and does not really help much, especially as memory is cheap today. The simplest way to avoid it, in any version of R, is to set stringsAsFactors = FALSE explicitly when we create a data frame using data.frame():
persons <- data.frame(Name = c("Ken", "Ashley", "Jennifer"),
  Gender = factor(c("Male", "Female", "Female")),
  Age = c(24, 25, 23),
  Major = c("Finance", "Statistics", "Computer Science"),
  stringsAsFactors = FALSE)
str(persons)
## 'data.frame':    3 obs. of  4 variables:
##  $ Name  : chr  "Ken" "Ashley" "Jennifer"
##  $ Gender: Factor w/ 2 levels "Female","Male": 2 1 1
##  $ Age   : num  24 25 23
##  $ Major : chr  "Finance" "Statistics" "Computer Science"
If we really want a factor object to play its role, we can explicitly call factor() at specific columns, just like we did previously for the Gender column.
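When calling factor() explicitly, we can also declare the full set of allowed values up front through its levels argument, so that values not yet present in the data remain valid later. This is a sketch; including Unknown as a level is an assumption made for the illustration:

```r
# Declare all allowed values, including one that does not occur yet
gender <- factor(c("Male", "Female", "Female"),
  levels = c("Female", "Male", "Unknown"))

gender[2] <- "Unknown"  # no warning: "Unknown" is a declared level
table(gender)           # counts per level
```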
There are many useful functions for a data frame. Here we only introduce a few of the most commonly used ones.
The summary() function works with a data frame by generating a table that shows the summary statistics of each column:
summary(persons)
##      Name              Gender       Age          Major
##  Length:3           Female:2   Min.   :23.0   Length:3
##  Class :character   Male  :1   1st Qu.:23.5   Class :character
##  Mode  :character              Median :24.0   Mode  :character
##                                Mean   :24.0
##                                3rd Qu.:24.5
##                                Max.   :25.0
For a factor such as Gender, the summary counts the number of rows taking each value, or level. For a numeric vector, the summary shows the important quantiles of the numbers along with the mean. For other types of columns, it shows their length, class, and mode.
Another common need is binding multiple data frames together by either row or column. For this purpose, we can use rbind() and cbind() which, as their names suggest, perform row binding and column binding respectively.
If we want to append some rows to a data frame, in this case, add a new record of a person, we can use rbind():
rbind(persons, data.frame(Name = "John", Gender = "Male", Age = 25, Major = "Statistics"))
##       Name Gender Age            Major
## 1      Ken   Male  24          Finance
## 2   Ashley Female  25       Statistics
## 3 Jennifer Female  23 Computer Science
## 4     John   Male  25       Statistics
If we want to append some columns to a data frame, in this case, add two new columns to indicate whether each person is registered and the number of projects in hand, we can use cbind():
cbind(persons, Registered = c(TRUE, TRUE, FALSE), Projects = c(3, 2, 3))
##       Name Gender Age            Major Registered Projects
## 1      Ken   Male  24          Finance       TRUE        3
## 2   Ashley Female  25       Statistics       TRUE        2
## 3 Jennifer Female  23 Computer Science      FALSE        3
Note that rbind() and cbind() do not modify the original data but create a new data frame with the given rows or columns appended.
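To keep the appended rows or columns, the result therefore has to be assigned to a variable. A minimal sketch, using a small made-up data frame named people:

```r
# Made-up data for the illustration
people <- data.frame(Name = c("Ken", "Ashley"), Age = c(24, 25),
  stringsAsFactors = FALSE)

# rbind() returns a new data frame; people itself is unchanged
extended <- rbind(people,
  data.frame(Name = "John", Age = 25, stringsAsFactors = FALSE))
nrow(people)    # 2
nrow(extended)  # 3

# Assign the result back to the same name to update it
people <- cbind(people, Registered = c(TRUE, FALSE))
ncol(people)    # 3
```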
Another useful function is expand.grid(). This generates a data frame that includes all combinations of the values in the columns:
expand.grid(type = c("A", "B"), class = c("M", "L", "XL"))
##   type class
## 1    A     M
## 2    B     M
## 3    A     L
## 4    B     L
## 5    A    XL
## 6    B    XL
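Since every combination appears exactly once, the number of rows equals the product of the lengths of the supplied vectors. The sketch below checks this; passing stringsAsFactors = FALSE is an optional choice made here to keep the columns as character vectors:

```r
grid <- expand.grid(type = c("A", "B"), class = c("M", "L", "XL"),
  stringsAsFactors = FALSE)

nrow(grid) == 2 * 3   # TRUE: 2 types times 3 classes
anyDuplicated(grid)   # 0: no combination is repeated
```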
There are many other useful functions working with data frames. We will discuss these functions in data manipulation chapters.
In practice, data is usually stored in files. R provides a number of functions to read a table from a file or write a data frame to a file. If a file stores a table, it is often well organized and follows some convention that specifies how rows and columns are arranged. In most cases, we don't have to read a file byte by byte but can call functions such as read.table() or read.csv().
The most popular software-neutral data format is CSV (Comma-Separated Values). The format is basically organized in a way that values in different columns are separated by a comma and the first row is by default regarded as the header. For example, persons may be represented in the following CSV format:
Name,Gender,Age,Major
Ken,Male,24,Finance
Ashley,Female,25,Statistics
Jennifer,Female,23,Computer Science
To read the data into the R environment, we only need to call read.csv(file) where file is the path of the file. To ensure that the data file can be found, please place the data folder directly in your working directory; call getwd() to find out where that is. We'll talk about this in detail in the next chapter:
read.csv("data/persons.csv")
##       Name Gender Age            Major
## 1      Ken   Male  24          Finance
## 2   Ashley Female  25       Statistics
## 3 Jennifer Female  23 Computer Science
If we need to save a data frame to a CSV file, we may call write.csv() with the data frame, the file path, and some additional arguments:
write.csv(persons, "data/persons.csv", row.names = FALSE, quote = FALSE)
The argument row.names = FALSE avoids storing the row names, and the argument quote = FALSE avoids quoting text in the output. In most cases, neither the row names nor the quotes are necessary, although quotes do matter when a text value itself contains a comma.
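Writing and reading can be tried together as a round trip. The sketch below writes to a temporary file created with tempfile() instead of data/persons.csv, so it does not depend on the working directory; the data frame people is made up for the illustration:

```r
# Made-up data for the illustration
people <- data.frame(
  Name = c("Ken", "Ashley", "Jennifer"),
  Age = c(24, 25, 23),
  Major = c("Finance", "Statistics", "Computer Science"),
  stringsAsFactors = FALSE
)

path <- tempfile(fileext = ".csv")          # temporary file path
write.csv(people, path, row.names = FALSE)  # keep the default quoting
restored <- read.csv(path, stringsAsFactors = FALSE)
unlink(path)                                # remove the temporary file

str(restored)  # column types are inferred again when reading
```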
There are a number of built-in functions and several packages related to reading and writing data in different formats. We will cover this topic in later chapters.