Data frames

A data frame represents a set of data with a number of rows and columns. It looks like a matrix but its columns are not necessarily of the same type. This is consistent with the most commonly seen formats of datasets: each row, or data record, is described by multiple columns of various types.

The following table is an example that can be fully characterized by a data frame.

Name      Gender  Age  Major
Ken       Male    24   Finance
Ashley    Female  25   Statistics
Jennifer  Female  23   Computer Science

Creating a data frame

To create a data frame, we can call data.frame() and supply the data of each column by a vector of the corresponding type:

persons <- data.frame(Name = c("Ken", "Ashley", "Jennifer"),
  Gender = c("Male", "Female", "Female"),
  Age = c(24, 25, 23),
  Major = c("Finance", "Statistics", "Computer Science"))
persons
##   Name     Gender  Age  Major
## 1 Ken      Male    24   Finance
## 2 Ashley   Female  25   Statistics
## 3 Jennifer Female  23   Computer Science

Note that the syntax for creating a data frame is almost the same as that for creating a list. This is because, in essence, a data frame is a list in which each element is a vector that represents a table column, and all elements have the same length.
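
The list nature of a data frame can be checked directly. The following is a minimal sketch; demo_df is a hypothetical object introduced only for this illustration, and the tryCatch() wrapper is there only so that the failing call does not stop the script:

```r
# demo_df is a hypothetical example object, not part of the original data
demo_df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))

is.list(demo_df)   # TRUE: a data frame is a list underneath
typeof(demo_df)    # "list"
class(demo_df)     # "data.frame"
length(demo_df)    # 2: the number of columns (list elements)
nrow(demo_df)      # 3: the common length of the columns

# Columns whose lengths cannot be matched are rejected with an error
result <- tryCatch(
  data.frame(x = 1:3, y = 1:2),
  error = function(e) "error"
)
result             # "error"
```

The last call fails because a column of three values and a column of two values cannot form a rectangular table.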

Other than creating a data frame from raw data, we can also create it from a list by calling either data.frame directly or as.data.frame:

l1 <- list(x = c(1, 2, 3), y = c("a", "b", "c"))
data.frame(l1)
##   x y
## 1 1 a
## 2 2 b
## 3 3 c
as.data.frame(l1)
##   x y
## 1 1 a
## 2 2 b
## 3 3 c

We can also create a data frame from a matrix with the same method:

m1 <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3, byrow = FALSE)
data.frame(m1)
##   X1 X2 X3
## 1 1  4  7
## 2 2  5  8
## 3 3  6  9
as.data.frame(m1)
##   V1 V2 V3
## 1  1  4  7
## 2  2  5  8
## 3  3  6  9

Note that the conversion also automatically assigns column names to the new data frame. In fact, as you may verify, if the matrix already has column names or row names, they will be preserved in the conversion.
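
As a quick way to verify this, the following sketch converts a matrix that already carries row and column names. The matrix m2 and its names are made up for this illustration:

```r
# m2 is a hypothetical matrix with explicit row and column names
m2 <- matrix(1:6, nrow = 2,
  dimnames = list(c("r1", "r2"), c("a", "b", "c")))

df_a <- data.frame(m2)
df_b <- as.data.frame(m2)

colnames(df_a)   # "a" "b" "c"
rownames(df_a)   # "r1" "r2"
colnames(df_b)   # "a" "b" "c"
rownames(df_b)   # "r1" "r2"
```

In this case both functions keep the existing names instead of generating X1, X2, ... or V1, V2, ....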

Naming rows and columns

Since a data frame is a list but also looks like a matrix, the ways we access these two types of objects both apply to a data frame:

df1 <- data.frame(id = 1:5, x = c(0, 2, 1, -1, -3), y = c(0.5, 0.2, 0.1, 0.5, 0.9))
df1
##   id  x   y
## 1  1  0  0.5
## 2  2  2  0.2
## 3  3  1  0.1
## 4  4 -1  0.5
## 5  5 -3  0.9

We can rename the columns and rows just like we do with a matrix:

colnames(df1) <- c("id", "level", "score")
rownames(df1) <- letters[1:5]
df1
##    id level score
## a   1    0    0.5
## b   2    2    0.2
## c   3    1    0.1
## d   4   -1    0.5
## e   5   -3    0.9
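
When only one column needs a new name, replacing the whole name vector is not necessary. The following sketch, which uses a hypothetical demo_df so as not to disturb df1, renames a single column by matching its current name:

```r
# demo_df is a hypothetical example object
demo_df <- data.frame(id = 1:3, x = c(0, 2, 1))

# names() and colnames() refer to the same thing for a data frame
names(demo_df)[names(demo_df) == "x"] <- "level"
names(demo_df)      # "id" "level"
colnames(demo_df)   # "id" "level"
```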

Subsetting a data frame

Since a data frame is a matrix-like list of column vectors, we can use both sets of notations to access the elements and subsets in a data frame.

Subsetting a data frame as a list

If we would like to regard a data frame as a list of vectors, we can use list notations to extract a value or perform subsetting.

For example, we can use $ to extract the values of one column by name, or use [[ to do so by position:

df1$id
## [1] 1 2 3 4 5
df1[[1]]
## [1] 1 2 3 4 5

List subsetting perfectly applies to a data frame and also yields a new data frame. The subsetting operator ([) allows us to use a numeric vector to extract columns by position, a character vector to extract columns by name, or a logical vector to extract columns by TRUE and FALSE selection:

df1[1]
##  id
## a 1
## b 2
## c 3
## d 4
## e 5
df1[1:2]
##  id level
## a 1  0
## b 2  2
## c 3  1
## d 4 -1
## e 5 -3
df1["level"]
##  level
## a  0
## b  2
## c  1
## d -1
## e -3
df1[c("id", "score")]
##  id score
## a 1  0.5
## b 2  0.2
## c 3  0.1
## d 4  0.5
## e 5  0.9
df1[c(TRUE, FALSE, TRUE)]
##   id score
## a 1  0.5
## b 2  0.2
## c 3  0.1
## d 4  0.5
## e 5  0.9
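
The difference between [ and [[ is easy to overlook, so here is a small sketch that compares the class of each result. It uses a hypothetical demo_df rather than df1:

```r
# demo_df is a hypothetical example object
demo_df <- data.frame(id = 1:3, score = c(0.5, 0.2, 0.1))

class(demo_df["score"])     # "data.frame": [ keeps the list structure
class(demo_df[["score"]])   # "numeric": [[ extracts the column itself
class(demo_df$score)        # "numeric": $ behaves like [[ by name
```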

Subsetting a data frame as a matrix

However, the list notation does not support row selection. In contrast, the matrix notation provides more flexibility. If we view a data frame as a matrix, the two-dimensional accessor enables us to easily access an entry or a subset, with support for both column selection and row selection.

In other words, we can use the [row, column] notation to subset a data frame by specifying the row selector and column selector, which can be numeric vectors, character vectors, and/or logical vectors.

For example, we can specify the column selector:

df1[, "level"]
## [1] 0 2 1 -1 -3
df1[, c("id", "level")]
##  id level
## a 1  0
## b 2  2
## c 3  1
## d 4 -1
## e 5 -3
df1[, 1:2]
##  id level
## a 1  0
## b 2  2
## c 3  1
## d 4 -1
## e 5 -3

Alternatively, we can specify the row selector:

df1[1:4,]
##   id level score
## a  1   0    0.5
## b  2   2    0.2
## c  3   1    0.1
## d  4  -1    0.5
df1[c("c", "e"),]
##   id level score
## c  3   1    0.1
## e  5  -3    0.9

We can even specify both selectors at the same time:

df1[1:4, "id"]
## [1] 1 2 3 4
df1[1:3, c("id", "score")]
##   id score
## a  1  0.5
## b  2  0.2
## c  3  0.1

Note that the matrix notation automatically simplifies the output. That is, if only one column is selected, the result won't be a data frame but the values of that column. To always keep the result as a data frame, even if it only has a single column, we can use both notations together:

df1[1:4,]["id"]
##   id
## a 1
## b 2
## c 3
## d 4

Here, the first group of brackets subsets the data frame as a matrix with the first four rows and all columns selected. The second group of brackets subsets the resultant data frame as a list with only the id column selected, which results in a data frame.

Another way is to specify drop = FALSE to avoid simplifying the results:

df1[1:4, "id", drop = FALSE]
##   id
## a 1
## b 2
## c 3
## d 4

If you expect the output of a data frame subsetting to always be a data frame, you should always set drop = FALSE; otherwise, some edge cases (like a user input selecting only one column) may end up in unexpected behaviors if you assume that you will get a data frame but actually get a vector.
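
To make the edge case concrete, consider the following sketch. The helper functions pick_columns() and pick_columns_safe() and the data frame demo_df are hypothetical names introduced only for this illustration:

```r
# demo_df and both helper functions are hypothetical examples
demo_df <- data.frame(id = 1:3, level = c(0, 2, 1), score = c(0.5, 0.2, 0.1))

pick_columns <- function(df, cols) df[, cols]
pick_columns_safe <- function(df, cols) df[, cols, drop = FALSE]

nrow(pick_columns(demo_df, c("id", "score")))   # 3: still a data frame
nrow(pick_columns(demo_df, "id"))               # NULL: simplified to a vector
nrow(pick_columns_safe(demo_df, "id"))          # 3: remains a data frame
```

Code that later calls nrow(), colnames(), or $ on the result works for any column selection only with the second version.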

Filtering data

The following code filters the rows of df1 by the criterion score >= 0.5 and selects the id and level columns:

df1$score >= 0.5
## [1] TRUE FALSE FALSE TRUE TRUE
df1[df1$score >= 0.5, c("id", "level")]
##   id level
## a  1   0
## d  4  -1
## e  5  -3

The following code filters the rows of df1 by a criterion that the row name must be among a, d, or e, and selects the id and score columns:

rownames(df1) %in% c("a", "d", "e")
## [1] TRUE FALSE FALSE TRUE TRUE
df1[rownames(df1) %in% c("a", "d", "e"), c("id", "score")]
##   id score
## a  1  0.5
## d  4  0.5
## e  5  0.9

Both of these examples basically use matrix notation to select rows by a logical vector and select columns by a character vector.
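
Because the row selector is just a logical vector, several criteria can be combined with & (and) and | (or). The following sketch rebuilds the same values in a hypothetical demo_df so that it runs on its own:

```r
# demo_df is a hypothetical copy of the values used above
demo_df <- data.frame(id = 1:5,
  level = c(0, 2, 1, -1, -3),
  score = c(0.5, 0.2, 0.1, 0.5, 0.9))

# Rows with score >= 0.5 AND a negative level
keep_both <- demo_df$score >= 0.5 & demo_df$level < 0
keep_both                       # FALSE FALSE FALSE TRUE TRUE
demo_df[keep_both, "id"]        # 4 5

# Rows with score < 0.2 OR level > 1
keep_either <- demo_df$score < 0.2 | demo_df$level > 1
demo_df[keep_either, "id"]      # 2 3
```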

Setting values

When setting the values of a subset of a data frame, we can use both the list notation and the matrix notation.

Setting values as a list

We can assign new values to a list member using $ and <- together:

df1$score <- c(0.6, 0.3, 0.2, 0.4, 0.8)
df1
##   id level score
## a 1    0    0.6
## b 2    2    0.3
## c 3    1    0.2
## d 4   -1    0.4
## e 5   -3    0.8

Alternatively, [ works too, and it also allows multiple changes in one expression in contrast to [[, which only allows modifying one column at a time:

df1["score"] <- c(0.8, 0.5, 0.2, 0.4, 0.8)
df1
##   id level score
## a 1   0     0.8
## b 2   2     0.5
## c 3   1     0.2
## d 4  -1     0.4
## e 5  -3     0.8
df1[["score"]] <- c(0.4, 0.5, 0.2, 0.8, 0.4)
df1
##   id level score
## a 1   0     0.4
## b 2   2     0.5
## c 3   1     0.2
## d 4  -1     0.8
## e 5  -3     0.4
df1[c("level", "score")] <- list(level = c(1, 2, 1, 0, 0), score = c(0.1, 0.2, 0.3, 0.4, 0.5))
df1
##   id level score
## a 1    1    0.1
## b 2    2    0.2
## c 3    1    0.3
## d 4    0    0.4
## e 5    0    0.5

Setting values as a matrix

Using list notations to set values of a data frame has the same limitation as subsetting: we can only access whole columns. If we need to set values with more flexibility, we can use matrix notations:

df1[1:3, "level"] <- c(-1, 0, 1)
df1
##   id level score
## a 1   -1   0.1
## b 2   0    0.2
## c 3   1    0.3
## d 4   0    0.4
## e 5   0    0.5
df1[1:2, c("level", "score")] <- list(level = c(0, 0), score = c(0.9, 1.0))
df1
##   id level score
## a 1   0    0.9
## b 2   0    1.0
## c 3   1    0.3
## d 4   0    0.4
## e 5   0    0.5
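
The matrix notation also accepts a logical row selector on the left-hand side, which allows values to be set conditionally. The following is a minimal sketch on a hypothetical demo_df:

```r
# demo_df is a hypothetical example object
demo_df <- data.frame(id = 1:5,
  level = c(1, 2, 1, 0, 0),
  score = c(0.1, 0.2, 0.3, 0.4, 0.5))

# Reset level to 0 for every row whose score is below 0.3
demo_df[demo_df$score < 0.3, "level"] <- 0
demo_df$level   # 0 0 1 0 0
```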

Factors

One thing to notice is that the default behavior of a data frame tries to use memory more efficiently. Sometimes, this behavior might silently lead to unexpected problems.

For example, when we create a data frame by supplying a character vector as a column, data.frame() in versions of R before 4.0.0 will by default convert the character vector to a factor, which stores each distinct value only once so that repetitions do not cost much memory. (Since R 4.0.0, the default is stringsAsFactors = FALSE, so character columns stay character vectors unless we ask otherwise.) In fact, a factor is essentially an integer vector with a pre-specified set of possible values, called levels, to represent values of limited possibilities.

We can verify this by calling str() on the data frame persons we created in the beginning:

str(persons)
## 'data.frame': 3 obs. of 4 variables:
## $ Name : Factor w/ 3 levels "Ashley","Jennifer",..: 3 1 2
## $ Gender: Factor w/ 2 levels "Female","Male": 2 1 1
## $ Age : num 24 25 23
## $ Major : Factor w/ 3 levels "Computer Science",..: 2 3 1

We can clearly see that Name, Gender, and Major are not character vectors but factor objects. It is reasonable that Gender is represented by a factor because it may only be either Female or Male, so using two integers to represent these two values is more efficient than using a character vector to store all the values regardless of the repetition.
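
To see the integer codes and levels behind a factor, we can inspect one directly. The following sketch builds a standalone factor named gender (a hypothetical object, separate from persons) so that it behaves the same in any version of R:

```r
# gender is a hypothetical standalone factor
gender <- factor(c("Male", "Female", "Female"))

levels(gender)         # "Female" "Male": the sorted unique values
as.integer(gender)     # 2 1 1: positions in the levels
as.character(gender)   # "Male" "Female" "Female"
```

The codes 2 1 1 match what str() reports for the Gender column.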

However, it may induce problems for other columns not limited to taking several possible values. For example, if we want to set a name in persons:

persons[1, "Name"] <- "John"
## Warning in `[<-.factor`(`*tmp*`, iseq, value = "John"): invalid factor
## level, NA generated
persons
##    Name    Gender Age  Major
## 1 <NA>     Male   24   Finance
## 2 Ashley   Female 25   Statistics
## 3 Jennifer Female 23   Computer Science

A warning message appears. This happens because the initial Name dictionary contains no entry called John, so we cannot set the name of the first person to such a non-existing value. The same thing happens if we set any Gender to Unknown. The reason is exactly the same: when a column is converted from a character vector to a factor as the data frame is defined, its values must be taken from the dictionary built from the unique values in that character vector.

This behavior is sometimes very annoying and does not really help much, especially as memory is cheap today. The simplest way to avoid this behavior is to set stringsAsFactors = FALSE when we create a data frame using data.frame():

persons <- data.frame(Name = c("Ken", "Ashley", "Jennifer"),
  Gender = factor(c("Male", "Female", "Female")),
  Age = c(24, 25, 23),
  Major = c("Finance", "Statistics", "Computer Science"),
  stringsAsFactors = FALSE)
str(persons)
## 'data.frame': 3 obs. of 4 variables:
## $ Name : chr "Ken" "Ashley" "Jennifer"
## $ Gender: Factor w/ 2 levels "Female","Male": 2 1 1
## $ Age : num 24 25 23
## $ Major : chr "Finance" "Statistics" "Computer Science"

If we really want a factor object to play its role, we can explicitly call factor() at specific columns, just like we did previously for the Gender column.
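
When a factor is created explicitly, its dictionary can also be controlled. The following sketch shows two ways to allow a value that does not occur in the initial data; the objects gender and name are hypothetical standalone factors, and the "Unknown" level is an assumption made for the illustration:

```r
# Option 1: declare all permitted levels up front
gender <- factor(c("Male", "Female", "Female"),
  levels = c("Female", "Male", "Unknown"))
gender[1] <- "Unknown"    # valid level, so no warning and no NA
as.character(gender)      # "Unknown" "Female" "Female"

# Option 2: append a level to an existing factor before assigning
name <- factor(c("Ken", "Ashley", "Jennifer"))
levels(name) <- c(levels(name), "John")
name[1] <- "John"
as.character(name)        # "John" "Ashley" "Jennifer"
```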

Useful functions for data frames

There are many useful functions for a data frame. Here we introduce only a few of the most commonly used ones.

The summary() function works with a data frame by generating a table that shows the summary statistics of each column:

summary(persons)
##      Name              Gender       Age          Major
##  Length:3           Female:2   Min.   :23.0   Length:3
##  Class :character   Male  :1   1st Qu.:23.5   Class :character
##  Mode  :character              Median :24.0   Mode  :character
##                                Mean   :24.0
##                                3rd Qu.:24.5
##                                Max.   :25.0

For the factor Gender, the summary counts the number of rows taking each value, or level. For a numeric vector, the summary shows the important quantiles of the numbers. For other types of columns, it shows their length, class, and mode.

Another common need is to bind multiple data frames together by either row or column. For this purpose, we can use rbind() and cbind() which, as their names suggest, perform row binding and column binding respectively.

If we want to append some rows to a data frame, in this case, add a new record of a person, we can use rbind():

rbind(persons, data.frame(Name = "John", Gender = "Male", Age = 25, Major = "Statistics"))
##   Name     Gender Age Major
## 1 Ken      Male    24 Finance
## 2 Ashley   Female  25 Statistics
## 3 Jennifer Female  23 Computer Science
## 4 John     Male    25 Statistics

If we want to append some columns to a data frame, in this case, add two new columns to indicate whether each person is registered and the number of projects in hand, we can use cbind():

cbind(persons, Registered = c(TRUE, TRUE, FALSE), Projects = c(3, 2, 3))
##   Name    Gender  Age Major           Registered Projects
## 1 Ken      Male   24  Finance          TRUE         3
## 2 Ashley   Female 25  Statistics       TRUE         2
## 3 Jennifer Female 23  Computer Science FALSE        3

Note that rbind() and cbind() do not modify the original data but create a new data frame with given rows or columns appended.
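
Because the original is left untouched, the result has to be assigned to a name if we want to keep it. The following sketch uses hypothetical objects demo_df and bigger to show this:

```r
# demo_df and bigger are hypothetical example objects
demo_df <- data.frame(id = 1:2, score = c(0.5, 0.2))

bigger <- rbind(demo_df, data.frame(id = 3, score = 0.9))
nrow(demo_df)   # 2: the original is unchanged
nrow(bigger)    # 3: the new data frame has the extra row

# To update a data frame in place, assign the result back to the same name
demo_df <- cbind(demo_df, passed = demo_df$score >= 0.5)
ncol(demo_df)   # 3
```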

Another useful function is expand.grid(). This generates a data frame that includes all combinations of the values in the columns:

expand.grid(type = c("A", "B"), class = c("M", "L", "XL"))
##   type class
## 1  A    M
## 2  B    M
## 3  A    L
## 4  B    L
## 5  A   XL
## 6  B   XL
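
The number of rows produced is the product of the lengths of the supplied vectors, and the first column varies fastest. The following sketch checks this; the object name grid is hypothetical, and stringsAsFactors = FALSE is passed only to keep the columns as character vectors:

```r
# grid is a hypothetical example object
grid <- expand.grid(type = c("A", "B"), class = c("M", "L", "XL"),
  stringsAsFactors = FALSE)

nrow(grid)                         # 6, that is 2 * 3
grid$type[1:2]                     # "A" "B": the first column varies fastest
grid[grid$type == "A", "class"]    # "M" "L" "XL"
```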

There are many other useful functions that work with data frames. We will discuss them in the data manipulation chapters.

Loading and writing data on disk

In practice, data is usually stored in files. R provides a number of functions to read a table from a file or write a data frame to a file. If a file stores a table, it is often well organized and follows some convention that specifies how rows and columns are arranged. In most cases, we don't have to read a file byte by byte but can call functions such as read.table() or read.csv().

The most popular software-neutral data format is CSV (Comma-Separated Values). The format is basically organized in a way that values in different columns are separated by a comma and the first row is by default regarded as the header. For example, persons may be represented in the following CSV format:

Name,Gender,Age,Major
Ken,Male,24,Finance
Ashley,Female,25,Statistics
Jennifer,Female,23,Computer Science

To read the data into the R environment, we only need to call read.csv(file), where file is the path of the file. To ensure that the data file can be found, please place the data folder directly in your working directory; call getwd() to find out where that is. We'll talk about this in detail in the next chapter:

read.csv("data/persons.csv")
##   Name     Gender Age Major
## 1 Ken      Male   24  Finance
## 2 Ashley   Female 25  Statistics
## 3 Jennifer Female 23  Computer Science

If we need to save a data frame to a CSV file, we may call write.csv() with the data frame, the file path, and some additional arguments:

write.csv(persons, "data/persons.csv", row.names = FALSE, quote = FALSE)

The argument row.names = FALSE avoids storing the row names, and the argument quote = FALSE avoids quoting text in the output; in most cases, neither is necessary.
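
A simple way to check that writing and reading are consistent is a round trip through a temporary file. The following sketch assumes nothing about the data folder: it writes a hypothetical demo_df to a path created by tempfile() and reads it back. Quoting is left at its default here, and stringsAsFactors = FALSE is passed explicitly so that the result does not depend on the R version:

```r
# demo_df, path, and restored are hypothetical example objects
demo_df <- data.frame(Name = c("Ken", "Ashley"), Age = c(24, 25),
  stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")
write.csv(demo_df, path, row.names = FALSE)

restored <- read.csv(path, stringsAsFactors = FALSE)
restored$Name   # "Ken" "Ashley"
restored$Age    # 24 25

unlink(path)    # remove the temporary file
```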

There are a number of built-in functions and several packages related to reading and writing data in different formats. We will cover this topic in later chapters.
