Chapter 5: Getting Started with Reading and Writing

In This Chapter

Representing textual data with character vectors

Working with text

Creating, converting, and working with factors

There’s a good reason that reading and writing count as two of the three Rs in elementary education (reading, ’riting, and ’rithmetic). In this chapter, you get to work with words in R.

You assign text to variables. You manipulate these variables in many different ways, including finding text within text and concatenating different pieces of text into a single vector. You also use R functions to sort text and to find words in text with some powerful pattern search functions, called regular expressions. Finally, you work with factors, the R way of representing categories (or categorical data, as statisticians call it).

Using Character Vectors for Text Data

Text in R is represented by character vectors. A character vector is — you guessed it! — a vector consisting of characters. In Figure 5-1, you can see that each element of a character vector is a bit of text.

In the world of computer programming, text often is referred to as a string. In this chapter, we use the word text to refer to a single element of a vector, but you should be aware that the R Help files sometimes refer to strings and sometimes to text. They mean the same thing.

Figure 5-1: Each element of a character vector is a bit of text, also known as a string.

In this section, you take a look at how R uses character vectors to represent text. You assign some text to a character vector and get it to extract subsets of that data. You also get familiar with the very powerful concept of named vectors, vectors in which each element has a name. This is useful because you can then refer to the elements by name as well as position.

Assigning a value to a character vector

You assign a value to a character vector by using the assignment operator (<-), the same way you do for all other variables. You can test whether a variable is of class character by using the is.character() function, as follows:

> x <- "Hello world!"

> is.character(x)

[1] TRUE

Notice that x is a character vector of length 1. To find out how many characters are in the text, use nchar():

> length(x)

[1] 1

> nchar(x)

[1] 12

These results tell you that x has length 1 and that the single element in x has 12 characters.

Creating a character vector with more than one element

To create a character vector with more than one element, use the combine function, c():

> x <- c("Hello", "world!")

> length(x)

[1] 2

> nchar(x)

[1] 5 6

Notice that this time, R tells you that your vector has length 2 and that the first element has five characters and the second element has six characters.

Extracting a subset of a vector

You use the same indexing rules for character vectors that you use for numeric vectors (or for vectors of any type). The process of referring to a subset of a vector through indexing its elements is also called subsetting. In other words, subsetting is the process of extracting a subset of a vector.

To illustrate how to work with vectors, and specifically how to create subsets, we use the built-in constants letters and LETTERS. Both are character vectors consisting of the letters of the alphabet, in lowercase (letters) and uppercase (LETTERS). Try it:

> letters

[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k"

[12] "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v"

[23] "w" "x" "y" "z"

> LETTERS

[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K"

[12] "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V"

[23] "W" "X" "Y" "Z"

These built-in vectors are useful for illustrating subsets in this chapter, and you can also use them whenever you need to make lists of things.

Let’s return to the topic of creating subsets. To extract a specific element from a vector, use square brackets. To get the tenth element of letters, for example, use the following:

> letters[10]

[1] "j"

To get the last three elements of LETTERS, use the following:

> LETTERS[24:26]

[1] "X" "Y" "Z"

The colon operator (:) in R is a handy way of creating sequences, so 24:26 results in 24, 25, 26. When this appears inside the square brackets, R returns elements 24 through 26.

In our last example, it was easy to extract the last three letters of LETTERS, because you know that the alphabet contains 26 letters. Quite often, you don’t know the length of a vector. You can use the tail() function to display the trailing elements of a vector. To get the last five elements of LETTERS, try the following:

> tail(LETTERS, 5)

[1] "V" "W" "X" "Y" "Z"

Similarly, you can use the head() function to get the first elements of a vector. By default, both head() and tail() return six elements, but you can tell them to return any specific number of elements with the second argument. Try extracting the first ten letters:

> head(letters, 10)

[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
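
When you don’t know the length of a vector in advance, you can also compute the positions yourself with length(). The following is a minimal sketch of that idea; the variable names n and last.three are illustrative and don’t come from the chapter:

```r
# Last three elements without hard-coding the length of the vector.
n <- length(LETTERS)
last.three <- LETTERS[(n - 2):n]
print(last.three)

# tail() gives the same result and is shorter to write.
same.as.tail <- identical(last.three, tail(LETTERS, 3))
print(same.as.tail)

# A negative second argument drops elements instead:
# everything except the last 23 letters.
first.three <- head(LETTERS, -23)
print(first.three)
```

In everyday use, tail() is usually the more readable choice; the length() version is handy when you need the positions themselves.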

Naming the values in your vectors

Until this point in the book, we’ve referred to the elements of vectors by their positions — that is, x[5] refers to the fifth element in vector x. One very powerful feature in R, however, gives names to the elements of a vector, which allows you to refer to the elements by name.

You can use these named vectors in R to associate text values (names) with any other type of value. Then you can refer to these values by name in addition to position in the vector. This format has a wide range of applications. For example, named vectors make it easy to create lookup tables.
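
To make the lookup-table idea concrete, here is a small sketch. The airport codes and the variable names (airports, codes, cities) are made up for illustration:

```r
# A named character vector used as a lookup table.
# The codes and city names are illustrative values.
airports <- c(AMS = "Amsterdam", LHR = "London", JFK = "New York")

# Look up one value by name...
print(airports["LHR"])

# ...or translate a whole vector of codes in one step.
codes <- c("JFK", "AMS", "JFK")
cities <- unname(airports[codes])
print(cities)
```

The call to unname() simply drops the names from the result, so that only the looked-up values remain.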

Looking at how named vectors work

To illustrate named vectors, take a look at the built-in dataset islands, a named vector that contains the surface areas, in thousands of square miles, of the world’s 48 largest land masses (continents and large islands). You can investigate its structure with str(), as follows:

> str(islands)

Named num [1:48] 11506 5500 16988 2968 16 ...

- attr(*, "names")= chr [1:48] "Africa" "Antarctica" "Asia" "Australia" ...

R reports the structure of islands as a named vector with 48 elements. In the first line of the results of str(), you see the values of the first few elements of islands. On the second line, R reports that the named vector has an attribute containing names and reports that the first few elements are “Africa”, “Antarctica”, “Asia”, and “Australia”.

Because each element in the vector has a value as well as a name, now you can subset the vector by name. To retrieve the sizes of Asia, Africa, and Antarctica, use the following:

> islands[c("Asia", "Africa", "Antarctica")]

Asia Africa Antarctica

16988 11506 5500

You use the names() function to retrieve the names in a named vector:

> names(islands)[1:9]

[1] "Africa" "Antarctica" "Asia"

[4] "Australia" "Axel Heiberg" "Baffin"

[7] "Banks" "Borneo" "Britain"

This function allows you to do all kinds of interesting things. Imagine you wanted to know the names of the six largest islands. To do this, you would retrieve the names of islands after sorting it in decreasing order:

> names(sort(islands, decreasing=TRUE)[1:6])

[1] "Asia" "Africa" "North America"

[4] "South America" "Antarctica" "Europe"

Creating and assigning named vectors

You use the assignment operator (<-) to assign names to vectors in much the same way that you assign values to character vectors (see “Assigning a value to a character vector,” earlier in this chapter).

Imagine you want to create a named vector with the number of days in each month. First, create a numeric vector containing the number of days in each month. Then use the built-in constant month.name for the month names, as follows:

> month.days <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

> names(month.days) <- month.name

> month.days

January February March April

31 28 31 30

May June July August

31 30 31 31

September October November December

30 31 30 31

Now you can use this vector to find the names of the months with 31 days:

> names(month.days[month.days==31])

[1] "January" "March" "May"

[4] "July" "August" "October"

[7] "December"

This technique works because you subset month.days to return only those values for which month.days equals 31, and then you retrieve the names of the resulting vector.
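
It can help to look at the two steps separately. The sketch below rebuilds month.days so that it runs on its own; the names has.31.days and long.months are illustrative:

```r
month.days <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
names(month.days) <- month.name

# Step 1: the comparison yields one TRUE or FALSE per month.
has.31.days <- month.days == 31
print(has.31.days[1:3])

# Step 2: subsetting with that logical vector keeps only the TRUE
# positions, and the names travel along with the values.
long.months <- month.days[has.31.days]
print(names(long.months))
```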

The double equal sign (==) indicates a test for equality (see Chapter 4). Make sure not to use the single equal sign (=) for equality testing. Not only will a single equal sign not work, but it can have strange side effects because R interprets a single equal sign as an assignment. In other words, the operator = in many cases is the same as <-.

Manipulating Text

When you have text, you need to be able to manipulate it, for example by splitting or combining words. You also may want to analyze your text to find out whether it contains certain keywords or patterns.

In this section, you work with the string splitting and concatenation functions of R. Concatenating (combining) strings is something that programmers do very frequently. For example, when you create a report of your results, it’s customary to combine descriptive text with the actual results of your analysis so that the reader of your results can easily digest it.

Finally, you start to work with finding words and patterns inside text, and you meet regular expressions, a powerful way of doing a wildcard search of text.

String theory: Combining and splitting strings

A collection of combined letters and words is called a string. Whenever you work with text, you need to be able to concatenate words (string them together) and split them apart. In R, you use the paste() function to concatenate and the strsplit() function to split. In this section, we show you how to use both functions.

Splitting text

First, create a character vector called pangram, and assign it the value “The quick brown fox jumps over the lazy dog”, as follows:

> pangram <- "The quick brown fox jumps over the lazy dog"

> pangram

[1] "The quick brown fox jumps over the lazy dog"

To split this text at the word boundaries (spaces), you can use strsplit() as follows:

> strsplit(pangram, " ")

[[1]]

[1] "The" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"

Notice that the unusual first line of strsplit()’s output consists of [[1]]. Similar to the way that R displays vectors, [[1]] means that R is showing the first element of a list. Lists are extremely important concepts in R; they allow you to combine all kinds of variables. You can read more about lists in Chapter 7.

In the preceding example, this list has only a single element. Yes, that’s right: The list has one element, but that element is a vector.
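
The reason strsplit() returns a list becomes clearer when the input has more than one element. In this sketch, the vector sentences and the name pieces are illustrative:

```r
# strsplit() returns one list element per element of the input vector.
sentences <- c("The quick brown fox", "jumps over the lazy dog")
pieces <- strsplit(sentences, " ")

print(length(pieces))   # number of list elements
print(pieces[[2]])      # the words of the second sentence
```

Each list element is a character vector, and the vectors can have different lengths, which is something a single vector could not represent.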

To extract an element from a list, you have to use double square brackets. Split your pangram into words, and assign the first element to a new variable called words, using double-square-brackets ([[]]) subsetting, as follows:

> words <- strsplit(pangram, " ")[[1]]

> words

[1] "The" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"

To find the unique elements of a vector, including a vector of text, you use the unique() function. In the variable words, “the” appears twice: once in lowercase and once with the first letter capitalized. To get a list of the unique words, first convert words to lowercase and then use unique:

> unique(tolower(words))

[1] "the" "quick" "brown" "fox" "jumps" "over" "lazy"

[8] "dog"

Concatenating text

Now that you’ve split text, you can concatenate these elements so that they again form a single text string.

To concatenate text, you use the paste() function:

> paste("The", "quick", "brown", "fox")

[1] "The quick brown fox"

By default, paste() separates the elements with a single blank space. The function takes an argument, sep, that specifies the separator, and its default value is a space (" ").

When you use paste(), or any function that accepts multiple arguments, make sure that you pass arguments in the correct format. Take a look at this example, but notice that this time there is a c() function in the code:

> paste(c("The", "quick", "brown", "fox"))

[1] "The" "quick" "brown" "fox"

What’s happening here? Why doesn’t paste() paste the words together? The reason is that, by using c(), you passed a vector as a single argument to paste(). The c() function combines elements into a vector. By default, paste() concatenates separate vectors; it doesn’t collapse the elements of a single vector.

For the same reason, paste(words) results in the following:

[1] "The" "quick" "brown" "fox" "jumps" "over" "the" "lazy" "dog"

The paste() function takes two optional arguments. The separator (sep) argument controls how different vectors get concatenated, and the collapse argument controls how a vector gets collapsed into itself, so to speak.

When you want to concatenate the elements of a vector by using paste(), you use the collapse argument, as follows:

> paste(words, collapse=" ")

[1] "The quick brown fox jumps over the lazy dog"

The collapse argument of paste() can take any character value. If you want to paste together text by using an underscore, use the following:

> paste(words, collapse="_")

[1] "The_quick_brown_fox_jumps_over_the_lazy_dog"

You can use sep and collapse in the same paste() call. In this case, the vectors are first pasted with sep and then collapsed with collapse. Try this:

> paste(LETTERS[1:5], 1:5, sep="_", collapse="---")

[1] "A_1---B_2---C_3---D_4---E_5"

What happens here is that you first concatenate the elements of each vector with an underscore (that is, A_1, B_2, and so on), and then you collapse the results into a single string with --- between each element.
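
Seeing the two arguments side by side may make the difference easier to remember. This is a small sketch with made-up vectors (first, second) and illustrative result names:

```r
first <- c("a", "b", "c")
second <- c("x", "y", "z")

# sep joins the vectors element by element;
# the result still has three elements.
combined <- paste(first, second, sep = "-")
print(combined)

# collapse then joins those elements into one string.
joined <- paste(first, second, sep = "-", collapse = "+")
print(joined)
```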

The paste() function takes vectors as input and joins them together. If one vector is shorter than the other, R recycles (repeats) the shorter vector to match the length of the longer one — a powerful feature.

Suppose that you have five objects, and you want to label them “Sample 1”, “Sample 2”, and so on. You can do this by passing a short vector with the value "Sample" and a longer vector with the values 1:5 to paste(). In this example, the shorter vector is repeated five times:

> paste("Sample", 1:5)

[1] "Sample 1" "Sample 2" "Sample 3" "Sample 4" "Sample 5"

Sorting text

What do league tables, telephone directories, dictionaries, and the index pages of a book have in common? They present data in some sorted manner. Data can be sorted alphabetically or numerically, in ascending or descending order. Like any programming language, R makes it easy to compile lists of sorted and ordered data.

Recycling character vectors

When you perform operations on vectors of different lengths, R automatically adjusts the length of the shorter vector to match the longer one. This is called recycling, because R recycles the elements of the shorter vector to create a new vector that matches the length of the longer one.

This feature is very powerful but can lead to confusion if you aren’t aware of it.

The rules for recycling character vectors are exactly the same as for numeric vectors (see Chapter 4).

Here are a few examples of vector recycling using paste():

> paste(c("A", "B"), c(1, 2, 3, 4), sep="-")

[1] "A-1" "B-2" "A-3" "B-4"

> paste(c("A"), c(1, 2, 3, 4, 5), sep="-")

[1] "A-1" "A-2" "A-3" "A-4" "A-5"

See how in the first example A and B get recycled to match the vector of length four. In the second example, the single A also gets recycled — in this case, five times.
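
One case where recycling can catch you out is when the longer length is not a multiple of the shorter one. Arithmetic on numeric vectors warns you about this, but paste() recycles silently. A minimal sketch (the name labels is illustrative):

```r
# The longer length (3) is not a multiple of the shorter one (2),
# so "A" is reused for the third element. paste() does this silently.
labels <- paste(c("A", "B"), 1:3, sep = "-")
print(labels)
```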

Because text in R is represented as character vectors, you can sort these vectors using the same functions as you use with numeric data. For example, to get R to sort the alphabet in reverse, use the sort() function:

> sort(letters, decreasing=TRUE)

[1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p"

[12] "o" "n" "m" "l" "k" "j" "i" "h" "g" "f" "e"

[23] "d" "c" "b" "a"

Here you used the decreasing argument of sort().

The sort() function sorts a vector. It doesn’t sort the characters of each element of the vector. In other words, sort() doesn’t mangle the word itself. You can still read each of the words in words.

Try it on the vector words that you created earlier in this chapter:

> sort(words)

[1] "brown" "dog" "fox" "jumps" "lazy"

[6] "over" "quick" "the" "The"

R performs lexicographic sorting, as opposed to, for example, the C language, which sorts in ASCII order. This means that the sort order will depend on the locale of the machine the code runs on. In other words, the sort order may be different if the machine running R is configured to use Danish than it will if the machine is configured to use English. The R help file contains this description:

Beware of making any assumptions about the collation order: e.g., in Estonian, Z comes between S and T, and collation is not necessarily character-by-character — in Danish aa sorts as a single letter, after z.

In most cases, lexicographic sorting simply means that the sort order is independent of whether the string is in lowercase or uppercase. For more details, read the help text in ?sort as well as ?Comparison.
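
If you need a sort order that doesn’t depend on the locale, one option is the method argument of sort(). The sketch below assumes a reasonably recent version of R, in which method = "radix" sorts character vectors in the C locale; the vector mixed is made up for illustration:

```r
mixed <- c("the", "The", "brown", "Dog")

# Locale-dependent result: the exact order can differ by machine.
print(sort(mixed))

# method = "radix" ignores the locale and uses C-locale collation,
# so all uppercase letters come before all lowercase letters.
c.order <- sort(mixed, method = "radix")
print(c.order)
```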

You can get help on any function by typing a question mark followed by the function name into the console. For other ways of getting help, refer to Chapter 11.

Finding text inside text

When you’re working with text, often you can solve problems if you’re able to find words or patterns inside text. Imagine you have a list of the states in the United States, and you want to find out which of these states contains the word New. Or, say you want to find out which state names consist of two words.

To solve the first problem, you need to search for individual words (in this case, the word New). And to solve the second problem, you need to search for multiple words. We cover both problems in this section.

Searching for individual words

To investigate this problem, you can use the built-in dataset state.name, which contains (you guessed it) the names of the states of the United States:

> head(state.name)

[1] "Alabama" "Alaska" "Arizona"

[4] "Arkansas" "California" "Colorado"

Broadly speaking, you can find substrings in text in two ways:

By position: For example, you can tell R to get three letters starting at position 5.

By pattern: For example, you can tell R to get substrings that match a specific word or pattern.

A pattern is a bit like a wildcard. In some card games, you may use the Joker card to represent any other card. Similarly, a pattern in R can contain words or certain symbols with special meanings.

Searching by position

If you know the exact position of a subtext inside a text element, you use the substr() function to return the value. To extract the subtext that starts at the third position and stops at the sixth position of each element of state.name, use the following:

> head(substr(state.name, start=3, stop=6))

[1] "abam" "aska" "izon" "kans" "lifo" "lora"
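
The start and stop arguments can themselves be vectors, which is useful when the position you want depends on the length of each element. Here is a small sketch that extracts the last three characters of each word; the vector x and the name endings are illustrative:

```r
x <- c("Alabama", "Alaska", "Arizona")

# Use nchar() to compute the start position for each element,
# so the last three characters are returned whatever the word length.
endings <- substr(x, start = nchar(x) - 2, stop = nchar(x))
print(endings)
```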

Searching by pattern

To find substrings, you can use the grep() function, which takes two essential arguments:

pattern: The pattern you want to find.

x: The character vector you want to search.

Suppose you want to find all the states that contain the pattern New. Do it like this:

> grep("New", state.name)

[1] 29 30 31 32

The result of grep() is a numeric vector with the positions of each of the elements that contain the matching pattern. For example, the 29th element of state.name contains the word New.

> state.name[29]

[1] "New Hampshire"

Phew, that worked! But typing in the position of each matching text is going to be a lot of work. Fortunately, you can use the results of grep() directly to subset the original vector:

> state.name[grep("New", state.name)]

[1] "New Hampshire" "New Jersey"

[3] "New Mexico" "New York"

The grep() function is case sensitive: it matches only text in the same case (uppercase or lowercase) as your search pattern. If you search for the pattern "new" in lowercase, your search results are empty:

> state.name[grep("new", state.name)]

character(0)
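
If you do want a match regardless of case, grep() has an ignore.case argument. The sketch below also shows grepl(), a close relative of grep() that returns TRUE or FALSE for each element instead of positions. The names found and hits are illustrative:

```r
# ignore.case = TRUE makes the match case insensitive.
found <- state.name[grep("new", state.name, ignore.case = TRUE)]
print(found)

# grepl() returns a logical vector (one TRUE or FALSE per element)
# instead of positions; it works equally well for subsetting.
hits <- grepl("New", state.name)
print(sum(hits))
print(state.name[hits])
```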

Searching for multiple words

So, how do you find the names of all the states with more than one word? This is easy when you realize that you can frame the question by finding all those states that contain a space:

> state.name[grep(" ", state.name)]

[1] "New Hampshire" "New Jersey"

[3] "New Mexico" "New York"

[5] "North Carolina" "North Dakota"

[7] "Rhode Island" "South Carolina"

[9] "South Dakota" "West Virginia"

The results include all the states that have two-word names, such as New Jersey, New York, North Carolina, South Dakota, and West Virginia.

You can see from this list that there are no state names that contain East. You can confirm this by doing another find:

> state.name[grep("East", state.name)]

character(0)

When the result of a character operation is an empty vector (that is, there is nothing in it), R represents it as character(0). Similarly, an empty, or zero-length, numeric vector is represented with integer(0) or numeric(0) (see Chapter 4).

R makes a distinction between NULL and an empty vector. NULL usually means something is undefined. This is subtly different from something that is empty. For example, a character vector that happens to have no elements is still a character vector, represented by character(0).
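
You can check the difference for yourself with a few test functions. This is a minimal sketch; the variable name empty is illustrative:

```r
empty <- character(0)

print(length(empty))        # zero elements...
print(is.character(empty))  # ...but still a character vector
print(is.null(empty))       # and not NULL

print(is.null(NULL))        # NULL is a separate object
print(length(NULL))         # which also happens to have length zero
```

Because both have length zero, testing length(x) == 0 catches either case, whereas is.null() catches only NULL.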

Getting a grip on grep

The name of the grep() function originated in the Unix world. It’s an acronym for Global Regular Expression Print. Regular expressions are a very powerful way of expressing patterns of matching text, usually in a very formal language. Whole books have been written about regular expressions. We give a very short introduction in “Revving up with regular expressions,” later in this chapter.

The function name grep() appears in many programming languages that deal with text and reporting. Perl, for example, is famous for its extensive grep functionality. For more information, check out Perl For Dummies, 4th Edition, by Paul Hoffman (Wiley).

Substituting text

The sub() function (short for substitute) searches for a pattern in text and replaces this pattern with replacement text. You use sub() to replace only the first occurrence of a pattern in each element, and you use its cousin gsub() to replace all occurrences. (The g in gsub() stands for global.)

Suppose you have the sentence A wolf in cheap clothing, which is clearly a mistake. You can fix it with a gsub() substitution. The gsub() function takes three arguments: the pattern to find, the replacement text, and the text to modify:

> gsub("cheap", "sheep's", "A wolf in cheap clothing")

[1] "A wolf in sheep's clothing"
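
The difference between sub() and gsub() shows up only when a pattern occurs more than once. Here is a short sketch with a made-up example word:

```r
fruit <- "banana"

# sub() replaces only the first match...
first.only <- sub("a", "A", fruit)
print(first.only)

# ...while gsub() replaces every match.
all.matches <- gsub("a", "A", fruit)
print(all.matches)
```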

Another common type of problem that can be solved with text substitution is removing substrings. Removing substrings is the same as replacing the substring with empty text (that is, nothing at all).

Imagine a situation in which you have three file names in a vector: file_a.csv, file_b.csv, and file_c.csv. Your task is to extract the a, b, and c from those file names. You can do this in two steps: First, replace the pattern "file_" with nothing, and then replace ".csv" with nothing. You’ll be left with your desired vector:

> x <- c("file_a.csv", "file_b.csv", "file_c.csv")

> y <- gsub("file_", "", x)

> y

[1] "a.csv" "b.csv" "c.csv"

> gsub(".csv", "", y, fixed=TRUE)

[1] "a" "b" "c"

Notice the fixed=TRUE argument in the last step. It tells gsub() to treat ".csv" as literal text; without it, the dot acts as a wildcard that matches any character.

In “Revving up with regular expressions,” later in this chapter, you see how to perform these two substitutions in a single expression.

Revving up with regular expressions

Until this point, you’ve worked only with fixed expressions to find or substitute text. This is useful but also limited. R supports the concept of regular expressions, which allows you to search for patterns inside text.

Extending text functionality with stringr

After this quick tour through the text manipulation functions of R, you probably wonder why all these functions have such unmemorable names and seemingly diverse syntax. If so, you’re not alone. In fact, Hadley Wickham wrote a package available from CRAN that simplifies and standardizes working with text in R. This package is called stringr, and you can install it by using the R console or by choosing Tools⇒Install Packages in RStudio (see Chapter 3).

Remember: Although you have to install a package only once, you have to load it into the workspace using the library() function every time you start a new R session and plan to use the functions in that package.

> install.packages("stringr")

> library(stringr)

Here are some of the advantages of using stringr rather than the standard R functions:

Function names and arguments are consistent and more descriptive. For example, all stringr functions have names starting with str_ (such as str_detect() and str_replace()).

stringr has a more consistent way of dealing with cases with missing data or empty values.

stringr has a more consistent way of ensuring that input and output data are of the same type.

The closest stringr equivalent for grep() is str_detect(), which returns a logical vector (like the base R function grepl()) rather than positions, and the equivalent for gsub() is str_replace_all().

As a starting point to explore stringr, you may find some of these functions useful:

str_detect(): Detects the presence or absence of a pattern in a string

str_extract(): Extracts the first piece of a string that matches a pattern

str_length(): Returns the length of a string (in characters)

str_locate(): Locates the position of the first occurrence of a pattern in a string

str_match(): Extracts the first matched group from a string

str_replace(): Replaces the first occurrence of a matched pattern in a string

str_split(): Splits up a string into a variable number of pieces

str_sub(): Extracts substrings from a character vector

str_trim(): Trims white space from the start and end of string

str_wrap(): Wraps strings into nicely formatted paragraphs

You may never have heard of regular expressions, but you’re probably familiar with the broad concept. If you’ve ever used an * or a ? to indicate any letter in a word, then you’ve used a form of wildcard search. Regular expressions support the idea of wildcards and much more.

Regular expressions allow three ways of making a search pattern more general than a single, fixed expression:

Alternatives: You can search for instances of one pattern or another, indicated by the | symbol. For example beach|beech matches both beach and beech.

On English and American English keyboards, you can usually find the | on the same key as the backslash (\).

Grouping: You group patterns together using parentheses ( ). For example you write be(a|e)ch to find both beach and beech.

Quantifiers: You specify how often an element in the pattern may be repeated by adding * (occurs zero or more times) or + (occurs one or more times). For example, to find either bach or beech (zero or more of a, or zero or more of e, but not a mix of both), you use b(e*|a*)ch.

Try the following examples. First, create a new variable with five words:

> rwords <- c("bach", "back", "beech", "beach", "black")

Find either beach or beech using alternative matching:

> grep("beach|beech", rwords)

[1] 3 4

This means the search string was found in elements 3 and 4 of rwords. To extract the actual elements, you can use subsetting with square brackets:

> rwords[grep("beach|beech", rwords)]

[1] "beech" "beach"

Now use the grouping rule to extract the same words:

> rwords[grep("be(a|e)ch", rwords)]

[1] "beech" "beach"

Lastly, use the quantifier modification to extract bach and beech but not beach:

> rwords[grep("b(e*|a*)ch", rwords)]

[1] "bach" "beech"
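
Alternatives also make it possible to do the two file-name substitutions from “Substituting text” in a single step. The following is one possible sketch, not the only way to do it. It relies on three pieces of regular expression syntax that this chapter doesn’t cover: ^ (start of the text), $ (end of the text), and \\. (a literal dot). The name stems is illustrative:

```r
x <- c("file_a.csv", "file_b.csv", "file_c.csv")

# "^file_" matches the prefix at the start of the string (^),
# "\\.csv$" matches a literal dot followed by csv at the end ($),
# and | means "either one". gsub() removes every match.
stems <- gsub("^file_|\\.csv$", "", x)
print(stems)
```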

To find more help in R about regular expressions, look at the Help page ?regexp. Some other great resources for learning more about regular expressions are Wikipedia (http://en.wikipedia.org/wiki/Regular_expression) and www.regular-expressions.info, where you can find a quick-start guide and tutorials.

Factoring in Factors

In real-world problems, you often encounter data that can be described using words rather than numerical values. For example, cars can be red, green, or blue (or any other color); people can be left-handed or right-handed, male or female; energy can be derived from coal, nuclear, wind, or wave power. You can use the term categorical data to describe these examples — or anything else that can be classified in categories.

R has a special data structure for categorical data, called factors. Factors are closely related to characters because any character vector can be represented by a factor.

Factors are special types of objects in R. They’re neither character vectors nor numeric vectors, although they have some attributes of both. Factors behave a little bit like character vectors in the sense that the unique categories often are text. Factors also behave a little bit like integer vectors because R encodes the levels as integers.

Creating a factor

To create a factor in R, you use the factor() function. The first three arguments of factor() warrant some exploration:

x: The input vector that you want to turn into a factor.

levels: An optional vector of the values that x might have taken. The default is lexicographically sorted, unique values of x.

labels: Another optional vector that, by default, takes the same values as levels. You can use this argument to rename your levels, as we explain in the next paragraph.

The fact that you can supply both levels and labels to factor can lead to confusion. Just remember that levels refers to the input values of x, while labels refers to the output values of the new factor.

Consider the following example of a vector consisting of compass directions:

> directions <- c("North", "East", "South", "South")

Notice that this vector contains the value “South” twice and lacks the value “West”. First, convert directions to a factor:

> factor(directions)

[1] North East South South

Levels: East North South

Notice that the levels of your new factor do not contain the value “West”, which is as expected. In practice, however, it makes sense to have all the possible compass directions as levels of your factor. To add the missing level, you specify the levels argument of factor():

> factor(directions, levels=c("North", "East", "South", "West"))

[1] North East South South

Levels: North East South West

As you can see, the values are still the same but this time the levels also contain “West”.

Now imagine that you actually prefer to have abbreviated names for the levels. To do this, you make use of the labels argument:

> factor(directions, levels=c("North", "East", "South", "West"), labels=c("N", "E", "S", "W"))

[1] N E S S

Levels: N E S W
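
One practical reason to declare all possible levels is that summaries then include the categories that happen to be missing from your data. The sketch below uses table(), a function that counts how often each level occurs and that isn’t covered in this chapter; the names all.levels and counts are illustrative:

```r
directions <- c("North", "East", "South", "South")
all.levels <- c("North", "East", "South", "West")
directions.factor <- factor(directions, levels = all.levels)

# levels() lists every category, including the unused "West".
print(levels(directions.factor))

# table() counts observations per level,
# so "West" shows up with a count of 0.
counts <- table(directions.factor)
print(counts)
```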

Converting a factor

Sometimes you need to explicitly convert factors to either text or numbers. To do this, you use the functions as.character() or as.numeric().

First, convert your directions vector into a factor called directions.factor (as you saw earlier):

> directions <- c("North", "East", "South", "South")

> directions.factor <- factor(directions)

> directions.factor

[1] North East South South

Levels: East North South

Use as.character() to convert a factor to a character vector:

> as.character(directions.factor)

[1] "North" "East" "South" "South"

Use as.numeric() to convert a factor to a numeric vector. Note that this will return the numeric codes that correspond to the factor levels. For example, “East” corresponds to 1, “North” corresponds to 2, and so forth:

> as.numeric(directions.factor)

[1] 2 1 3 3

Be very careful when you convert factors with numeric levels to a numeric vector. The results may not be what you expect.

For example, imagine you have a vector that indicates some test score results with the values c(9, 8, 10, 8, 9), which you convert to a factor:

> numbers <- factor(c(9, 8, 10, 8, 9))

To look at the internal representation of numbers, use str():

> str(numbers)

Factor w/ 3 levels "8","9","10": 2 1 3 1 2

This indicates that R stores the values as c(2, 1, 3, 1, 2) with associated levels of c(“8”, “9”, “10”). Figure 5-2 gives a graphical representation of this difference between the levels and the internal representation.

Figure 5-2: A visual comparison between a numeric vector and a factor.


If you want to convert numbers to a character vector, the results are pretty much as you would expect:

> as.character(numbers)

[1] "9"  "8"  "10" "8"  "9"

However, if you simply use as.numeric(), your result is a vector of the internal level representations of your factor and not the original values:

> as.numeric(numbers)

[1] 2 1 3 1 2

The R help at ?factor describes a solution to this problem. The solution is to convert the factor to a character vector first, which recovers the level labels, and then to convert that result to numeric:

> as.numeric(as.character(numbers))

[1] 9 8 10 8 9

This is an example of nested functions in R, in which you pass the results of one function to a second function. Nested functions are a bit like the Russian nesting dolls, where each toy is inside the next:

The inner function, as.character(numbers), returns the text c("9", "8", "10", "8", "9").

The outer function, as.numeric(...), does the final conversion to c(9, 8, 10, 8, 9).
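
The help page at ?factor also mentions a second idiom, as.numeric(levels(f))[f], which converts the levels once and then indexes them by the factor. The sketch below compares the two routes on the test scores from this section; the names via.character and via.levels are only illustrative:

```r
numbers <- factor(c(9, 8, 10, 8, 9))

# Route 1: factor -> character -> numeric (the nested call)
via.character <- as.numeric(as.character(numbers))

# Route 2: convert the levels once, then index them by the factor's codes
via.levels <- as.numeric(levels(numbers))[numbers]

# Both routes recover the original scores
identical(via.character, via.levels)

# The pitfall: this returns the internal codes, not the scores
as.numeric(numbers)
```

Either route works; the second one converts only the handful of levels rather than every element, which can matter for long vectors.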

Looking at levels

To look a little bit under the hood of the structure of a factor, use the str() function:

> str(state.region)

Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...

R reports the structure of state.region as a factor with four levels. You can see that the first two levels are “Northeast” and “South”, but these levels are represented as integers 1, 2, 3, and 4.

Factors are a convenient way to describe categorical data. Internally, R stores a factor as a vector of integer codes, one per value, with each code pointing to one of the levels. This means you can set and investigate the levels of a factor separately from the values of the factor.

To look at the levels of a factor, you use the levels() function. For example, to extract the factor levels of state.region, use the following:

> levels(state.region)

[1] "Northeast"     "South"

[3] "North Central" "West"

Because the values of the factor are linked to the levels, when you change the levels, you also indirectly change the values themselves. To make this clear, change the levels of state.region to the values “NE”, “S”, “NC”, and “W”:

> levels(state.region) <- c("NE", "S", "NC", "W")

> head(state.region)

[1] S W W S W W

Levels: NE S NC W
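
To see what renaming the levels does and does not change, it can help to look at the integer codes before and after. This is a small sketch on a hand-made factor rather than on state.region itself; the names regions and codes.before are hypothetical:

```r
regions <- factor(c("South", "West", "West", "South"),
                  levels = c("Northeast", "South", "North Central", "West"))

# Remember the internal integer codes before renaming
codes.before <- as.integer(regions)

# Rename the levels; the new names must follow the existing level order
levels(regions) <- c("NE", "S", "NC", "W")

# The values now print with the new labels ...
regions

# ... but the internal codes are exactly what they were
identical(as.integer(regions), codes.before)
```

The order of the replacement vector matters: R matches new names to old levels purely by position.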

Sometimes it’s useful to know the number of levels of a factor. The convenience function nlevels() extracts the number of levels from a factor:

> nlevels(state.region)

[1] 4

Because the levels of a factor are internally stored by R as a vector, you also can extract the number of levels using length:

> length(levels(state.region))

[1] 4

For the very same reason, you can index the levels of a factor using standard vector subsetting rules. For example, to extract the second and third factor levels, use the following:

> levels(state.region)[2:3]

[1] "S"  "NC"
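
The same subsetting syntax also works on the left side of an assignment, which lets you rename one level without retyping the rest. A brief sketch, using a hypothetical factor named area and an illustrative replacement label:

```r
area <- factor(c("South", "West", "North Central"),
               levels = c("Northeast", "South", "North Central", "West"))

# Read a subset of the levels by position
levels(area)[2:3]

# Rename only the third level, leaving the others untouched
levels(area)[3] <- "Midwest"
levels(area)

# The values that used the old label follow along
as.character(area)
```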

Distinguishing data types

In the field of statistics, being able to distinguish between variables of different types is very important. The type of data very often determines the type of analysis that can be performed. As a result, R offers the ability to explicitly classify data as follows:

Nominal data: This type of data, which you represent in R using factors, distinguishes between different categories, but there is no implied order between categories. Examples of nominal data are colors (red, green, blue), gender (male, female), and nationality (British, French, Japanese).

Ordinal data: Ordinal data is distinguished by the fact that there is some kind of natural order between elements but no indication of the relative size difference. Any kind of data that is possible to rank in order but not give exact values to is ordinal. For example, low < medium < high describes data that is ordered with three levels.

In market research, it’s very common to use a five-point scale to measure perceptions: strongly disagree < disagree < neutral < agree < strongly agree. This is also an example of ordinal data.

Another example is the use of the names of colors to indicate order, such as red < amber < green to indicate project status.

Tip: In R, you use ordered factors to describe ordinal data. For more on ordered factors, see the “Working with ordered factors” section, later in this chapter.

Numeric data: You have numeric data when you can describe your data with numbers (for example, length, weight, or count). Numeric data has two subcategories.

• Interval scaled data: You have interval scaled data when the interval between adjacent units of measurement is the same, but the zero point is arbitrary. An everyday example of interval scaled data is our calendar system. Each year has the same length, but the zero point is arbitrary. In other words, time didn’t start in the year zero — we simply use a convenient year to start counting. This means you can add and subtract dates (and all other types of interval scaled data), but you can’t meaningfully divide dates. Other examples include longitude, as well as anything else where there can be disagreement about where the starting point is.

Other examples of interval scaled data can be found in social science research such as market research.

In R you can use integer or numeric objects to represent interval scaled data.

• Ratio scaled data: This is data where all kinds of mathematical operations are allowed, in particular the ability to multiply and divide (in other words, take ratios). Most data in physical sciences are ratio scaled — for example, length, mass, and speed. In R, you use numeric objects to represent ratio scaled data.
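
The four data types map onto R objects roughly as follows. This is only a sketch with made-up example values (the names colors, rating, start, end, and lengths.cm are illustrative), and it uses Date objects as the interval scaled example:

```r
# Nominal: an unordered factor
colors <- factor(c("red", "green", "blue", "green"))
is.ordered(colors)                 # FALSE

# Ordinal: an ordered factor, which supports comparisons
rating <- factor(c("low", "high", "medium"),
                 levels = c("low", "medium", "high"), ordered = TRUE)
rating < "high"                    # TRUE FALSE TRUE

# Interval scaled: dates can be subtracted ...
start <- as.Date("2012-01-01")
end   <- as.Date("2012-03-01")
end - start                        # a time difference in days

# ... but dividing a date is not defined, so R signals an error
divided <- tryCatch(end / 2, error = function(e) "not defined")
divided

# Ratio scaled: ordinary numeric vectors allow ratios
lengths.cm <- c(10, 25)
lengths.cm[2] / lengths.cm[1]      # 2.5
```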

Summarizing categorical data

In most practical cases where you have categorized data, some values are repeated. As a practical example, consider the states of the United States. Each state is in one of four regions: Northeast, South, North Central, or West (at least according to R). Have a look at the built-in dataset state.region:

> head(state.region)

[1] South West West South West West

Levels: Northeast South North Central West

You can use the handy table() function to get a tabular summary of the values of a factor:

> table(state.region)

state.region

    Northeast         South North Central          West

            9            16            12            13

This tells you that the Northeast region has 9 states, the South region has 16 states, and so on.

The table() function works by counting the number of occurrences of each factor level. You can learn more about table() in the Help page at ?table.

Working with ordered factors

Sometimes data has some kind of natural order in which some elements are in some sense “better” or “worse” than other elements, but at the same time it’s impossible to ascribe a meaningful value to these. An example is any situation where project status is described as low, medium, or high. A similar example is a traffic light that can be red, yellow, or green.

The name for this type of data, where rank ordering is important, is ordinal data. In R, there is a special data type for ordinal data. This type is called ordered factors and is an extension of factors that you’re already familiar with.

To create an ordered factor in R, you have two options:

Use the factor() function with the argument ordered=TRUE.

Use the ordered() function.

Say you want to represent the status of five projects. Each project has a status of low, medium, or high:

> status <- c("Lo", "Hi", "Med", "Med", "Hi")

Now create an ordered factor with this status data:

> ordered.status <- factor(status, levels=c("Lo", "Med", "Hi"), ordered=TRUE)

> ordered.status

[1] Lo Hi Med Med Hi

Levels: Lo < Med < Hi
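
The second option, ordered(), is not shown in the example above. The sketch below builds the same factor both ways so you can confirm that they agree; the names via.factor and via.ordered are only illustrative:

```r
status <- c("Lo", "Hi", "Med", "Med", "Hi")

# Option 1: factor() with ordered = TRUE
via.factor <- factor(status, levels = c("Lo", "Med", "Hi"), ordered = TRUE)

# Option 2: the ordered() convenience function
via.ordered <- ordered(status, levels = c("Lo", "Med", "Hi"))

# Compare the two objects
identical(via.factor, via.ordered)

# is.ordered() reports whether a factor carries ordering information
is.ordered(via.ordered)
```

In both cases you still need to pass levels in the order you want; otherwise R falls back to the sorted order of the values.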

You can tell an ordered factor from an ordinary factor by the presence of less-than signs (<) between the levels.

In R, there is a really big practical advantage to using ordered factors. A great many R functions recognize and treat ordered factors differently by printing results in the order that you expect. For example, compare the results of table(status) with table(ordered.status):

> table(status)

status

Hi Lo Med

2 1 2

Notice that the results are ordered alphabetically. However, performing the same function on the ordered factor yields results that are easier to interpret because they’re now sorted in the order Lo, Med, Hi:

> table(ordered.status)

ordered.status

Lo Med Hi

1 2 2

R preserves the ordering information inherent in ordered factors. In Part V, you see how this becomes an essential tool to gain control over the appearance of bar charts.

Also, in statistical modeling, R applies the appropriate statistical transformation (of contrasts) when you have factors or ordered factors in your model. In Chapter 15, you do some statistical modeling with categorical variables.
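
Beyond table(), the ordering information also shows up when you sort or compare values. A short sketch, reusing the project status example from this section:

```r
status <- c("Lo", "Hi", "Med", "Med", "Hi")
ordered.status <- factor(status, levels = c("Lo", "Med", "Hi"), ordered = TRUE)

# Sorting follows the level order, not the alphabet
sort(ordered.status)

# Comparisons against a level are meaningful for ordered factors
ordered.status >= "Med"

# The lowest and highest status that actually occur
range(ordered.status)
```

With an ordinary (unordered) factor, the comparison and range() calls are not meaningful, which is one more reason to mark ordinal data as ordered.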
