As well as dealing with numbers and logical values, at some point you will almost certainly need to manipulate text. This is particularly common when you are retrieving or cleaning datasets. Perhaps you are trying to turn the text of a log file into meaningful values, or correct the typos in your data. These data-cleaning activities will be discussed in more depth in Chapter 13, but for now, you will learn how to manipulate character vectors.
Factors are used to store categorical data like gender (“male” or “female”) where there are a limited number of options for a string. They sometimes behave like character vectors and sometimes like integer vectors, depending upon context.
After reading this chapter, you should:
Text data is stored in character vectors (or, less commonly, character arrays). It’s important to remember that each element of a character vector is a whole string, rather than just an individual character. In R, “string” is an informal term that is used because “element of a character vector” is quite a mouthful.
The fact that the basic unit of text is a character vector means that most string manipulation functions operate on vectors of strings, in the same way that mathematical operations are vectorized.
As you’ve already seen, character vectors can be created with the c function. We can use single or double quotes around our strings, as long as they match, though double quotes are considered more standard:
c(
  "You should use double quotes most of the time",
  'Single quotes are better for including " inside the string'
)
## [1] "You should use double quotes most of the time"
## [2] "Single quotes are better for including \" inside the string"
The paste function combines strings together. Each vector passed to it has its elements recycled to reach the length of the longest input, and then the strings are concatenated, with a space separating them. We can change the separator by passing an argument called sep, or use the related function paste0 to have no separator. After all the strings are combined, the result can be collapsed into one string containing everything using the collapse argument:
paste(c("red", "yellow"), "lorry")
## [1] "red lorry" "yellow lorry"
paste(c("red", "yellow"), "lorry", sep = "-")
## [1] "red-lorry" "yellow-lorry"
paste(c("red", "yellow"), "lorry", collapse = ", ")
## [1] "red lorry, yellow lorry"
paste0(c("red", "yellow"), "lorry")
## [1] "redlorry" "yellowlorry"
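The sep and collapse arguments can also be combined in a single call. The following is a small sketch rather than one of the original examples, and the variable names (colours, joined, squashed) are made up for illustration. The elements are first joined pairwise using sep, and the resulting vector is then flattened into one string using collapse:

```r
colours <- c("red", "yellow")   # hypothetical example vector

# sep joins the recycled inputs element by element,
# then collapse flattens the result into a single string
joined <- paste(colours, "lorry", sep = "-", collapse = ", ")
joined
## [1] "red-lorry, yellow-lorry"

# paste0 has no sep argument, but it does accept collapse
squashed <- paste0(colours, "lorry", collapse = " & ")
squashed
## [1] "redlorry & yellowlorry"
```

Whenever collapse is supplied, the result is a character vector of length one, however long the inputs were.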
The function toString is a variation of paste that is useful for printing vectors. It separates each element with a comma and a space, and can limit how much we print. In the following example, width = 40 limits the output to 40 characters:
x <- (1:15) ^ 2
toString(x)
## [1] "1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225"
toString(x, width = 40)
## [1] "1, 4, 9, 16, 25, 36, 49, 64, 81, 100...."
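To make the relationship with paste concrete, here is a minimal sketch (the name by_hand is just illustrative) that checks that toString gives the same result as collapsing with a comma and a space, and that the width limit includes the trailing dots:

```r
x <- (1:15) ^ 2

# Collapsing with ", " by hand reproduces the default toString output
by_hand <- paste(x, collapse = ", ")
identical(toString(x), by_hand)
## [1] TRUE

# The truncated string, including the trailing dots, fits in the width
nchar(toString(x, width = 40))
## [1] 40
```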
cat is a low-level function that works similarly to paste, but with less formatting. You should rarely need to call it directly yourself, but it is worth being aware of, since it is the basis for most of the print functions. cat can also accept a file argument to write its output to a file:
cat(c("red", "yellow"), "lorry")
## red yellow lorry
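As a rough illustration of the file argument, the sketch below writes to a temporary file and reads the result back. The variable names and the extra "blue lorry" line are invented for the example; tempfile, readLines, and unlink are standard base R functions.

```r
tmp <- tempfile(fileext = ".txt")   # hypothetical scratch file

# cat does not add a newline by itself, so we include one explicitly
cat(c("red", "yellow"), "lorry\n", file = tmp)

# append = TRUE adds to the file instead of overwriting it
cat("blue lorry\n", file = tmp, append = TRUE)

lines_read <- readLines(tmp)
lines_read
## [1] "red yellow lorry" "blue lorry"

unlink(tmp)   # tidy up the temporary file
```

Unlike paste, cat returns nothing useful; it is called for its side effect of writing text to the console or a file.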
Usually, when strings are printed to the console they are shown wrapped in double quotes. By wrapping a variable in a call to the noquote function, we can suppress those quotes. This can make the text more readable in some instances:
x <- c(
  "I", "saw", "a", "saw", "that", "could", "out",
  "saw", "any", "other", "saw", "I", "ever", "saw"
)
y <- noquote(x)
x
## [1] "I"     "saw"   "a"     "saw"   "that"  "could" "out"   "saw"
## [9] "any"   "other" "saw"   "I"     "ever"  "saw"
y
##  [1] I     saw   a     saw   that  could out   saw   any   other saw
## [12] I     ever  saw
There are several functions for formatting numbers. formatC uses C-style formatting specifications that allow you to specify fixed or scientific formatting, the number of decimal places, and the width of the output. Whatever the options, the input should be one of the numeric types (including arrays), and the output is a character vector or array:
pow <- 1:3
(powers_of_e <- exp(pow))
## [1]  2.718  7.389 20.086
formatC(powers_of_e)
## [1] "2.718" "7.389" "20.09"
formatC(powers_of_e, digits = 3)               # 3 sig figs
## [1] "2.72" "7.39" "20.1"
formatC(powers_of_e, digits = 3, width = 10)   # preceding spaces
## [1] "      2.72" "      7.39" "      20.1"
formatC(powers_of_e, digits = 3, format = "e") # scientific formatting
## [1] "2.718e+00" "7.389e+00" "2.009e+01"
formatC(powers_of_e, digits = 3, flag = "+")   # precede +ve values with +
## [1] "+2.72" "+7.39" "+20.1"
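Two other common uses of formatC are zero-padding and fixed decimal places. The following is a sketch using made-up values (ids is a hypothetical vector of integer identifiers); the flag, width, digits, and format arguments are the same ones used above:

```r
ids <- c(7L, 42L, 365L)   # hypothetical integer IDs

# flag = "0" pads with leading zeros up to the requested width
padded <- formatC(ids, width = 5, flag = "0")
padded
## [1] "00007" "00042" "00365"

# flag = "-" left-justifies within the width instead
left <- formatC(42L, width = 5, flag = "-")
left
## [1] "42   "

# format = "f" with digits gives a fixed number of decimal places
fixed <- formatC(pi, digits = 2, format = "f")
fixed
## [1] "3.14"
```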
R also provides slightly more general C-style formatting with the function sprintf. This works in much the same way as sprintf in C and in the many other languages that have borrowed it: the first argument contains placeholders for string or number variables, and further arguments are substituted into those placeholders. Just remember that most numbers in R are floating-point values rather than integers.
The first argument to sprintf specifies a formatting string, with placeholders for other values. For example, %s denotes another string, %f and %e denote a floating-point number in fixed or scientific format, respectively, and %d represents an integer. Additional arguments specify the values to replace the placeholders. As with the paste function, shorter inputs are recycled to match the longest input:
sprintf("%s %d = %f", "Euler's constant to the power", pow, powers_of_e)
## [1] "Euler's constant to the power 1 = 2.718282"
## [2] "Euler's constant to the power 2 = 7.389056"
## [3] "Euler's constant to the power 3 = 20.085537"
sprintf("To three decimal places, e ^ %d = %.3f", pow, powers_of_e)
## [1] "To three decimal places, e ^ 1 = 2.718"
## [2] "To three decimal places, e ^ 2 = 7.389"
## [3] "To three decimal places, e ^ 3 = 20.086"
sprintf("In scientific notation, e ^ %d = %e", pow, powers_of_e)
## [1] "In scientific notation, e ^ 1 = 2.718282e+00"
## [2] "In scientific notation, e ^ 2 = 7.389056e+00"
## [3] "In scientific notation, e ^ 3 = 2.008554e+01"
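A few further placeholder variations are sketched below with invented values. The width, zero-padding, left-justification, and %% (a literal percent sign) specifiers are standard C-style syntax. The last example illustrates the warning about floating-point values: %d refuses a number that is not a whole number.

```r
# Total width of 5, padded with zeros, one decimal place
zero_padded <- sprintf("%05.1f", 3.14159)
zero_padded
## [1] "003.1"

# %% prints a literal percent sign; %3d right-justifies in 3 characters
percent <- sprintf("%3d%%", 50L)
percent
## [1] " 50%"

# A minus sign left-justifies within the width
left <- sprintf("%-8s|", "left")
left
## [1] "left    |"

# %d accepts whole numbers only; 2.5 triggers an error
bad <- try(sprintf("%d", 2.5), silent = TRUE)
inherits(bad, "try-error")
## [1] TRUE
```

If a value might not be a whole number, either round and convert it with as.integer first, or use %f or %g instead of %d.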
Alternative syntaxes for formatting numbers are provided with the format and prettyNum functions. format just provides a slightly different syntax for formatting strings, and has similar usage to formatC. prettyNum, on the other hand, is best for pretty formatting of very big or very small numbers:
format(powers_of_e)
## [1] " 2.718" " 7.389" "20.086"
format(powers_of_e, digits = 3)                # at least 3 sig figs
## [1] " 2.72" " 7.39" "20.09"
format(powers_of_e, digits = 3, trim = TRUE)   # remove leading spaces
## [1] "2.72" "7.39" "20.09"
format(powers_of_e, digits = 3, scientific = TRUE)
## [1] "2.72e+00" "7.39e+00" "2.01e+01"
prettyNum(
  c(1e10, 1e-20),
  big.mark = ",",
  small.mark = " ",
  preserve.width = "individual",
  scientific = FALSE
)
## [1] "10,000,000,000" "0.00000 00000 00000 00001"
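For everyday use, the most common of these options is probably big.mark. The sketch below uses a made-up number (population) to show that format and prettyNum agree in the simple case, and uses format's nsmall argument to guarantee a minimum number of decimal places:

```r
population <- 1234567   # hypothetical value

# Thousands separators with format and with prettyNum
with_commas <- format(population, big.mark = ",")
with_commas
## [1] "1,234,567"
also_commas <- prettyNum(population, big.mark = ",")
also_commas
## [1] "1,234,567"

# nsmall sets the minimum number of digits after the decimal point
two_places <- format(0.5, nsmall = 2)
two_places
## [1] "0.50"
```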
There are some special characters that can be included in strings. For example, we can insert a tab character using \t. In the following example, we use cat rather than print, since print performs an extra conversion to turn \t from a tab character into a backslash and a “t.” The argument fill = TRUE makes cat move the cursor to a new line after it is finished:
cat("foo\tbar", fill = TRUE)
## foo	bar
Moving the cursor to a new line is done by printing a newline character, \n (this is true on all platforms; don’t try to use \r or \r\n for printing newlines to the R command line, since \r will just move the cursor to the start of the current line and overwrite what you have written):
cat("foo\nbar", fill = TRUE)
## foo
## bar
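It can help to see that an escape sequence such as \t is stored as a single character, even though it is typed (and printed by print) as two. The sketch below uses illustrative variable names; nchar, capture.output, and writeLines are base R functions. It also shows the related escapes for a double quote and for a literal backslash, which are written \" and \\ respectively:

```r
tabbed <- "foo\tbar"

# The tab counts as one character: 3 + 1 + 3
nchar(tabbed)
## [1] 7

# print shows the escaped form; cat and writeLines show the real tab
printed <- capture.output(print(tabbed))
printed
## [1] "[1] \"foo\\tbar\""
writeLines(tabbed)
## foo	bar

# Escaped double quotes and backslashes are also single characters
quoted <- "She said \"hello\""
nchar(quoted)
## [1] 16
backslash <- "C:\\temp"
nchar(backslash)
## [1] 7
writeLines(backslash)
## C:\temp
```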