Chapter 6. Working with Strings

In the previous chapter, you learned many built-in functions in several categories to work with basic objects. You learned how to access object classes, types, and dimensions; how to do logical, math, and basic statistical calculations; and how to perform simple analytic tasks such as root solving. These functions are the building blocks of our solution to specific problems.

String-related functions form another important category, and this chapter introduces them. In R, text is stored in character vectors, and many functions and techniques are available to manipulate and analyze it. In this chapter, you will learn the basics and useful techniques of working with strings, including the following topics:

  • Basic manipulation of character vectors
  • Converting between date/time objects and their string representations
  • Using regular expressions to extract information from texts

Getting started with strings

Character vectors in R are used to store text data. You previously learned that in contrast with many other programming languages, a character vector is not a vector of single characters, letters, or alphabet symbols such as a, b, c. Rather, it is a vector of strings.

R also provides a variety of built-in functions to deal with character vectors. Many of them also perform vectorized operations so they can process numerous string values in one step.

In this section, you will learn more about printing, combining, and transforming texts stored in character vectors.

Printing texts

Perhaps the most basic thing we can do with texts is to view them. R provides several ways to view texts in the console.

The simplest way is to directly type the string in quotation marks:

"Hello"
## [1] "Hello"

Like a numeric vector of floating-point numbers, a character vector is a vector of character values, or strings. Here, "Hello" is in the first position and is the only element of the character vector we just created.

We can also print a string value stored in a variable by simply evaluating it:

str1 <- "Hello" 
str1
## [1] "Hello"

However, simply writing a character value in a loop does not print it iteratively. It does not print anything at all:

for (i in 1:3) {
  "Hello"
}

That's because R automatically prints the value of an expression only when it is evaluated at the top level, such as when it is typed in the console. A for loop does not print the values of the expressions in its body, and the loop itself returns no visible value. This behavior also explains the difference between the printing behaviors when the following two functions are called, respectively:

test1 <- function(x) {
  "Hello"
  x
}
test1("World")
## [1] "World"

In the preceding output, test1 does not print Hello, but it prints World. This is because test1("World") returns the value of its last expression, x, which is "World". That value becomes the value of the function call, and R automatically prints it. Let's assume we remove x from the function as follows:

test2 <- function(x) {
  "Hello" 
}
test2("World")
## [1] "Hello"

Then, test2 always returns Hello, no matter what value x takes. As a result, R automatically prints the value of expression test2("World"), that is, Hello.

If we want to explicitly print an object, we should use print():

print(str1)
## [1] "Hello"

Then, the character vector is printed with a position [1]. This works in a loop too:

for (i in 1:3) {
  print(str1) 
}
## [1] "Hello" 
## [1] "Hello" 
## [1] "Hello"

It also works in a function:

test3 <- function(x) {
  print("Hello")
  x
}
test3("World")
## [1] "Hello"
## [1] "World"

In some cases, we want the texts to appear as a message rather than a character vector with indices. In such cases, we can call cat() or message():

cat("Hello")
## Hello

We can construct the message in a more flexible way:

name <- "Ken"
language <- "R"
cat("Hello,", name, "- a user of", language)
## Hello, Ken - a user of R

We change the input to print a more formal sentence:

cat("Hello, ", name, ", a user of ", language, ".")
## Hello, Ken , a user of R .

The concatenated string contains unnecessary spaces between the arguments. This is because cat() uses a space by default as the separator between the input strings. We can change it by specifying the sep= argument. In the following example, we will avoid the default space separator and manually write spaces in the input strings to create a correct version:

cat("Hello, ", name, ", a user of ", language, ".", sep = "")
## Hello, Ken, a user of R.

An alternative function is message(), which is typically used for diagnostic output such as notifying the user of an important event. Its output goes to the standard error stream rather than standard output, and some consoles display it more conspicuously. It also differs from cat() in that it does not insert space separators between the input strings:

message("Hello, ", name, ", a user of ", language, ".")
## Hello, Ken, a user of R.

Using message(), we need to write the separators manually in order to show the same text as earlier.
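Because message() signals a condition rather than writing ordinary output, its text can be silenced without affecting cat(). The following is a minimal sketch of that difference using the base R function suppressMessages(); the helper name greet is hypothetical and only serves the illustration:

```r
# greet() is a hypothetical helper used only for this illustration
greet <- function(name) {
  message("Preparing a greeting for ", name, "...")  # diagnostic text
  cat("Hello, ", name, ".\n", sep = "")              # regular output
  invisible(name)
}

greet("Ken")                    # shows both the message and the cat() text
suppressMessages(greet("Ken"))  # shows only the cat() text
```

This is one reason to prefer message() for progress or status notes: callers can switch them off without losing the actual output.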

Another difference in the behavior between cat() and message() is that message() automatically ends the text with a new line while cat() does not.

The following two examples demonstrate the difference. We want to print the same contents but get different results:

for (i in 1:3) {
  cat(letters[[i]]) 
}
## abc
for (i in 1:3) {
  message(letters[[i]]) 
}
## a
## b
## c

Each time cat() is called, it prints the input string without appending a new line, so the three letters appear on the same line. By contrast, each time message() is called, it appends a new line to the input string, so the three letters appear on three lines. To print each letter on a new line using cat(), we need to explicitly add a new line character to the input. The following code prints exactly the same contents as message() did in the previous example:

for (i in 1:3) {
  cat(letters[[i]], "\n", sep = "")
}
## a 
## b 
## c
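
Since cat() accepts a whole vector, the loop can often be avoided. A brief sketch, assuming we only need one element per line:

```r
# Print the first three letters, separated by new lines
cat(letters[1:3], sep = "\n")
cat("\n")  # cat() does not add a trailing new line by itself
```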

Concatenating strings

In practice, we often need to concatenate several strings to build a new one. The paste() function is used to concatenate several character vectors together. This function also uses space as the default separator:

paste("Hello", "world")
## [1] "Hello world"
paste("Hello", "world", sep = "-")
## [1] "Hello-world"

If we don't want the separator, we can set sep="" or alternatively call paste0():

paste0("Hello", "world")
## [1] "Helloworld"

You may wonder how paste() differs from cat(), since both are capable of concatenating strings. The difference is that cat() only prints the string to the console, while paste() returns the string for further use. The following code demonstrates that cat() prints the concatenated string but returns NULL:

value1 <- cat("Hello", "world")
## Hello world
value1
## NULL

In other words, cat() only prints strings, but paste() creates a new character vector.

The previous examples show the behavior of paste() working with single-element character vectors. What about working with multi-element ones? Let's see how this is done:

paste(c("A", "B"), c("C", "D"))
## [1] "A C" "B D"

We can see that paste() works element-wise, that is, paste("A", "C") first, then paste("B", "D"), and finally, the results are collected to build a character vector of two elements.

If we want the results to be put together in one string, we can specify how these two elements are again concatenated by setting collapse=:

paste(c("A", "B"), c("C", "D"), collapse = ", ")
## [1] "A C, B D"

If we want to put them in two lines, we can set collapse to be \n (new line):

result <- paste(c("A", "B"), c("C", "D"), collapse = "\n")
result
## [1] "A C\nB D"

The new character vector result is a two-line string, but its printed representation is still written in one line. The new line is represented by \n as we specified. To view the text we created, we need to call cat():

cat(result)
## A C
## B D

Now, the two-lined string is printed to the console in its intended format. The same thing also works with paste0().
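
As a small sketch of two common patterns built on the same arguments (the vector fruits is made up for illustration), collapse= can join the elements of a single vector, and shorter arguments are recycled element-wise:

```r
fruits <- c("apple", "banana", "cherry")

# collapse= joins the elements of one vector into a single string
paste(fruits, collapse = ", ")
## [1] "apple, banana, cherry"

# Shorter arguments are recycled to the length of the longest one
paste0("item", 1:3)
## [1] "item1" "item2" "item3"
```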

Transforming texts

Turning texts into another form is useful in many cases. It is easy to perform a number of basic types of transformation on texts.

Changing cases

When we process text data, the input may not follow the standard we expect. For example, we expect all products to be graded in capital letters, from A to F, but the actual input may contain these letters in both uppercase and lowercase. Changing cases is useful to ensure that the input strings are consistent in case.

The tolower() function changes the texts to lowercase letters, while toupper() does the opposite:

tolower("Hello")
## [1] "hello"
toupper("Hello")
## [1] "HELLO"

This is particularly useful when a function accepts character input. For example, we can define a function that returns x + y when type is add and x * y when type is times, however the letters of type are capitalized. The best way to do this is to always convert type to lowercase or uppercase, no matter what the input value is:

calc <- function(type, x, y) {
  type <- tolower(type)
  if (type == "add") {
    x + y
  } else if (type == "times") {
    x * y
  } else {
    stop("Not supported type of command")
  }
}
c(calc("add", 2, 3), calc("Add", 2, 3), calc("TIMES", 2, 3))
## [1] 5 5 6

This makes type case-insensitive, so inputs that differ only in case are treated the same.
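
The same idea can be written more compactly with switch(). This is only an alternative sketch, not a replacement for the version above; the name calc2 is hypothetical:

```r
# calc2() is a hypothetical variant of calc() based on switch()
calc2 <- function(type, x, y) {
  switch(tolower(type),
    add = x + y,
    times = x * y,
    stop("Not supported type of command"))  # unnamed value is the default
}
c(calc2("add", 2, 3), calc2("Add", 2, 3), calc2("TIMES", 2, 3))
## [1] 5 5 6
```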

In addition, both functions are vectorized, that is, they change the case of each string element of the given character vector:

toupper(c("Hello", "world"))
## [1] "HELLO" "WORLD"
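
Returning to the product grades mentioned earlier, a minimal sketch of normalizing and then validating such input could look as follows (the grades vector is invented for illustration):

```r
grades <- c("a", "B", "c", "G")

# Normalize to uppercase first, then check against the allowed grades A to F
clean <- toupper(grades)
clean
## [1] "A" "B" "C" "G"
clean %in% LETTERS[1:6]
## [1]  TRUE  TRUE  TRUE FALSE
```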

Counting characters

Another useful function is nchar(), which simply counts the number of characters of each element of a character vector:

nchar("Hello")
## [1] 5

Like toupper() and tolower(), nchar() is also vectorized:

nchar(c("Hello", "R", "User"))
## [1] 5 1 4

This function is often used to check whether an argument is a valid string. For example, the following function takes some personal information of a student and stores it in the database:

store_student <- function(name, age) {
  stopifnot(length(name) == 1, nchar(name) >= 2,
    is.numeric(age), age > 0) 
  # store the information in the database 
}

Before storing the information in the database, the function uses stopifnot() to check whether name and age are given valid values. If the user does not provide a meaningful name (here, one with at least two characters), the function stops with an error:

store_student("James", 20)
store_student("P", 23)
## Error: nchar(name) >= 2 is not TRUE

Note that nchar(x) == 0 is equivalent to x == "". To check against an empty string, both methods work.
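
The following short sketch compares both checks on a small, made-up vector and also shows nzchar(), a base R function that tests the opposite condition:

```r
x <- c("", "R", "Hello")
nchar(x) == 0
## [1]  TRUE FALSE FALSE
x == ""
## [1]  TRUE FALSE FALSE

# nzchar() returns TRUE for strings with at least one character
nzchar(x)
## [1] FALSE  TRUE  TRUE
```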

Trimming leading and trailing whitespaces

In the previous example, we used nchar() to check whether name is valid. However, sometimes the input data comes with useless whitespace. This adds noise to the data and requires more careful checking of string arguments. For example, store_student() in the previous section accepts a name such as " P", which is as invalid as a plain "P", because nchar(" P") returns 2:

store_student(" P", 23)

To take this possibility into account, we need to refine store_student(). R 3.2.0 introduced trimws() to trim leading and/or trailing whitespace from given strings:

store_student2 <- function(name, age) {
  stopifnot(length(name) == 1, nchar(trimws(name)) >= 2,
    is.numeric(age), age > 0) 
  # store the information in the database 
}

Now, the function is more robust to noisy data:

store_student2(" P", 23)
## Error: nchar(trimws(name)) >= 2 is not TRUE

By default, the function trims both leading and trailing whitespace, such as spaces and tabs. You can set which = "left" or which = "right" to trim only one side of the strings:

trimws(c(" Hello", "World "), which = "left")
## [1] "Hello" "World "
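
For comparison, here is a brief sketch of the default and of which = "right" on a made-up input:

```r
x <- c(" Hello ", "World ")

trimws(x)                   # which = "both" is the default
## [1] "Hello" "World"
trimws(x, which = "right")  # leading spaces are kept
## [1] " Hello" "World"
```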

Substring

In previous chapters, you learned how to subset vectors and lists. We can also subset the texts in a character vector by calling substr(). Suppose we have several dates in the following form:

dates <- c("Jan 3", "Feb 10", "Nov 15")

All the months are represented by three-letter abbreviations. We can use substr() to extract the months:

substr(dates, 1, 3)
## [1] "Jan" "Feb" "Nov"

To extract the day, we need to use substr() and nchar() together:

substr(dates, 5, nchar(dates))
## [1] "3" "10" "15"

Now that we can extract both months and days in the input strings, it is useful to write a function to transform the strings in such format to numeric values to represent the same date. The following function uses many functions and ideas you learned previously:

get_month_day <- function(x) {
  months <- vapply(substr(tolower(x), 1, 3), function(md) {
    switch(md, jan = 1, feb = 2, mar = 3, apr = 4, may = 5, jun = 6,
      jul = 7, aug = 8, sep = 9, oct = 10, nov = 11, dec = 12)
  }, numeric(1), USE.NAMES = FALSE)
  days <- as.numeric(substr(x, 5, nchar(x)))
  data.frame(month = months, day = days)
}
get_month_day(dates)
##   month day
## 1     1   3
## 2     2  10
## 3    11  15
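
As an alternative sketch, the month lookup can also be done with match() against the built-in constant month.abb instead of switch(). This assumes English three-letter abbreviations; the name get_month_day2 is hypothetical:

```r
dates <- c("Jan 3", "Feb 10", "Nov 15")

# get_month_day2() is a hypothetical variant that looks up the month
# position in the built-in vector month.abb ("Jan", "Feb", ..., "Dec")
get_month_day2 <- function(x) {
  months <- match(tolower(substr(x, 1, 3)), tolower(month.abb))
  days <- as.numeric(substr(x, 5, nchar(x)))
  data.frame(month = months, day = days)
}
get_month_day2(dates)
##   month day
## 1     1   3
## 2     2  10
## 3    11  15
```

An unknown abbreviation yields NA here rather than an error, which may or may not be what you want.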

The substr() function also has a counterpart function to replace the substrings with a given character vector:

substr(dates, 1, 3) <- c("Feb", "Dec", "Mar")
dates
## [1] "Feb 3" "Dec 10" "Mar 15"

Splitting texts

In many cases, the lengths of the string parts to extract are not fixed. For example, person names such as "Mary Johnson" or "Jack Smiths" have no fixed lengths for the first and last names, so it is difficult to separate them with substr() as in the previous section. Such texts usually have a regular separator such as a space or a comma. To extract the useful parts, we need to split the texts and make each part accessible. The strsplit() function splits texts by a separator given as a character vector:

strsplit("a,bb,ccc", split = ",")
## [[1]] 
## [1] "a" "bb" "ccc"

The function returns a list. Each element in the list is a character vector produced from splitting that element in the original character vector. It is because strsplit(), like all previous string functions we have introduced, is also vectorized, that is, it returns a list of character vectors as a result of the splitting:

students <- strsplit(c("Tony, 26, Physics", "James, 25, Economics"),
  split = ", ")
students
## [[1]] 
## [1] "Tony" "26" "Physics" 
## 
## [[2]] 
## [1] "James" "25" "Economics"

The strsplit() function returns a list of character vectors containing split parts by working element-wise. In practice, splitting is only the first step to extract or reorganize data. To continue, we can use rbind to put the data into a matrix and give appropriate names to the columns:

students_matrix <- do.call(rbind, students)
colnames(students_matrix) <- c("name", "age", "major")
students_matrix
##      name    age  major
## [1,] "Tony"  "26" "Physics"
## [2,] "James" "25" "Economics"

Then, we will convert the matrix to a data frame so that we can transform each column to more proper types:

students_df <- data.frame(students_matrix, stringsAsFactors = FALSE)
students_df$age <- as.numeric(students_df$age)
students_df
##    name age     major
## 1  Tony  26   Physics
## 2 James  25 Economics

Now, the raw string input has been transformed into a more organized and more useful data frame, students_df.
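
The person names mentioned at the start of this section can be handled in the same spirit. The following is a minimal sketch, assuming each name consists of exactly two parts separated by a single space:

```r
people <- c("Mary Johnson", "Jack Smiths")
parts <- strsplit(people, split = " ", fixed = TRUE)

# Pick the first and second piece of each split result
first_names <- vapply(parts, function(p) p[1], character(1))
last_names <- vapply(parts, function(p) p[2], character(1))
first_names
## [1] "Mary" "Jack"
last_names
## [1] "Johnson" "Smiths"
```

Setting fixed = TRUE makes strsplit() treat the separator as a literal string rather than a pattern.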

One small trick to split the whole string into single characters is to use an empty split argument:

strsplit(c("hello", "world"), split = "")
## [[1]] 
## [1] "h" "e" "l" "l" "o" 
## 
## [[2]] 
## [1] "w" "o" "r" "l" "d"
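
Combined with paste() and its collapse= argument, this trick gives a simple way to work on a string character by character. As an illustrative sketch, the following reverses a string:

```r
# Split into single characters, reverse them, and glue them back together
chars <- strsplit("hello", split = "")[[1]]
paste(rev(chars), collapse = "")
## [1] "olleh"
```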

In fact, strsplit() is more powerful than is shown. It also supports regular expressions, a very powerful framework to process text data. We will cover this topic in the last section of this chapter.

Formatting texts

Concatenating texts with paste() is sometimes not a good idea because the text has to be broken into pieces and it becomes harder to read as the format gets longer.

For example, let's assume we need to print each record in students_df in the following format:

#1, name: Tony, age: 26, major: Physics

In this case, using paste() will be a pain:

cat(paste("#", 1:nrow(students_df), ", name: ", students_df$name,
  ", age: ", students_df$age, ", major: ", students_df$major, sep = ""),
  sep = "\n")
## #1, name: Tony, age: 26, major: Physics
## #2, name: James, age: 25, major: Economics

The code looks messy, and it is hard to get the general template at first glance. By contrast, sprintf() supports a formatting template and solves the problem in a nice way:

cat(sprintf("#%d, name: %s, age: %d, major: %s",
  1:nrow(students_df), students_df$name, students_df$age,
  students_df$major), sep = "\n")
## #1, name: Tony, age: 26, major: Physics
## #2, name: James, age: 25, major: Economics

In the preceding code, "#%d, name: %s, age: %d, major: %s" is the formatting template, in which %d and %s are placeholders for the input arguments. The sprintf() function is easy to use here because the template string stays in one piece, and each value to substitute is supplied as a function argument. This function uses C-style formatting rules, as described in detail at https://en.wikipedia.org/wiki/Printf_format_string.

In the preceding example, %s stands for a string and %d for a decimal integer. Moreover, sprintf() is also very flexible in formatting numeric values using %f. For example, %.1f rounds the number to one decimal place:

sprintf("The length of the line is approximately %.1fmm", 12.295)
## [1] "The length of the line is approximately 12.3mm"
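
One detail worth checking when using %d is that the value must be a whole number. The following sketch shows what happens otherwise; tryCatch() is used here only to turn the error into NA for display:

```r
sprintf("%d", 26L)  # an integer
## [1] "26"
sprintf("%d", 26)   # a double holding a whole number also works
## [1] "26"

# A fractional double is rejected by %d
tryCatch(sprintf("%d", 26.5), error = function(e) NA_character_)
## [1] NA

# Convert explicitly (truncating), or use %f or %s for such values
sprintf("%d", as.integer(26.5))
## [1] "26"
```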

Different types of values have their own formatting syntax. The following table shows the most commonly used forms:

Format                      Output
sprintf("%s", "A")          A
sprintf("%d", 10)           10
sprintf("%04d", 10)         0010
sprintf("%f", pi)           3.141593
sprintf("%.2f", pi)         3.14
sprintf("%1.0f", pi)        3
sprintf("%8.2f", pi)        3.14 (left-padded with spaces to width 8)
sprintf("%08.2f", pi)       00003.14
sprintf("%+f", pi)          +3.141593
sprintf("%e", pi)           3.141593e+00
sprintf("%E", pi)           3.141593E+00

Note

The official documentation (https://stat.ethz.ch/R-manual/R-devel/library/base/html/sprintf.html) gives a full description of the supported formats.

Note that % in the format text is a special character and will be interpreted as the initial character of a placeholder. What if we really mean % in the string? To avoid formatting interpretation, we need to use %% to represent a literal %. The following code is an example:

sprintf("The ratio is %d%%", 10)
## [1] "The ratio is 10%"
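
Like the other string functions in this chapter, sprintf() is vectorized, so width specifiers can be used to line up several records at once. A small sketch with made-up data:

```r
items <- c("apple", "kiwi")
prices <- c(1.5, 12.75)

# %-8s left-aligns the name in 8 characters; %6.2f right-aligns the
# price in 6 characters with 2 decimal places
rows <- sprintf("%-8s|%6.2f", items, prices)
cat(rows, sep = "\n")
## apple   |  1.50
## kiwi    | 12.75
```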

Using Python string functions in R

The sprintf() function is powerful but not perfect for all use cases. For example, if some parts have to appear multiple times in the template, you will need to write the same arguments multiple times. This often makes the code more redundant and a bit hard to modify:

sprintf("%s, %d years old, majors in %s and loves %s.", "James", 25, "Physics", "Physics")
## [1] "James, 25 years old, majors in Physics and loves Physics."

There are other ways to represent the placeholders. The pystr package provides the pystr_format() function to format strings in Python formatting style using either numeric or named placeholders. The preceding example can be rewritten with this function in two ways:

One is using numeric placeholders:

# install.packages("pystr")
library(pystr)
pystr_format("{1}, {2} years old, majors in {3} and loves {3}.",
  "James", 25, "Physics")
## [1] "James, 25 years old, majors in Physics and loves Physics."

The other is using named placeholders:

pystr_format("{name}, {age} years old, majors in {major} and loves {major}.", 
name = "James", age = 25, major = "Physics")
## [1] "James, 25 years old, majors in Physics and loves Physics."

In both cases, no argument has to repeat, and the position the input appears at can be easily moved to other places in the template string.
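
If adding a package is not an option, base R's sprintf() can also refer to arguments by position using the %n$ form described in its documentation. The following is only a sketch of that alternative; the template is harder to read than the named placeholders above:

```r
# %1$s, %2$d and %3$s refer to the first, second and third argument,
# so "Physics" only has to be supplied once
sprintf("%1$s, %2$d years old, majors in %3$s and loves %3$s.",
  "James", 25L, "Physics")
## [1] "James, 25 years old, majors in Physics and loves Physics."
```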
