Chapter 13. Manipulating Strings

Strings (character data) often need to be constructed or deconstructed to identify observations, preprocess text, combine information or satisfy any number of other needs. R offers functions for building strings, like paste and sprintf. It also provides a number of functions for using regular expressions and examining text data, although for those purposes it is better to use Hadley Wickham’s stringr package.

13.1. paste

The first function new R users reach for when putting together strings is paste. This function takes a series of strings, or expressions that evaluate to strings, and puts them together into one string. We start by putting together three simple strings.

> paste("Hello", "Jared", "and others")

[1] "Hello Jared and others"

Notice that spaces were put between the strings. This is because paste has an argument, sep, that determines what to put between entries; its default is a single space. It can be set to any valid text, including empty text ("").

> paste("Hello", "Jared", "and others", sep = "/")

[1] "Hello/Jared/and others"

Like many functions in R, paste is vectorized. This means each element can be a vector of data to be put together.

> paste(c("Hello", "Hey", "Howdy"), c("Jared", "Bob", "David"))

[1] "Hello Jared" "Hey Bob"     "Howdy David"

In this case each vector had the same number of entries so they paired one-to-one. When the vectors do not have the same length they are recycled.

> paste("Hello", c("Jared", "Bob", "David"))

[1] "Hello Jared" "Hello Bob"   "Hello David"

> paste("Hello", c("Jared", "Bob", "David"), c("Goodbye", "Seeya"))

[1] "Hello Jared Goodbye" "Hello Bob Seeya"     "Hello David Goodbye"

paste can also collapse a vector of text into a single string containing all the elements, joined by any arbitrary separator, using the collapse argument.

> vectorOfText <- c("Hello", "Everyone", "out there", ".")
> paste(vectorOfText, collapse = " ")

[1] "Hello Everyone out there ."

> paste(vectorOfText, collapse = "*")

[1] "Hello*Everyone*out there*."
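The sep and collapse arguments can be combined, and paste0 is a base R shortcut for an empty separator. The following is a minimal sketch; the variable names (obsLabels, obsIds, formulaText) are hypothetical and used only for illustration.

```r
# paste0 is paste with sep = "" built in
obsLabels <- paste0("obs", 1:3)

# sep goes between the arguments, element by element
obsIds <- paste("obs", 1:3, sep = "_")

# sep is applied first, then collapse joins the results into one string
formulaText <- paste(c("a", "b", "c"), 1:3, sep = "-", collapse = " + ")

obsLabels
obsIds
formulaText
```

The distinction to keep in mind is that sep works across arguments while collapse works along the resulting vector.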

13.2. sprintf

While paste is convenient for putting together short bits of text, it can become unwieldy when piecing together long pieces of text, such as when inserting a number of variables into a long piece of text. For instance, we might have a lengthy sentence that has a few spots that require the insertion of special variables. An example is “Hello Jared, your party of eight will be seated in 25 minutes” where “Jared,” “eight” and “25” could be replaced with other information.

Reforming this with paste can make reading the line in code difficult.

To start, we make some variables to hold the information.

> person <- "Jared"
> partySize <- "eight"
> waitTime <- 25

Now we build the paste expression.

> paste("Hello ", person, ", your party of ", partySize,
+       " will be seated in ", waitTime, " minutes.", sep="")

[1] "Hello Jared, your party of eight will be seated in 25 minutes."

Making even a small change to this sentence would require putting the commas in just the right places.

A good alternative is the sprintf function. With this function we build one long string with special markers indicating where to insert values.

> sprintf("Hello %s, your party of %s will be seated in %s minutes",
+     person, partySize, waitTime)

[1] "Hello Jared, your party of eight will be seated in 25 minutes"

Here, each %s was replaced with its corresponding variable. While the long sentence is easier to read in code, we must maintain the order of %s’s and variables.

sprintf is also vectorized. Note that the vector lengths must be multiples of each other.

> sprintf("Hello %s, your party of %s will be seated in %s minutes",
+     c("Jared", "Bob"), c("eight", 16, "four", 10), waitTime)

[1] "Hello Jared, your party of eight will be seated in 25 minutes"
[2] "Hello Bob, your party of 16 will be seated in 25 minutes"
[3] "Hello Jared, your party of four will be seated in 25 minutes"
[4] "Hello Bob, your party of 10 will be seated in 25 minutes"
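The %s marker is only one of several that sprintf understands. As a brief sketch, %d inserts an integer, %.1f a number with one decimal place, %05d an integer padded with zeros to five characters, and %% a literal percent sign. The variable names below are hypothetical.

```r
# %d expects an integer and %.1f formats a number with one decimal place
partyMessage <- sprintf("Hello %s, your party of %d will be seated in %.1f minutes",
                        "Jared", 8L, 25)

# %05d pads an integer with leading zeros to a width of five
ticketNumber <- sprintf("%05d", 42L)

# %% produces a literal percent sign
discountText <- sprintf("%d%% off", 10L)

partyMessage
ticketNumber
discountText
```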

13.3. Extracting Text

Often text needs to be ripped apart to be made useful, and while R has a number of functions for doing so, the stringr package is much easier to use.

First we need some data, so we use the XML package to download a table of United States presidents from the Library of Congress website.

> require(XML)

Then we use readHTMLTable to parse the table.

> load("data/presidents.rdata")

> theURL <- "http://www.loc.gov/rr/print/list/057_chron.html"
> presidents <- readHTMLTable(theURL, which=3, as.data.frame=TRUE,
+                             skip.rows=1, header=TRUE,
+                             stringsAsFactors=FALSE)

Now we take a look at the data.

> head(presidents)

       YEAR         PRESIDENT
1 1789-1797 George Washington
2 1797-1801        John Adams
3 1801-1805  Thomas Jefferson
4 1805-1809  Thomas Jefferson
5 1809-1812     James Madison
6 1812-1813     James Madison

                                      FIRST LADY   VICE PRESIDENT
1                              Martha Washington       John Adams
2                                  Abigail Adams Thomas Jefferson
3 Martha Wayles Skelton Jefferson    (no image)       Aaron Burr
4 Martha Wayles Skelton Jefferson    (no image)   George Clinton
5                                 Dolley Madison   George Clinton
6                                 Dolley Madison    office vacant

Examining it more closely, we see that the last few rows contain information we do not want, so we keep only the first 64 rows.

> tail(presidents$YEAR)

[1] "2001-2009"
[2] "2009-"
[3] "Presidents: Introduction (Rights/Ordering         Info.) | Adams
- Cleveland | Clinton - Harding Harrison       - Jefferson | Johnson
- McKinley | Monroe                         - Roosevelt | Taft - Truman |
Tyler                         - WilsonList of names, Alphabetically"
[4] "First Ladies: Introduction
(Rights/Ordering Info.) | Adams                   - Coolidge | Eisenhower
- HooverJackson                   - Pierce |
Polk - Wilson | List                 of names, Alphabetically"
[5] "Vice Presidents: Introduction (Rights/Ordering Info.) |
Adams - Coolidge | Curtis - Hobart Humphrey - Rockefeller | Roosevelt
- WilsonList of names, Alphabetically"
[6] "Top               of Page"

> presidents <- presidents[1:64, ]

To start, we create two new columns, one for the beginning of the term and one for the end of the term. To do this we need to split the YEAR column on the hyphen (-). The stringr package has the str_split function that splits a string based on some value. It returns a list with an element for each element of the input vector. Each of these elements has as many elements as necessary for the split, in this case either two (a start and stop year) or one (when the president served less than one year).

> require(stringr)
> # split the string
> yearList <- str_split(string = presidents$YEAR, pattern = "-")
> head(yearList)

[[1]]
[1] "1789" "1797"

[[2]]
[1] "1797" "1801"

[[3]]
[1] "1801" "1805"

[[4]]
[1] "1805" "1809"

[[5]]
[1] "1809" "1812"

[[6]]
[1] "1812" "1813"

> # combine them into one matrix
> yearMatrix <- data.frame(Reduce(rbind, yearList))
> head(yearMatrix)

    X1   X2
1 1789 1797
2 1797 1801
3 1801 1805
4 1805 1809
5 1809 1812
6 1812 1813

> # give the columns good names
> names(yearMatrix) <- c("Start", "Stop")
> # bind the new columns onto the data.frame
> presidents <- cbind(presidents, yearMatrix)
> # convert the start and stop columns into numeric
> presidents$Start <- as.numeric(as.character(presidents$Start))
> presidents$Stop <- as.numeric(as.character(presidents$Stop))
> # view the changes
> head(presidents)

       YEAR         PRESIDENT
1 1789-1797 George Washington
2 1797-1801        John Adams
3 1801-1805  Thomas Jefferson
4 1805-1809  Thomas Jefferson
5 1809-1812     James Madison
6 1812-1813     James Madison

                                      FIRST LADY    VICE PRESIDENT
1                              Martha Washington        John Adams
2                                  Abigail Adams  Thomas Jefferson
3 Martha Wayles Skelton Jefferson    (no image)        Aaron Burr
4 Martha Wayles Skelton Jefferson    (no image)    George Clinton
5                                 Dolley Madison    George Clinton
6                                 Dolley Madison     office vacant
   Start Stop
1   1789 1797
2   1797 1801
3   1801 1805
4   1805 1809
5   1809 1812
6   1812 1813

> tail(presidents)

        YEAR     PRESIDENT             FIRST LADY      VICE PRESIDENT
59 1977-1981  Jimmy Carter        Rosalynn Carter   Walter F. Mondale
60 1981-1989 Ronald Reagan           Nancy Reagan         George Bush
61 1989-1993   George Bush           Barbara Bush          Dan Quayle
62 1993-2001  Bill Clinton Hillary Rodham Clinton         Albert Gore
63 2001-2009 George W. Bush            Laura Bush      Richard Cheney
64     2009-   Barack Obama        Michelle Obama     Joseph R. Biden
   Start Stop
59  1977 1981
60  1981 1989
61  1989 1993
62  1993 2001
63  2001 2009
64  2009   NA

In the preceding example there was a quirk of R that can be frustrating at first pass. In order to convert the factor presidents$Start into a numeric, we first had to convert it into a character. That is because factors are simply labels on top of integers, as seen in Section 4.4.2. So when applying as.numeric to a factor, it is converted to the underlying integers.
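The factor pitfall is easy to reproduce on a small vector. This is a minimal sketch with hypothetical values; note also that since R 4.0.0 data.frame no longer converts strings to factors by default, so on a current installation the extra as.character call may simply be a harmless safeguard.

```r
# a factor whose labels look like years (hypothetical values)
startFactor <- factor(c("1789", "1797", "1801"))

# as.numeric alone returns the underlying integer codes, not the years
wrongYears <- as.numeric(startFactor)

# converting to character first recovers the actual values
rightYears <- as.numeric(as.character(startFactor))

wrongYears
rightYears
```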

Just like in Excel, it is possible to select specified characters from text using str_sub.

> # get the first 3 characters
> str_sub(string = presidents$PRESIDENT, start = 1, end = 3)

 [1] "Geo" "Joh" "Tho" "Tho" "Jam" "Jam" "Jam" "Jam" "Jam" "Joh" "And"
[12] "And" "Mar" "Wil" "Joh" "Jam" "Zac" "Mil" "Fra" "Fra" "Jam" "Abr"
[23] "Abr" "And" "Uly" "Uly" "Uly" "Rut" "Jam" "Che" "Gro" "Gro" "Ben"
[34] "Gro" "Wil" "Wil" "Wil" "The" "The" "Wil" "Wil" "Woo" "War" "Cal"
[45] "Cal" "Her" "Fra" "Fra" "Fra" "Har" "Har" "Dwi" "Joh" "Lyn" "Lyn"
[56] "Ric" "Ric" "Ger" "Jim" "Ron" "Geo" "Bil" "Geo" "Bar"

> # get the 4th through 8th characters
> str_sub(string = presidents$PRESIDENT, start = 4, end = 8)


 [1] "rge W" "n Ada" "mas J" "mas J" "es Ma" "es Ma" "es Ma" "es Ma"
 [9] "es Mo" "n Qui" "rew J" "rew J" "tin V" "liam " "n Tyl" "es K."
[17] "hary " "lard " "nklin" "nklin" "es Bu" "aham " "aham " "rew J"
[25] "sses " "sses " "sses " "herfo" "es A." "ster " "ver C" "ver C"
[33] "jamin" "ver C" "liam " "liam " "liam " "odore" "odore" "liam "
[41] "liam " "drow " "ren G" "vin C" "vin C" "bert " "nklin" "nklin"
[49] "nklin" "ry S." "ry S." "ght D" "n F. " "don B" "don B" "hard "
[57] "hard " "ald R" "my Ca" "ald R" "rge B" "l Cli" "rge W" "ack O"

This is useful for finding presidents whose terms started in a year ending in 1, which means they were elected in a year ending in 0; a preponderance of these presidents died in office.

> presidents[str_sub(string = presidents$Start, start = 4,
+ end = 4) == 1, c("YEAR", "PRESIDENT", "Start", "Stop")]

        YEAR              PRESIDENT Start  Stop
 3 1801-1805       Thomas Jefferson  1801  1805
14      1841 William Henry Harrison  1841  1841
15 1841-1845             John Tyler  1841  1845
22 1861-1865        Abraham Lincoln  1861  1865
29      1881      James A. Garfield  1881  1881
30 1881-1885      Chester A. Arthur  1881  1885
37      1901       William McKinley  1901  1901
38 1901-1905     Theodore Roosevelt  1901  1905
43 1921-1923      Warren G. Harding  1921  1923
48 1941-1945  Franklin D. Roosevelt  1941  1945
53 1961-1963        John F. Kennedy  1961  1963
60 1981-1989          Ronald Reagan  1981  1989
63 2001-2009         George W. Bush  2001  2009
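The same selection can be sketched without stringr. The example below uses base R's substr on a hypothetical vector of start years (startYears is not part of the presidents data) and checks the result against simple arithmetic on the numeric values.

```r
# hypothetical start years, stored as numbers
startYears <- c(1789, 1801, 1841, 1977, 2001)

# text approach: take the 4th character and compare it to "1"
lastDigit <- substr(as.character(startYears), start = 4, stop = 4)
endsInOneText <- lastDigit == "1"

# arithmetic approach: the remainder after dividing by 10
endsInOneMath <- startYears %% 10 == 1

startYears[endsInOneText]
identical(endsInOneText, endsInOneMath)
```

When the column is already numeric the arithmetic version avoids the conversion to text, while the character version also works on values that are not purely numeric.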

13.4. Regular Expressions

Sifting through text often requires searching for patterns, and usually these patterns have to be general and flexible. This is where regular expressions are very useful. We will not make an exhaustive lesson of regular expressions but will illustrate how to use them within R.

Let’s say we want to find any president with “John” in his name, either first or last. Since we do not know where in the name “John” would occur, we cannot simply use str_sub. Instead we use str_detect.

> # returns TRUE/FALSE if John was found in the name
> johnPos <- str_detect(string = presidents$PRESIDENT, pattern = "John")
> presidents[johnPos, c("YEAR", "PRESIDENT", "Start", "Stop")]

         YEAR             PRESIDENT Start Stop
 2    1797-1801          John Adams  1797 1801
10    1825-1829   John Quincy Adams  1825 1829
15    1841-1845          John Tyler  1841 1845
24    1865-1869      Andrew Johnson  1865 1869
53    1961-1963     John F. Kennedy  1961 1963
54    1963-1965   Lyndon B. Johnson  1963 1965
55    1963-1969   Lyndon B. Johnson  1963 1969

This found John Adams, John Quincy Adams, John Tyler, Andrew Johnson, John F. Kennedy and Lyndon B. Johnson. Note that regular expressions are case sensitive, so to ignore case we have to wrap the pattern in ignore.case.

> badSearch <- str_detect(presidents$PRESIDENT, "john")
> goodSearch <- str_detect(presidents$PRESIDENT, ignore.case("John"))
> sum(badSearch)

[1] 0

> sum(goodSearch)

[1] 7
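In more recent releases of stringr, ignore.case has been superseded by regex(pattern, ignore_case = TRUE), so the call above may need adjusting depending on the installed version. The sketch below shows that form alongside the base R equivalent, grepl with its ignore.case argument. The presidentNames vector is a hypothetical stand-in for presidents$PRESIDENT, and the stringr portion runs only if the package is installed.

```r
# hypothetical stand-in for presidents$PRESIDENT
presidentNames <- c("John Adams", "Andrew Johnson", "James Madison")

# base R: grepl has its own ignore.case argument
baseSearch <- grepl(pattern = "john", x = presidentNames, ignore.case = TRUE)

# stringr: wrap the pattern in regex() with ignore_case = TRUE
if (requireNamespace("stringr", quietly = TRUE)) {
    stringrSearch <- stringr::str_detect(
        string = presidentNames,
        pattern = stringr::regex("john", ignore_case = TRUE)
    )
    # both approaches should flag the same names
    print(identical(baseSearch, stringrSearch))
}

sum(baseSearch)
```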

To show off some more interesting regular expressions we will make use of another table, the list of United States wars from Wikipedia. Because we only care about one column, which has some encoding issues, we put an rdata file of just that one column at http://www.jaredlander.com/data/warTimes.rdata. We load that file using load and we then see a new object in our session named warTimes.

For some odd reason, loading rdata files from a URL is not as straightforward as reading in a CSV file from a URL. A connection must first be made using url, then that connection is loaded with load, and then the connection must be closed with close.

> con <- url("http://www.jaredlander.com/data/warTimes.rdata")
> load(con)
> close(con)

This vector holds the starting and stopping dates of the wars. Sometimes it has just years, sometimes it also includes months and possibly days. There are instances where it has only one year. Because of this, it is a good dataset to comb through with various text functions. The first few entries follow.

> head(warTimes, 10)

 [1] "September 1, 1774 ACAEA September 3, 1783"
 [2] "September 1, 1774 ACAEA March 17, 1776"
 [3] "1775ACAEA1783"
 [4] "June 1775 ACAEA October 1776"
 [5] "July 1776 ACAEA March 1777"
 [6] "June 14, 1777 ACAEA October 17, 1777"
 [7] "1777ACAEA1778"
 [8] "1775ACAEA1782"
 [9] "1776ACAEA1794"
[10] "1778ACAEA1782"

We want to extract the information for the start of each war. To get at this information we need to split each entry of warTimes. Thanks to Wikipedia’s encoding, the separator is generally “ACAEA,” which was originally “ â€Â′′” and converted to these characters to make life easier. There are two instances where the “-” appears, once as a separator and once to make a hyphenated word. This is seen in the following code.

> warTimes[str_detect(string = warTimes, pattern = "-")]

[1] "6 June 1944 ACAEA mid-July 1944"
[2] "25 August-17 December 1944"

So when we are splitting our string, we need to search for either “ACAEA” or “-”. In str_split the pattern argument can take a regular expression. In this case it will be “(ACAEA)|-”, which tells the engine to search for either “ACAEA” or (denoted by the vertical pipe) “-” in the string. To avoid the instance, seen before, where the hyphen is used in “mid-July” we set the argument n to 2 so it returns at most two pieces for each element of the input vector. The parentheses are not matched but rather act to group the characters “ACAEA” in the search (see note 1). This grouping capability will prove important for advanced replacement of text, which will be demonstrated later in this section.

1. To match literal parentheses, prefix them with a backslash (\), which must be written as two backslashes (\\) inside an R string.

> theTimes <- str_split(string = warTimes, pattern = "(ACAEA)|-", n = 2)
> head(theTimes)

[[1]]
[1] "September 1, 1774 " " September 3, 1783"

[[2]]
[1] "September 1, 1774 " " March 17, 1776"

[[3]]
[1] "1775" "1783"

[[4]]
[1] "June 1775 "    " October 1776"

[[5]]
[1] "July 1776 "  " March 1777"

[[6]]
[1] "June 14, 1777 " " October 17, 1777"

Seeing that this worked for the first few entries, we also check on the two instances where a hyphen was the separator.

> which(str_detect(string = warTimes, pattern = "-"))

[1] 147 150

> theTimes[[147]]

[1] "6 June 1944 " " mid-July 1944"

> theTimes[[150]]

[1] "25 August" "17 December 1944"

This looks correct, as the first entry shows “mid-July” still intact while the second entry shows the two dates split apart.

For our purposes we only care about the start date of the wars, so we need to build a function that extracts the first (in some cases only) element of each vector in the list.

> theStart <- sapply(theTimes, FUN = function(x) x[1])
> head(theStart)

[1] "September 1, 1774 " "September 1, 1774 " "1775"
[4] "June 1775 "         "July 1776 "         "June 14, 1777 "

The original text sometimes had spaces around the separators and sometimes did not, meaning that some of our text has trailing whitespace. The easiest way to get rid of it is with the str_trim function.

> theStart <- str_trim(theStart)
> head(theStart)

[1] "September 1, 1774" "September 1, 1774" "1775"
[4] "June 1775"         "July 1776"         "June 14, 1777"
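For comparison, the split, first-element and trim steps can be approximated in base R with a single substitution that deletes everything from the first separator onward. This is only a sketch on a hypothetical four-element sample (warSample), not a replacement for the stringr approach used in this section.

```r
# hypothetical sample covering the separator variations seen in warTimes
warSample <- c("September 1, 1774 ACAEA September 3, 1783",
               "1775ACAEA1783",
               "6 June 1944 ACAEA mid-July 1944",
               "25 August-17 December 1944")

# remove optional spaces, the first separator (ACAEA or -) and the rest
startSketch <- sub(pattern = "\\s*(ACAEA|-).*$", replacement = "",
                   x = warSample)

# guard against any remaining leading or trailing whitespace
startSketch <- trimws(startSketch)

startSketch
```

Because sub acts only on the first match, the hyphen in “mid-July” is never reached: the earlier “ACAEA” has already consumed the remainder of that entry.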

To extract the word “January” wherever it might occur, use str_extract. Elements where it is not found are returned as NA.

> # pull out 'January' anywhere it's found, otherwise return NA
> str_extract(string = theStart, pattern = "January")

  [1] NA        NA        NA        NA        NA        NA
  [7] NA        NA        NA        NA        NA        NA
 [13] "January" NA        NA        NA        NA        NA
 [19] NA        NA        NA        NA        NA        NA
 [25] NA        NA        NA        NA        NA        NA
 [31] NA        NA        NA        NA        NA        NA
 [37] NA        NA        NA        NA        NA        NA
 [43] NA        NA        NA        NA        NA        NA
 [49] NA        NA        NA        NA        NA        NA
 [55] NA        NA        NA        NA        NA        NA
 [61] NA        NA        NA        NA        NA        NA
 [67] NA        NA        NA        NA        NA        NA
 [73] NA        NA        NA        NA        NA        NA
 [79] NA        NA        NA        NA        NA        NA
 [85] NA        NA        NA        NA        NA        NA
 [91] NA        NA        NA        NA        NA        NA
 [97] NA        NA        "January" NA        NA        NA
[103] NA        NA        NA        NA        NA        NA
[109] NA        NA        NA        NA        NA        NA
[115] NA        NA        NA        NA        NA        NA
[121] NA        NA        NA        NA        NA        NA
[127] NA        NA        NA        NA        "January" NA
[133] NA        NA        "January" NA        NA        NA
[139] NA        NA        NA        NA        NA        NA
[145] "January" "January" NA        NA        NA        NA
[151] NA        NA        NA        NA        NA        NA
[157] NA        NA        NA        NA        NA        NA
[163] NA        NA        NA        NA        NA        NA
[169] "January" NA        NA        NA        NA        NA
[175] NA        NA        NA        NA        NA        NA
[181] "January" NA        NA        NA        NA        "January"
[187] NA        NA

To find elements that contain “January” and return the entire entry, not just “January”, use str_detect and subset theStart with the results.

> # just return elements where 'January' was detected
> theStart[str_detect(string = theStart, pattern = "January")]

[1] "January"          "January 21"       "January 1942"
[4] "January"          "January 22, 1944" "22 January 1944"
[7] "January 4, 1989"  "15 January 2002"  "January 14, 2010"

To extract the year, we search for an occurrence of four digits together. Because we do not know the specific digits, we have to use a pattern. In a regular expression search, “[0-9]” matches any single digit. We use “[0-9][0-9][0-9][0-9]” to search for four consecutive digits.

> # get incidents of 4 numeric digits in a row
> head(str_extract(string = theStart, "[0-9][0-9][0-9][0-9]"), 20)

 [1] "1774" "1774" "1775" "1775" "1776" "1777" "1777" "1775" "1776"
[10] "1778" "1775" "1779" NA     "1785" "1798" "1801" NA     "1812"
[19] "1812" "1813"

Writing “[0-9]” repeatedly is inefficient, especially when searching for many occurrences of a digit. Putting “4” in curly braces after “[0-9]” causes the engine to search for any run of four digits.

> # a smarter way to search for four numbers
> head(str_extract(string = theStart, "[0-9]{4}"), 20)

 [1] "1774" "1774" "1775" "1775" "1776" "1777" "1777" "1775" "1776"
[10] "1778" "1775" "1779" NA     "1785" "1798" "1801" NA     "1812"
[19] "1812" "1813"

Even writing “[0-9]” can be inefficient, so there is a shortcut to denote any digit. In most other languages the shortcut is “\d” but in an R string the backslash must itself be escaped, so it is written with two backslashes (“\\d”).

> # "\\d" is a shortcut for "[0-9]"
> head(str_extract(string = theStart, "\\d{4}"), 20)

 [1] "1774" "1774" "1775" "1775" "1776" "1777" "1777" "1775" "1776"
[10] "1778" "1775" "1779" NA     "1785" "1798" "1801" NA     "1812"
[19] "1812" "1813"
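To see that these patterns are equivalent without the full dataset, the sketch below applies them to a small hypothetical vector using base R. The helper extractFirst is not part of R or stringr; it is written here only to mimic the behavior of str_extract, returning NA where nothing matches.

```r
# hypothetical helper mimicking str_extract with base R tools
extractFirst <- function(x, pattern) {
    m <- regexpr(pattern, x)              # position of first match, -1 if none
    found <- as.integer(m) != -1L
    out <- rep(NA_character_, length(x))  # default to NA for non-matches
    out[found] <- regmatches(x, m)        # regmatches returns matches only
    out
}

sampleStart <- c("September 1, 1774", "January", "1775", "June 14, 1777")

classYears <- extractFirst(sampleStart, "[0-9]{4}")
shortcutYears <- extractFirst(sampleStart, "\\d{4}")
shortRuns <- extractFirst(sampleStart, "\\d{1,3}")

classYears
identical(classYears, shortcutYears)
shortRuns

# the two typed backslashes are stored as a single backslash
nchar("\\d")
```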

The curly braces offer even more functionality: for instance, searching for a number one to three times.

> # this looks for any digit that occurs either once, twice or thrice
> str_extract(string = theStart, "\\d{1,3}")

  [1] "1"   "1"   "177" "177" "177" "14"  "177" "177" "177" "177"
 [11] "177" "177" NA    "178" "179" "180" NA    "18"  "181" "181"
 [21] "181" "181" "181" "181" "181" "181" "181" "181" "181" "181"
 [31] "22"  "181" "181" "5"   "182" "182" "182" NA    "6"   "183"
 [41] "23"  "183" "19"  "11"  "25"  "184" "184" "184" "184" "184"
 [51] "185" "184" "28"  "185" "13"  "4"   "185" "185" "185" "185"
 [61] "185" "185" "6"   "185" "6"   "186" "12"  "186" "186" "186"
 [71] "186" "186" "17"  "31"  "186" "20"  "186" "186" "186" "186"
 [81] "186" "17"  "1"   "6"   "12"  "27"  "187" "187" "187" "187"
 [91] "187" "187" NA    "30"  "188" "189" "22"  "189" "21"  "189"
[101] "25"  "189" "189" "189" "189" "189" "189" "2"   "189" "28"
[111] "191" "21"  "28"  "191" "191" "191" "191" "191" "191" "191"
[121] "191" "191" "191" "7"   "194" "194" NA    NA    "3"   "7"
[131] "194" "194" NA    "20"  NA    "1"   "16"  "194" "8"   "194"
[141] "17"  "9"   "194" "3"   "22"  "22"  "6"   "6"   "15"  "25"
[151] "25"  "16"  "8"   "6"   "194" "195" "195" "195" "195" "197"
[161] "28"  "25"  "15"  "24"  "19"  "198" "15"  "198" "4"   "20"
[171] "2"   "199" "199" "199" "19"  "20"  "24"  "7"   "7"   "7"
[181] "15"  "7"   "6"   "20"  "16"  "14"  "200" "19"

Regular expressions can search for text with anchors indicating the beginning of a line (“^”) and the end of a line (“$”).

> # extract 4 digits at the beginning of the text
> head(str_extract(string = theStart, pattern = "^\\d{4}"), 30)

 [1] NA     NA     "1775" NA     NA     NA     "1777" "1775" "1776"
[10] "1778" "1775" "1779" NA     "1785" "1798" "1801" NA     NA
[19] "1812" "1813" "1812" "1812" "1813" "1813" "1813" "1814" "1813"
[28] "1814" "1813" "1815"

> # extract 4 digits at the end of the text
> head(str_extract(string = theStart, pattern = "\\d{4}$"), 30)

 [1] "1774" "1774" "1775" "1775" "1776" "1777" "1777" "1775" "1776"
[10] "1778" "1775" "1779" NA     "1785" "1798" "1801" NA     "1812"
[19] "1812" "1813" "1812" "1812" "1813" "1813" "1813" "1814" "1813"
[28] "1814" "1813" "1815"

> # extract 4 digits at the beginning AND the end of the text
> head(str_extract(string = theStart, pattern = "^\\d{4}$"), 30)

 [1] NA     NA     "1775" NA     NA     NA     "1777" "1775" "1776"
[10] "1778" "1775" "1779" NA     "1785" "1798" "1801" NA     NA
[19] "1812" "1813" "1812" "1812" "1813" "1813" "1813" "1814" "1813"
[28] "1814" "1813" "1815"
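The effect of each anchor is easiest to see on a few contrasting strings. The following sketch uses base R's grepl on a hypothetical vector (sampleDates) chosen so that the three patterns give different answers.

```r
# hypothetical strings: year only, year at the end, year at both ends
sampleDates <- c("1775", "June 1775", "1775 to 1783")

# ^ requires the four digits at the start of the string
startsWithYear <- grepl(pattern = "^\\d{4}", x = sampleDates)

# $ requires the four digits at the end of the string
endsWithYear <- grepl(pattern = "\\d{4}$", x = sampleDates)

# both anchors together require the string to be exactly four digits
onlyYear <- grepl(pattern = "^\\d{4}$", x = sampleDates)

startsWithYear
endsWithYear
onlyYear
```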

Replacing text selectively is another powerful feature of regular expressions. We start by simply replacing numbers with a fixed value.


> # replace the first digit seen with "x"
> head(str_replace(string=theStart, pattern="\\d", replacement="x"), 30)

 [1] "September x, 1774" "September x, 1774" "x775"
 [4] "June x775"         "July x776"         "June x4, 1777"
 [7] "x777"              "x775"              "x776"
[10] "x778"              "x775"              "x779"
[13] "January"           "x785"              "x798"
[16] "x801"              "August"            "June x8, 1812"
[19] "x812"              "x813"              "x812"
[22] "x812"              "x813"              "x813"
[25] "x813"              "x814"              "x813"
[28] "x814"              "x813"              "x815"

> # replace all digits seen with "x"
> # this means "7" -> "x" and "382" -> "xxx"
> head(str_replace_all(string=theStart, pattern="\\d", replacement="x"),
+      30)

 [1] "September x, xxxx" "September x, xxxx" "xxxx"
 [4] "June xxxx"         "July xxxx"         "June xx, xxxx"
 [7] "xxxx"              "xxxx"              "xxxx"
[10] "xxxx"              "xxxx"              "xxxx"
[13] "January"           "xxxx"              "xxxx"
[16] "xxxx"              "August"            "June xx, xxxx"
[19] "xxxx"              "xxxx"              "xxxx"
[22] "xxxx"              "xxxx"              "xxxx"
[25] "xxxx"              "xxxx"              "xxxx"
[28] "xxxx"              "xxxx"              "xxxx"

> # replace any strings of digits from 1 to 4 in length with "x"
> # this means "7" -> "x" and "382" -> "x"
> head(str_replace_all(string=theStart, pattern="\\d{1,4}",
+                       replacement="x"), 30)

 [1] "September x, x" "September x, x" "x"
 [4] "June x"         "July x"         "June x, x"
 [7] "x"              "x"              "x"
[10] "x"              "x"              "x"
[13] "January"        "x"              "x"
[16] "x"              "August"         "June x, x"
[19] "x"              "x"              "x"
[22] "x"              "x"              "x"
[25] "x"              "x"              "x"
[28] "x"              "x"              "x"
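Base R draws the same distinction between replacing the first match and replacing every match with sub and gsub. The sketch below repeats the three replacements on a single hypothetical string so the differences are easy to compare.

```r
sampleDate <- "June 14, 1777"

# sub replaces only the first digit it finds
firstOnly <- sub(pattern = "\\d", replacement = "x", x = sampleDate)

# gsub replaces every digit individually
everyDigit <- gsub(pattern = "\\d", replacement = "x", x = sampleDate)

# gsub with {1,4} replaces each run of up to four digits with a single x
everyRun <- gsub(pattern = "\\d{1,4}", replacement = "x", x = sampleDate)

firstOnly
everyDigit
everyRun
```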

Not only can regular expressions substitute fixed values into a string, they can also substitute part of the search pattern. To see this, we create a vector of some HTML commands.

> # create a vector of HTML commands
> commands <- c("<a href=index.html>The Link is here</a>",
+               "<b>This is bold text</b>")

Now we would like to extract the text between the HTML tags. The pattern is a set of opening and closing angle brackets with something in between (“<.+?>”), some text (“.+?”) and another set of opening and closing brackets (“<.+?>”). The “.” indicates a search for anything, while the “+” means to search for it one or more times with the “?” meaning it is not a greedy search. Because we do not know what the text between the tags will be, and that is what we want to substitute back into the text, we group it inside parentheses and use a back reference to reinsert it using “\\1”, which indicates use of the first grouping. Subsequent groupings are referenced using subsequent numerals, up to nine. In other languages a “$” is used instead of “\\”.

> # get the text between the HTML tags
> # the content in (.+?) is substituted using \\1
> str_replace(string=commands, pattern="<.+?>(.+?)<.+?>",
+             replacement="\\1")

[1] "The Link is here" "This is bold text"
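Back references work the same way in base R's sub, and more than one group can be referenced in a single replacement. The sketch below repeats the tag example and then swaps two captured words. The commandSample and swappedName objects are hypothetical, and perl = TRUE is set as a cautious choice so the non-greedy “?” is handled by the Perl-compatible engine.

```r
commandSample <- c("<a href=index.html>The Link is here</a>",
                   "<b>This is bold text</b>")

# keep only the first group, the text between the tags
innerText <- sub(pattern = "<.+?>(.+?)<.+?>", replacement = "\\1",
                 x = commandSample, perl = TRUE)

# two groups: \\1 is the last name, \\2 the first name, swapped on output
swappedName <- sub(pattern = "(\\w+), (\\w+)", replacement = "\\2 \\1",
                   x = "Adams, John", perl = TRUE)

innerText
swappedName
```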

Since R has its own regular expression peculiarities, there is a handy help file that can be accessed with ?regex.

13.5. Conclusion

R has many facilities for dealing with text, whether creating, extracting or manipulating it. For creating text, it is best to use sprintf and if necessary paste. For all other text needs, it is best to use Hadley Wickham’s stringr package. This includes pulling out text specified by character position (str_sub), regular expressions (str_detect, str_extract and str_replace) and splitting strings (str_split).
