Strings (character data) often need to be constructed or deconstructed to identify observations, preprocess text, combine information or satisfy any number of other needs. R offers functions for building strings, like paste and sprintf. It also provides a number of functions for using regular expressions and examining text data, although for those purposes it is better to use Hadley Wickham’s stringr package.
The first function new R users reach for when putting together strings is paste. This function takes a series of strings, or expressions that evaluate to strings, and puts them together into one string. We start by putting together three simple strings.
> paste("Hello", "Jared", "and others")
[1] "Hello Jared and others"
Notice that spaces were put between the strings. This is because paste has an argument, sep, that determines what to put in between entries and defaults to a single space. This can be any valid text, including empty text ("").
> paste("Hello", "Jared", "and others", sep = "/")
[1] "Hello/Jared/and others"
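As a small sketch of the empty-separator case mentioned above (the variable names are only illustrative), setting sep to "" joins the pieces directly. Base R also provides paste0, which behaves like paste with sep = "".

```r
# join two strings with nothing in between
noSpace <- paste("Hello", "Jared", sep = "")

# paste0 is base R shorthand for paste(..., sep = "")
alsoNoSpace <- paste0("Hello", "Jared")

print(noSpace)
print(identical(noSpace, alsoNoSpace))
```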
Like many functions in R, paste is vectorized. This means each argument can be a vector of data to be put together element by element.
> paste(c("Hello", "Hey", "Howdy"), c("Jared", "Bob", "David"))
[1] "Hello Jared" "Hey Bob" "Howdy David"
In this case each vector had the same number of entries so they paired one-to-one. When the vectors do not have the same length they are recycled.
> paste("Hello", c("Jared", "Bob", "David"))
[1] "Hello Jared" "Hello Bob" "Hello David"
> paste("Hello", c("Jared", "Bob", "David"), c("Goodbye", "Seeya"))
[1] "Hello Jared Goodbye" "Hello Bob Seeya" "Hello David Goodbye"
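To make the recycling explicit, the following sketch (with illustrative variable names) builds the same result twice: once letting paste recycle the shorter vectors, and once repeating them by hand with rep.

```r
people <- c("Jared", "Bob", "David")
farewells <- c("Goodbye", "Seeya")

# paste recycles "Hello" and farewells to the length of people
recycled <- paste("Hello", people, farewells)

# the same thing with the recycling written out by hand
explicit <- paste(rep("Hello", 3), people, rep(farewells, length.out = 3))

print(recycled)
print(identical(recycled, explicit))
```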
paste also has the ability to collapse a vector of text into a single string containing all the elements, joined by any arbitrary separator, using the collapse argument.
> vectorOfText <- c("Hello", "Everyone", "out there", ".")
> paste(vectorOfText, collapse = " ")
[1] "Hello Everyone out there ."
> paste(vectorOfText, collapse = "*")
[1] "Hello*Everyone*out there*."
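The sep and collapse arguments can also be used together. In this sketch (the vectors are made up for illustration) sep joins the paired elements first, and collapse then joins the resulting strings into one.

```r
greetings <- c("Hello", "Hey", "Howdy")
people <- c("Jared", "Bob", "David")

# sep joins each pair, collapse joins the pairs into one string
oneString <- paste(greetings, people, sep = " ", collapse = "; ")

print(oneString)
print(length(oneString))
```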
While paste is convenient for putting together short bits of text, it can become unwieldy when piecing together long pieces of text, such as when inserting a number of variables into a long sentence. For instance, we might have a lengthy sentence that has a few spots that require the insertion of special variables. An example is “Hello Jared, your party of eight will be seated in 25 minutes” where “Jared,” “eight” and “25” could be replaced with other information.
Building this sentence with paste can make the line of code difficult to read.
To start, we make some variables to hold the information.
> person <- "Jared"
> partySize <- "eight"
> waitTime <- 25
Now we build the paste expression.
> paste("Hello ", person, ", your party of ", partySize,
+ " will be seated in ", waitTime, " minutes.", sep="")
[1] "Hello Jared, your party of eight will be seated in 25 minutes."
Making even a small change to this sentence would require putting the commas in just the right places.
A good alternative is the sprintf function. With this function we build one long string with special markers indicating where to insert values.
> sprintf("Hello %s, your party of %s will be seated in %s minutes",
+ person, partySize, waitTime)
[1] "Hello Jared, your party of eight will be seated in 25 minutes"
Here, each %s was replaced with its corresponding variable. While the long sentence is easier to read in code, we must keep the %s markers and the variables in the same order.
sprintf is also vectorized. Note that the vector lengths must be multiples of each other.
> sprintf("Hello %s, your party of %s will be seated in %s minutes",
+ c("Jared", "Bob"), c("eight", 16, "four", 10), waitTime)
[1] "Hello Jared, your party of eight will be seated in 25 minutes"
[2] "Hello Bob, your party of 16 will be seated in 25 minutes"
[3] "Hello Jared, your party of four will be seated in 25 minutes"
[4] "Hello Bob, your party of 10 will be seated in 25 minutes"
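The %s marker is only one of the formats sprintf understands. As a hedged sketch, the following uses %d for an integer and %.2f for a number with two decimal places; the bill variable is invented for this illustration and is not part of the example above.

```r
person <- "Jared"
waitTime <- 25L   # integer, so it can be formatted with %d
bill <- 83.5      # hypothetical value, used only to show %.2f

message <- sprintf("%s waited %d minutes and paid $%.2f",
                   person, waitTime, bill)
print(message)
```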
Often text needs to be ripped apart to be made useful, and while R has a number of functions for doing so, the stringr package is much easier to use.
First we need some data, so we use the XML package to download a table of United States presidents from the Library of Congress website.
> require(XML)
Then we use readHTMLTable to parse the table.
> load("data/presidents.rdata")
> theURL <- "http://www.loc.gov/rr/print/list/057_chron.html"
> presidents <- readHTMLTable(theURL, which=3, as.data.frame=TRUE,
+ skip.rows=1, header=TRUE,
+ stringsAsFactors=FALSE)
Now we take a look at the data.
> head(presidents)
YEAR PRESIDENT
1 1789-1797 George Washington
2 1797-1801 John Adams
3 1801-1805 Thomas Jefferson
4 1805-1809 Thomas Jefferson
5 1809-1812 James Madison
6 1812-1813 James Madison
FIRST LADY VICE PRESIDENT
1 Martha Washington John Adams
2 Abigail Adams Thomas Jefferson
3 Martha Wayles Skelton Jefferson
(no image) Aaron Burr
4 Martha Wayles Skelton Jefferson
(no image) George Clinton
5 Dolley Madison George Clinton
6 Dolley Madison office vacant
Examining it more closely, we see that the last few rows contain information we do not want, so we keep only the first 64 rows.
> tail(presidents$YEAR)
[1] "2001-2009"
[2] "2009-"
[3] "Presidents: Introduction (Rights/Ordering
Info.) | Adams
- Cleveland | Clinton - Harding Harrison
- Jefferson | Johnson
- McKinley | Monroe
- Roosevelt | Taft - Truman |
Tyler
- WilsonList of names, Alphabetically"
[4] "First Ladies: Introduction
(Rights/Ordering Info.) | Adams
- Coolidge | Eisenhower
- HooverJackson
- Pierce |
Polk - Wilson | List
of names, Alphabetically"
[5] "Vice Presidents: Introduction (Rights/Ordering Info.) |
Adams - Coolidge | Curtis - Hobart Humphrey - Rockefeller | Roosevelt
- WilsonList of names, Alphabetically"
[6] "Top
of Page"
> presidents <- presidents[1:64, ]
To start, we create two new columns, one for the beginning of the term and one for the end of the term. To do this we need to split the YEAR column on the hyphen (-). The stringr package has the str_split function that splits a string based on some value. It returns a list with an element for each element of the input vector. Each of these elements has as many pieces as necessary for the split, in this case either two (a start and stop year) or one (when the president served less than one year).
> require(stringr)
> # split the string
> yearList <- str_split(string = presidents$YEAR, pattern = "-")
> head(yearList)
[[1]]
[1] "1789" "1797"
[[2]]
[1] "1797" "1801"
[[3]]
[1] "1801" "1805"
[[4]]
[1] "1805" "1809"
[[5]]
[1] "1809" "1812"
[[6]]
[1] "1812" "1813"
> # combine them into one matrix
> yearMatrix <- data.frame(Reduce(rbind, yearList))
> head(yearMatrix)
X1 X2
1 1789 1797
2 1797 1801
3 1801 1805
4 1805 1809
5 1809 1812
6 1812 1813
> # give the columns good names
> names(yearMatrix) <- c("Start", "Stop")
> # bind the new columns onto the data.frame
> presidents <- cbind(presidents, yearMatrix)
> # convert the start and stop columns into numeric
> presidents$Start <- as.numeric(as.character(presidents$Start))
> presidents$Stop <- as.numeric(as.character(presidents$Stop))
> # view the changes
> head(presidents)
YEAR PRESIDENT
1 1789-1797 George Washington
2 1797-1801 John Adams
3 1801-1805 Thomas Jefferson
4 1805-1809 Thomas Jefferson
5 1809-1812 James Madison
6 1812-1813 James Madison
FIRST LADY VICE PRESIDENT
1 Martha Washington John Adams
2 Abigail Adams Thomas Jefferson
3 Martha Wayles Skelton Jefferson
(no image) Aaron Burr
4 Martha Wayles Skelton Jefferson
(no image) George Clinton
5 Dolley Madison George Clinton
6 Dolley Madison office vacant
Start Stop
1 1789 1797
2 1797 1801
3 1801 1805
4 1805 1809
5 1809 1812
6 1812 1813
> tail(presidents)
YEAR PRESIDENT FIRST LADY VICE PRESIDENT
59 1977-1981 Jimmy Carter Rosalynn Carter Walter F. Mondale
60 1981-1989 Ronald Reagan Nancy Reagan George Bush
61 1989-1993 George Bush Barbara Bush Dan Quayle
62 1993-2001 Bill Clinton Hillary Rodham Clinton Albert Gore
63 2001-2009 George W. Bush Laura Bush Richard Cheney
64 2009- Barack Obama Michelle Obama Joseph R. Biden
Start Stop
59 1977 1981
60 1981 1989
61 1989 1993
62 1993 2001
63 2001 2009
64 2009 NA
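The three shapes of YEAR value seen above can be traced on a tiny vector. This is only a sketch on made-up input that mimics the table: a full range, a single year and an open-ended range. It suggests why a single-year row ends up with the same Start and Stop (rbind recycles the lone piece) and why the open-ended row ends up with NA (the empty string does not convert to a number).

```r
library(stringr)

years <- c("1789-1797", "1841", "2009-")
pieces <- str_split(string = years, pattern = "-")

# number of pieces per element: two, one, and two (the second being "")
pieceCounts <- sapply(pieces, length)
print(pieceCounts)

# stacking the pieces recycles the single "1841" across both columns
yearMat <- do.call(rbind, pieces)
print(yearMat)

# the empty string after "2009-" becomes NA when converted
print(as.numeric(yearMat[3, 2]))
```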
In the preceding example there was a quirk of R that can be frustrating at first pass. In order to convert the factor presidents$Start into a numeric, we first had to convert it into a character. That is because factors are simply labels on top of integers, as seen in Section 4.4.2. So when as.numeric is applied to a factor, it returns the underlying integers.
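The quirk is easy to reproduce on a small factor. The sketch below uses a made-up factor rather than the presidents data. Note also that since R 4.0.0 data.frame no longer converts strings to factors by default, so on a recent installation the Start and Stop columns may already be character, in which case the as.character step is harmless but unnecessary.

```r
startFactor <- factor(c("1789", "1797", "1801"))

# converts the underlying integer codes, not the labels
codes <- as.numeric(startFactor)
print(codes)

# going through character recovers the years themselves
startYears <- as.numeric(as.character(startFactor))
print(startYears)
```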
Just like in Excel, it is possible to select specified characters from text using str_sub.
> # get the first 3 characters
> str_sub(string = presidents$PRESIDENT, start = 1, end = 3)
[1] "Geo" "Joh" "Tho" "Tho" "Jam" "Jam" "Jam" "Jam" "Jam" "Joh" "And"
[12] "And" "Mar" "Wil" "Joh" "Jam" "Zac" "Mil" "Fra" "Fra" "Jam" "Abr"
[23] "Abr" "And" "Uly" "Uly" "Uly" "Rut" "Jam" "Che" "Gro" "Gro" "Ben"
[34] "Gro" "Wil" "Wil" "Wil" "The" "The" "Wil" "Wil" "Woo" "War" "Cal"
[45] "Cal" "Her" "Fra" "Fra" "Fra" "Har" "Har" "Dwi" "Joh" "Lyn" "Lyn"
[56] "Ric" "Ric" "Ger" "Jim" "Ron" "Geo" "Bil" "Geo" "Bar"
> # get the 4th through 8th characters
> str_sub(string = presidents$PRESIDENT, start = 4, end = 8)
[1] "rge W" "n Ada" "mas J" "mas J" "es Ma" "es Ma" "es Ma" "es Ma"
[9] "es Mo" "n Qui" "rew J" "rew J" "tin V" "liam " "n Tyl" "es K."
[17] "hary " "lard " "nklin" "nklin" "es Bu" "aham " "aham " "rew J"
[25] "sses " "sses " "sses " "herfo" "es A." "ster " "ver C" "ver C"
[33] "jamin" "ver C" "liam " "liam " "liam " "odore" "odore" "liam "
[41] "liam " "drow " "ren G" "vin C" "vin C" "bert " "nklin" "nklin"
[49] "nklin" "ry S." "ry S." "ght D" "n F. " "don B" "don B" "hard "
[57] "hard " "ald R" "my Ca" "ald R" "rge B" "l Cli" "rge W" "ack O"
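str_sub also accepts negative positions, which count backward from the end of each string. A minimal sketch on two names (a made-up vector, not the full presidents column):

```r
library(stringr)

presidentNames <- c("George Washington", "John Adams")

# negative positions count from the end: the last five characters
lastFive <- str_sub(string = presidentNames, start = -5, end = -1)
print(lastFive)
```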
This is good for finding presidents whose term started in a year ending in 1, which means they were elected in a year ending in 0, a preponderance of whom died in office.
> presidents[str_sub(string = presidents$Start, start = 4,
+ end = 4) == 1, c("YEAR", "PRESIDENT", "Start", "Stop")]
YEAR PRESIDENT Start Stop
3 1801-1805 Thomas Jefferson 1801 1805
14 1841 William Henry Harrison 1841 1841
15 1841-1845 John Tyler 1841 1845
22 1861-1865 Abraham Lincoln 1861 1865
29 1881 James A. Garfield 1881 1881
30 1881-1885 Chester A. Arthur 1881 1885
37 1901 William McKinley 1901 1901
38 1901-1905 Theodore Roosevelt 1901 1905
43 1921-1923 Warren G. Harding 1921 1923
48 1941-1945 Franklin D. Roosevelt 1941 1945
53 1961-1963 John F. Kennedy 1961 1963
60 1981-1989 Ronald Reagan 1981 1989
63 2001-2009 George W. Bush 2001 2009
Sifting through text often requires searching for patterns, and usually these patterns have to be general and flexible. This is where regular expressions are very useful. We will not make an exhaustive lesson of regular expressions but will illustrate how to use them within R.
Let’s say we want to find any president with “John” in his name, either first or last. Since we do not know where in the name “John” would occur, we cannot simply use str_sub. Instead we use str_detect.
> # returns TRUE/FALSE if John was found in the name
> johnPos <- str_detect(string = presidents$PRESIDENT, pattern = "John")
> presidents[johnPos, c("YEAR", "PRESIDENT", "Start", "Stop")]
YEAR PRESIDENT Start Stop
2 1797-1801 John Adams 1797 1801
10 1825-1829 John Quincy Adams 1825 1829
15 1841-1845 John Tyler 1841 1845
24 1865-1869 Andrew Johnson 1865 1869
53 1961-1963 John F. Kennedy 1961 1963
54 1963-1965 Lyndon B. Johnson 1963 1965
55 1963-1969 Lyndon B. Johnson 1963 1969
This found John Adams, John Quincy Adams, John Tyler, Andrew Johnson, John F. Kennedy and Lyndon B. Johnson. Note that regular expressions are case sensitive, so to ignore case we have to wrap the pattern in ignore.case.
> badSearch <- str_detect(presidents$PRESIDENT, "john")
> goodSearch <- str_detect(presidents$PRESIDENT, ignore.case("John"))
> sum(badSearch)
[1] 0
> sum(goodSearch)
[1] 7
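Recent releases of stringr have deprecated ignore.case in favor of the regex modifier function with ignore_case = TRUE. The sketch below assumes such a version is installed and uses a short made-up vector of names rather than the presidents data.

```r
library(stringr)

presidentNames <- c("John Adams", "Andrew Johnson", "James Madison")

# case-sensitive search: lowercase "john" matches nothing
caseSensitive <- str_detect(presidentNames, "john")

# regex(..., ignore_case = TRUE) makes the match case insensitive
caseInsensitive <- str_detect(presidentNames,
                              regex("john", ignore_case = TRUE))

print(sum(caseSensitive))
print(sum(caseInsensitive))
```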
To show off some more interesting regular expressions we will make use of yet another table, the Wikipedia list of United States wars. Because we only care about one column, which has some encoding issues, we put an RData file of just that one column at http://www.jaredlander.com/data/warTimes.rdata. We load that file using load and we then see a new object in our session named warTimes.
For some odd reason, loading RData files from a URL is not as straightforward as reading in a CSV file from a URL. A connection must first be made using url, then that connection is loaded with load, and then the connection must be closed with close.
> con <- url("http://www.jaredlander.com/data/warTimes.rdata")
> load(con)
> close(con)
This vector holds the starting and stopping dates of the wars. Sometimes it has just years, sometimes it also includes months and possibly days. There are instances where it has only one year. Because of this, it is a good dataset to comb through with various text functions. The first few entries follow.
> head(warTimes, 10)
[1] "September 1, 1774 ACAEA September 3, 1783"
[2] "September 1, 1774 ACAEA March 17, 1776"
[3] "1775ACAEA1783"
[4] "June 1775 ACAEA October 1776"
[5] "July 1776 ACAEA March 1777"
[6] "June 14, 1777 ACAEA October 17, 1777"
[7] "1777ACAEA1778"
[8] "1775ACAEA1782"
[9] "1776ACAEA1794"
[10] "1778ACAEA1782"
We want to create a new column that contains information for the start of the war. To get at this information we need to split the Time column. Thanks to Wikipedia’s encoding, the separator is generally “ACAEA,” which was originally “ â€Â′′” and converted to these characters to make life easier. There are two instances where the “-” appears, once as a separator and once to make a hyphenated word. This is seen in the following code.
> warTimes[str_detect(string = warTimes, pattern = "-")]
[1] "6 June 1944 ACAEA mid-July 1944"
[2] "25 August-17 December 1944"
So when we are splitting our string, we need to search for either “ACAEA” or “-.” In str_split the pattern argument can take a regular expression. In this case it will be “(ACAEA)|-,” which tells the engine to search for either “(ACAEA)” or (denoted by the vertical pipe) “-” in the string. To avoid the instance, seen before, where the hyphen is used in “mid-July” we set the argument n to 2 so it returns at most two pieces for each element of the input vector. The parentheses are not matched but rather act to group the characters “ACAEA” in the search.1 This grouping capability will prove important for advanced replacement of text, which will be demonstrated later in this section.
1. To match literal parentheses, they should be prefixed with a backslash (\), which is written as two backslashes (“\\(”) inside an R string.
> theTimes <- str_split(string = warTimes, pattern = "(ACAEA)|-", n = 2)
> head(theTimes)
[[1]]
[1] "September 1, 1774 " " September 3, 1783"
[[2]]
[1] "September 1, 1774 " " March 17, 1776"
[[3]]
[1] "1775" "1783"
[[4]]
[1] "June 1775 " " October 1776"
[[5]]
[1] "July 1776 " " March 1777"
[[6]]
[1] "June 14, 1777 " " October 17, 1777"
Seeing that this worked for the first few entries, we also check on the two instances where a hyphen was the separator.
> which(str_detect(string = warTimes, pattern = "-"))
[1] 147 150
> theTimes[[147]]
[1] "6 June 1944 " " mid-July 1944"
> theTimes[[150]]
[1] "25 August" "17 December 1944"
This looks correct, as the first entry shows “mid-July” still intact while the second entry shows the two dates split apart.
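The effect of the n argument can be seen by splitting the same strings with and without it. This sketch uses three entries copied from the output above as a stand-alone vector (the name sampleTimes is ours), so it runs without downloading warTimes.

```r
library(stringr)

sampleTimes <- c("1775ACAEA1783",
                 "6 June 1944 ACAEA mid-July 1944",
                 "25 August-17 December 1944")

# without n, the hyphen in "mid-July" causes an extra split
unlimited <- str_split(string = sampleTimes, pattern = "(ACAEA)|-")
print(sapply(unlimited, length))

# with n = 2, splitting stops after the first separator found
limited <- str_split(string = sampleTimes, pattern = "(ACAEA)|-", n = 2)
print(limited[[2]])
print(limited[[3]])
```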
For our purposes we only care about the start date of the wars, so we need to build a function that extracts the first (in some cases only) element of each vector in the list.
> theStart <- sapply(theTimes, FUN = function(x) x[1])
> head(theStart)
[1] "September 1, 1774 " "September 1, 1774 " "1775"
[4] "June 1775 " "July 1776 " "June 14, 1777 "
The original text sometimes had spaces around the separators and sometimes did not, meaning that some of our text has trailing white spaces. The easiest way to get rid of them is with the str_trim function.
> theStart <- str_trim(theStart)
> head(theStart)
[1] "September 1, 1774" "September 1, 1774" "1775"
[4] "June 1775" "July 1776" "June 14, 1777"
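By default str_trim removes white space from both ends of a string. Its side argument restricts trimming to one end, as this small sketch on a made-up padded string shows.

```r
library(stringr)

padded <- "  June 1775  "

bothTrimmed <- str_trim(padded)                  # default: both ends
leftTrimmed <- str_trim(padded, side = "left")   # leading spaces only
rightTrimmed <- str_trim(padded, side = "right") # trailing spaces only

print(c(bothTrimmed, leftTrimmed, rightTrimmed))
```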
To extract the word “January” wherever it might occur, use str_extract. Elements where it is not found will be NA.
> # pull out 'January' anywhere it's found, otherwise return NA
> str_extract(string = theStart, pattern = "January")
[1] NA NA NA NA NA NA
[7] NA NA NA NA NA NA
[13] "January" NA NA NA NA NA
[19] NA NA NA NA NA NA
[25] NA NA NA NA NA NA
[31] NA NA NA NA NA NA
[37] NA NA NA NA NA NA
[43] NA NA NA NA NA NA
[49] NA NA NA NA NA NA
[55] NA NA NA NA NA NA
[61] NA NA NA NA NA NA
[67] NA NA NA NA NA NA
[73] NA NA NA NA NA NA
[79] NA NA NA NA NA NA
[85] NA NA NA NA NA NA
[91] NA NA NA NA NA NA
[97] NA NA "January" NA NA NA
[103] NA NA NA NA NA NA
[109] NA NA NA NA NA NA
[115] NA NA NA NA NA NA
[121] NA NA NA NA NA NA
[127] NA NA NA NA "January" NA
[133] NA NA "January" NA NA NA
[139] NA NA NA NA NA NA
[145] "January" "January" NA NA NA NA
[151] NA NA NA NA NA NA
[157] NA NA NA NA NA NA
[163] NA NA NA NA NA NA
[169] "January" NA NA NA NA NA
[175] NA NA NA NA NA NA
[181] "January" NA NA NA NA "January"
[187] NA NA
To find elements that contain “January” and return the entire entry, not just “January,” use str_detect and subset theStart with the results.
> # just return elements where 'January' was detected
> theStart[str_detect(string = theStart, pattern = "January")]
[1] "January" "January 21" "January 1942"
[4] "January" "January 22, 1944" "22 January 1944"
[7] "January 4, 1989" "15 January 2002" "January 14, 2010"
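Assuming a version of stringr that provides str_subset, the detect-then-subset step can be written as a single call. The vector below is a made-up sample, not the full theStart vector.

```r
library(stringr)

sampleStart <- c("January 21", "June 1775", "22 January 1944", "1812")

# subsetting with the logical vector from str_detect
viaDetect <- sampleStart[str_detect(string = sampleStart, pattern = "January")]

# str_subset keeps only the elements that match the pattern
viaSubset <- str_subset(string = sampleStart, pattern = "January")

print(viaSubset)
print(identical(viaDetect, viaSubset))
```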
To extract the year, we search for an occurrence of four digits together. Because we do not know the specific digits, we have to use a pattern. In a regular expression search, “[0-9]” matches any single digit. We use “[0-9][0-9][0-9][0-9]” to search for four consecutive digits.
> # get incidents of 4 numeric digits in a row
> head(str_extract(string = theStart, "[0-9][0-9][0-9][0-9]"), 20)
[1] "1774" "1774" "1775" "1775" "1776" "1777" "1777" "1775" "1776"
[10] "1778" "1775" "1779" NA "1785" "1798" "1801" NA "1812"
[19] "1812" "1813"
Writing “[0-9]” repeatedly is inefficient, especially when searching for many occurrences of a digit. Putting “4” in curly braces after “[0-9]” causes the engine to search for any run of four consecutive digits.
> # a smarter way to search for four numbers
> head(str_extract(string = theStart, "[0-9]{4}"), 20)
[1] "1774" "1774" "1775" "1775" "1776" "1777" "1777" "1775" "1776"
[10] "1778" "1775" "1779" NA "1785" "1798" "1801" NA "1812"
[19] "1812" "1813"
Even writing “[0-9]” can be inefficient, so there is a shortcut to denote any digit. In most other languages the shortcut is “\d” but in R there need to be two backslashes (“\\d”), because the backslash itself must be escaped inside an R string.
> # "\\d" is a shortcut for "[0-9]"
> head(str_extract(string = theStart, "\\d{4}"), 20)
[1] "1774" "1774" "1775" "1775" "1776" "1777" "1777" "1775" "1776"
[10] "1778" "1775" "1779" NA "1785" "1798" "1801" NA "1812"
[19] "1812" "1813"
The curly braces offer even more functionality: for instance, searching for a number one to three times.
> # this looks for any digit that occurs either once, twice or thrice
> str_extract(string = theStart, "\\d{1,3}")
[1] "1" "1" "177" "177" "177" "14" "177" "177" "177" "177"
[11] "177" "177" NA "178" "179" "180" NA "18" "181" "181"
[21] "181" "181" "181" "181" "181" "181" "181" "181" "181" "181"
[31] "22" "181" "181" "5" "182" "182" "182" NA "6" "183"
[41] "23" "183" "19" "11" "25" "184" "184" "184" "184" "184"
[51] "185" "184" "28" "185" "13" "4" "185" "185" "185" "185"
[61] "185" "185" "6" "185" "6" "186" "12" "186" "186" "186"
[71] "186" "186" "17" "31" "186" "20" "186" "186" "186" "186"
[81] "186" "17" "1" "6" "12" "27" "187" "187" "187" "187"
[91] "187" "187" NA "30" "188" "189" "22" "189" "21" "189"
[101] "25" "189" "189" "189" "189" "189" "189" "2" "189" "28"
[111] "191" "21" "28" "191" "191" "191" "191" "191" "191" "191"
[121] "191" "191" "191" "7" "194" "194" NA NA "3" "7"
[131] "194" "194" NA "20" NA "1" "16" "194" "8" "194"
[141] "17" "9" "194" "3" "22" "22" "6" "6" "15" "25"
[151] "25" "16" "8" "6" "194" "195" "195" "195" "195" "197"
[161] "28" "25" "15" "24" "19" "198" "15" "198" "4" "20"
[171] "2" "199" "199" "199" "19" "20" "24" "7" "7" "7"
[181] "15" "7" "6" "20" "16" "14" "200" "19"
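The difference between the quantifiers is easier to see on a handful of values. This sketch runs the same patterns on a short made-up vector; note that “\\d{1,3}” takes the first run of digits it finds, up to three of them, which is why a four-digit year comes back cut short.

```r
library(stringr)

sampleDates <- c("June 14, 1777", "1775", "January")

# exactly four digits in a row
fourDigits <- str_extract(string = sampleDates, pattern = "\\d{4}")
print(fourDigits)

# the first run of one to three digits
oneToThree <- str_extract(string = sampleDates, pattern = "\\d{1,3}")
print(oneToThree)

# "[0-9]{4}" and "\\d{4}" give the same answer here
sameResult <- identical(fourDigits,
                        str_extract(string = sampleDates, pattern = "[0-9]{4}"))
print(sameResult)
```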
Regular expressions can search for text with anchors indicating the beginning of a line (“^”) and the end of a line (“$”).
> # extract 4 digits at the beginning of the text
> head(str_extract(string = theStart, pattern = "^\\d{4}"), 30)
[1] NA NA "1775" NA NA NA "1777" "1775" "1776"
[10] "1778" "1775" "1779" NA "1785" "1798" "1801" NA NA
[19] "1812" "1813" "1812" "1812" "1813" "1813" "1813" "1814" "1813"
[28] "1814" "1813" "1815"
> # extract 4 digits at the end of the text
> head(str_extract(string = theStart, pattern = "\\d{4}$"), 30)
[1] "1774" "1774" "1775" "1775" "1776" "1777" "1777" "1775" "1776"
[10] "1778" "1775" "1779" NA "1785" "1798" "1801" NA "1812"
[19] "1812" "1813" "1812" "1812" "1813" "1813" "1813" "1814" "1813"
[28] "1814" "1813" "1815"
> # extract 4 digits at the beginning AND the end of the text
> head(str_extract(string = theStart, pattern = "^\\d{4}$"), 30)
[1] NA NA "1775" NA NA NA "1777" "1775" "1776"
[10] "1778" "1775" "1779" NA "1785" "1798" "1801" NA NA
[19] "1812" "1813" "1812" "1812" "1813" "1813" "1813" "1814" "1813"
[28] "1814" "1813" "1815"
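The three anchored patterns can be told apart on a small made-up vector in which the year sits alone, at the end, or at the beginning of the text.

```r
library(stringr)

sampleDates <- c("1775", "June 1775", "1775 June")

# four digits at the beginning of the text
atStart <- str_extract(string = sampleDates, pattern = "^\\d{4}")

# four digits at the end of the text
atEnd <- str_extract(string = sampleDates, pattern = "\\d{4}$")

# the text consists of exactly four digits and nothing else
wholeText <- str_extract(string = sampleDates, pattern = "^\\d{4}$")

print(atStart)
print(atEnd)
print(wholeText)
```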
Replacing text selectively is another powerful feature of regular expressions. We start by simply replacing numbers with a fixed value.
> # replace the first digit seen with "x"
> head(str_replace(string=theStart, pattern="\\d", replacement="x"), 30)
[1] "September x, 1774" "September x, 1774" "x775"
[4] "June x775" "July x776" "June x4, 1777"
[7] "x777" "x775" "x776"
[10] "x778" "x775" "x779"
[13] "January" "x785" "x798"
[16] "x801" "August" "June x8, 1812"
[19] "x812" "x813" "x812"
[22] "x812" "x813" "x813"
[25] "x813" "x814" "x813"
[28] "x814" "x813" "x815"
> # replace all digits seen with "x"
> # this means "7" -> "x" and "382" -> "xxx"
> head(str_replace_all(string=theStart, pattern="\\d", replacement="x"),
+ 30)
[1] "September x, xxxx" "September x, xxxx" "xxxx"
[4] "June xxxx" "July xxxx" "June xx, xxxx"
[7] "xxxx" "xxxx" "xxxx"
[10] "xxxx" "xxxx" "xxxx"
[13] "January" "xxxx" "xxxx"
[16] "xxxx" "August" "June xx, xxxx"
[19] "xxxx" "xxxx" "xxxx"
[22] "xxxx" "xxxx" "xxxx"
[25] "xxxx" "xxxx" "xxxx"
[28] "xxxx" "xxxx" "xxxx"
> # replace any strings of digits from 1 to 4 in length with "x"
> # this means "7" -> "x" and "382" -> "x"
> head(str_replace_all(string=theStart, pattern="\\d{1,4}",
+ replacement="x"), 30)
[1] "September x, x" "September x, x" "x"
[4] "June x" "July x" "June x, x"
[7] "x" "x" "x"
[10] "x" "x" "x"
[13] "January" "x" "x"
[16] "x" "August" "June x, x"
[19] "x" "x" "x"
[22] "x" "x" "x"
[25] "x" "x" "x"
[28] "x" "x" "x"
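Applied to one string, the three replacements line up side by side. This is only a compact restatement of the calls above on a single made-up value.

```r
library(stringr)

oneDate <- "June 14, 1777"

# only the first digit is replaced
firstOnly <- str_replace(string = oneDate, pattern = "\\d", replacement = "x")

# every digit is replaced individually
everyDigit <- str_replace_all(string = oneDate, pattern = "\\d",
                              replacement = "x")

# every run of one to four digits is replaced by a single "x"
everyRun <- str_replace_all(string = oneDate, pattern = "\\d{1,4}",
                            replacement = "x")

print(c(firstOnly, everyDigit, everyRun))
```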
Not only can regular expressions substitute fixed values into a string, they can also substitute part of the search pattern. To see this, we create a vector of some HTML commands.
> # create a vector of HTML commands
> commands <- c("<a href=index.html>The Link is here</a>",
+ "<b>This is bold text</b>")
Now we would like to extract the text between the HTML tags. The pattern is a set of opening and closing angle brackets with something in between (“<.+?>”), some text (“.+?”) and another set of opening and closing brackets (“<.+>”). The “.” indicates a search for anything, while the “+” means to search for it one or more times, with the “?” meaning it is not a greedy search. Because we do not know what the text between the tags will be, and that is what we want to substitute back into the text, we group it inside parentheses and use a back reference to reinsert it using “\\1,” which indicates use of the first grouping. Subsequent groupings are referenced using subsequent numerals, up to nine. In other languages a “$” is used instead of “\\.”
> # get the text between the HTML tags
> # the content in (.+?) is substituted using \\1
> str_replace(string=commands, pattern="<.+?>(.+?)<.+>",
+ replacement="\\1")
[1] "The Link is here" "This is bold text"
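With more than one group, the back references can be used to reorder the pieces. The following sketch, on a made-up vector of two-word names, captures the first and last name separately and writes them back in reverse order.

```r
library(stringr)

fullNames <- c("George Washington", "John Adams")

# (\\w+) captures a word; \\1 is the first group and \\2 the second
lastFirst <- str_replace(string = fullNames,
                         pattern = "(\\w+) (\\w+)",
                         replacement = "\\2, \\1")
print(lastFirst)
```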
Since R has its own regular expression peculiarities, there is a handy help file that can be accessed with ?regex.
R has many facilities for dealing with text, whether creating, extracting or manipulating it. For creating text, it is best to use sprintf and if necessary paste. For all other text needs, it is best to use Hadley Wickham’s stringr package. This includes pulling out text specified by character position (str_sub), regular expressions (str_detect, str_extract and str_replace) and splitting strings (str_split).