In R, the easiest-to-use package for web scraping is rvest. Run the following code to install the package from CRAN:
install.packages("rvest")
First, we load the package and use read_html() to read data/single-table.html and try to extract the table from the web page:
library(rvest)
## Loading required package: xml2
single_table_page <- read_html("data/single-table.html")
single_table_page
## {xml_document}
## <html>
## [1] <head> <title>Single table</title> </head>
## [2] <body> <p>The following is a table</p> <table i ...
Note that single_table_page is a parsed HTML document, which is a nested data structure of HTML nodes.
A typical process for scraping information from such a web page with rvest functions is as follows: first, locate the HTML nodes from which we need to extract data; then, use a CSS selector or an XPath expression to filter the HTML nodes so that the nodes we need are selected and those we don't need are omitted; finally, apply the selectors with html_nodes() to take a subset of nodes, html_attrs() to extract attributes, and html_text() to extract text from the parsed web page.
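As a quick illustration of these three steps, here is a minimal sketch on an inline HTML snippet (the snippet and its node names are made up for illustration; read_html() also accepts a literal HTML string):

```r
library(rvest)

# A made-up snippet, parsed directly from a string:
page <- read_html('<html><body><a id="home" href="/index.html">Home</a></body></html>')

link <- page %>% html_node("a#home")  # locate the node with a CSS selector
link %>% html_attr("href")            # extract an attribute
## [1] "/index.html"
link %>% html_text()                  # extract the inner text
## [1] "Home"
```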
The package also provides simple functions that directly extract data from a web page and return a data frame. For example, to extract all <table> elements from the document, we directly call html_table():
html_table(single_table_page)
## [[1]]
##    Name Age
## 1 Jenny  18
## 2 James  19
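Note that html_table() applied to a whole document returns a list of data frames, one per <table> element, which is why the output above is wrapped in [[1]]. A sketch of taking the first table out of that list:

```r
library(rvest)

page <- read_html("data/single-table.html")
tables <- html_table(page)  # a list with one data frame per <table>
length(tables)              # here, a single table
tables[[1]]                 # extract the first table as a data frame
```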
To extract the first <table> element, we use html_node() to select the first node matching the CSS selector table and then use html_table() with the node to get a data frame:
html_table(html_node(single_table_page, "table"))
##    Name Age
## 1 Jenny  18
## 2 James  19
A more natural way to do this is to use pipelines, just like using %>% with dplyr functions introduced in Chapter 12, Data Manipulation. Recall that %>% basically evaluates x %>% f(...) as f(x, ...), so that a nested call can be unnested and becomes much more readable. The preceding code can be rewritten as follows using %>%:
single_table_page %>%
  html_node("table") %>%
  html_table()
##    Name Age
## 1 Jenny  18
## 2 James  19
Now we read data/products.html and use html_nodes() to match the <span class="name"> nodes:
products_page <- read_html("data/products.html")
products_page %>%
  html_nodes(".product-list li .name")
## {xml_nodeset (3)}
## [1] <span class="name">Product-A</span>
## [2] <span class="name">Product-B</span>
## [3] <span class="name">Product-C</span>
Note that the nodes we want to select are of the name class, inside <li> nodes of a node of the product-list class; therefore, we can use .product-list li .name to select all such nodes. Refer to a CSS selector reference if the notation looks unfamiliar.
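Alternatively, html_nodes() also accepts an XPath expression through its xpath argument. Assuming the same data/products.html and assuming the product list is a <ul> element, a roughly equivalent XPath query might look like this (the exact expression is an illustration, not taken from the page source):

```r
library(rvest)

products_page <- read_html("data/products.html")

# XPath counterpart of the CSS selector ".product-list li .name"
# (assumes the list element is a <ul>):
products_page %>%
  html_nodes(xpath = "//ul[contains(@class, 'product-list')]//li//span[@class = 'name']")
```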
To extract the contents from the selected nodes, we use html_text(), which returns a character vector:
products_page %>%
  html_nodes(".product-list li .name") %>%
  html_text()
## [1] "Product-A" "Product-B" "Product-C"
Similarly, the following code extracts the product prices:
products_page %>%
  html_nodes(".product-list li .price") %>%
  html_text()
## [1] "$199.95" "$129.95" "$99.95"
In the preceding code, html_nodes() returns a collection of HTML nodes, while html_text() is smart enough to extract the inner text from each HTML node and return a character vector.
Note that these prices are still in their raw format, represented as strings rather than numbers. The following code extracts the same data and transforms it into a more useful form:
product_items <- products_page %>%
  html_nodes(".product-list li")
products <- data.frame(
  name = product_items %>%
    html_nodes(".name") %>%
    html_text(),
  price = product_items %>%
    html_nodes(".price") %>%
    html_text() %>%
    gsub("$", "", ., fixed = TRUE) %>%
    as.numeric(),
  stringsAsFactors = FALSE
)
products
##        name  price
## 1 Product-A 199.95
## 2 Product-B 129.95
## 3 Product-C  99.95
Note that the intermediate result of selected nodes can be stored in a variable and used repeatedly. The subsequent html_nodes() and html_node() calls then match only nodes within those stored nodes.
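One benefit of working from a stored nodeset: applied to a nodeset, html_node() (singular) returns exactly one result per input node, so columns built from the same nodeset stay aligned row by row. A sketch, assuming the same data/products.html:

```r
library(rvest)

products_page <- read_html("data/products.html")
product_items <- products_page %>% html_nodes(".product-list li")

# One result per <li>, in the same order as product_items:
product_items %>% html_node(".name") %>% html_text()
product_items %>% html_node(".price") %>% html_text()
```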
Since product prices should be numeric values, we use gsub() to remove $ from the raw prices and convert the results to a numeric vector. The call to gsub() in the pipeline is somewhat special, because the previous result (represented by .) must be passed as the third argument instead of the first one.
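The dot placeholder can be tried out on its own with only the magrittr package (which provides %>%), here on a single hard-coded price:

```r
library(magrittr)

# Without the ".", the piped string would become gsub()'s *first*
# argument (the pattern); "." places it in the third argument (x):
"$199.95" %>%
  gsub("$", "", ., fixed = TRUE) %>%
  as.numeric()
## [1] 199.95
```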
In this case, .product-list li .name can be reduced to .name, and the same also applies to .product-list li .price. In practice, however, a CSS class may be used extensively, and such a general selector may match many undesired elements. Therefore, it is better to use a more descriptive and sufficiently strict selector to match the nodes of interest.
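To see why the shorter selector can be risky, consider a hypothetical page that reuses the name class outside the product list (this snippet is made up for illustration):

```r
library(rvest)

page <- read_html('<html><body>
  <span class="name">Site Title</span>
  <ul class="product-list">
    <li><span class="name">Product-A</span></li>
  </ul>
</body></html>')

page %>% html_nodes(".name") %>% html_text()
## [1] "Site Title" "Product-A"

page %>% html_nodes(".product-list li .name") %>% html_text()
## [1] "Product-A"
```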