Extracting data from web pages using CSS selectors

In R, the easiest-to-use package for web scraping is rvest. Run the following code to install the package from CRAN:

install.packages("rvest") 

First, we load the package and use read_html() to read data/single-table.html and try to extract the table from the web page:

library(rvest) 
## Loading required package: xml2 
single_table_page <- read_html("data/single-table.html") 
single_table_page 
## {xml_document}
## <html>
## [1] <head>\n<title>Single table</title>\n</head>
## [2] <body>\n<p>The following is a table</p>\n<table i ...

Note that single_table_page is a parsed HTML document, which is a nested data structure of HTML nodes.

A typical process for scraping information from such a web page with rvest is as follows: first, locate the HTML nodes from which we need to extract data; then, use a CSS selector or an XPath expression to filter the HTML nodes so that the nodes we need are selected and those we don't need are omitted; finally, apply the proper selectors with html_nodes() to take a subset of nodes, html_attrs() to extract attributes, and html_text() to extract text from the parsed web page.
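The steps above can be sketched on a small inline HTML snippet (the snippet, URL, and class names here are made up for illustration; they are not from the book's data files). Note that read_html() also accepts a literal HTML string:

```r
library(rvest)

# Parse an inline HTML snippet instead of a file
page <- read_html(
  '<html><body><a href="https://example.com" class="link">Example</a></body></html>')

# Locate the node with a CSS selector
node <- html_node(page, "a.link")

# Extract a single attribute
html_attr(node, "href")
## [1] "https://example.com"

# Extract the inner text
html_text(node)
## [1] "Example"
```

html_attrs() (plural) would instead return all attributes of the node as a named character vector.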

The package also provides simple functions that directly extract data from a web page and return a data frame. For example, to extract all <table> elements from the page, we directly call html_table():

html_table(single_table_page) 
## [[1]] 
##    Name Age 
## 1 Jenny  18 
## 2 James  19 

To extract the first <table> element, we use html_node() to select the first node with the CSS selector table and then use html_table() with the node to get a data frame:

html_table(html_node(single_table_page, "table")) 
##    Name Age 
## 1 Jenny  18 
## 2 James  19 

A more natural way to do this is to use pipelines, just like using %>% with dplyr functions introduced in Chapter 12, Data Manipulation. Recall that %>% basically evaluates x %>% f(...) as f(x, ...) so that a nested call can be unnested and become much more readable. The preceding code can be rewritten as the following using %>%:

single_table_page %>% 
  html_node("table") %>% 
  html_table() 
##    Name Age 
## 1 Jenny  18 
## 2 James  19 

Now we read data/products.html and use html_nodes() to match the <span class="name"> nodes:

products_page <- read_html("data/products.html") 
products_page %>% 
  html_nodes(".product-list li .name") 
## {xml_nodeset (3)} 
## [1] <span class="name">Product-A</span> 
## [2] <span class="name">Product-B</span> 
## [3] <span class="name">Product-C</span> 

Note that the nodes we want to select are of the name class, inside <li> nodes, inside a node of the product-list class; therefore, we can use .product-list li .name to select all such nodes. Refer to a CSS selector reference table if you are not familiar with the notation.

To extract the contents from the selected nodes, we use html_text(), which returns a character vector:

products_page %>% 
  html_nodes(".product-list li .name") %>% 
  html_text() 
## [1] "Product-A" "Product-B" "Product-C" 

Similarly, the following code extracts the product prices:

products_page %>% 
  html_nodes(".product-list li .price") %>% 
  html_text() 
## [1] "$199.95" "$129.95" "$99.95" 

In the preceding code, html_nodes() returns a collection of HTML nodes, and html_text() extracts the inner text from each node, returning a character vector.

Note that these prices are still in raw form, represented as strings rather than numbers. The following code extracts the same data and transforms it into a more useful form:

product_items <- products_page %>% 
  html_nodes(".product-list li") 
products <- data.frame( 
  name = product_items %>% 
    html_nodes(".name") %>% 
    html_text(), 
  price = product_items %>% 
    html_nodes(".price") %>% 
    html_text() %>% 
    gsub("$", "", ., fixed = TRUE) %>% 
    as.numeric(), 
  stringsAsFactors = FALSE 
) 
products 
##        name  price 
## 1 Product-A 199.95 
## 2 Product-B 129.95 
## 3 Product-C  99.95 

Note that an intermediate set of selected nodes can be stored in a variable and reused. Subsequent html_nodes() and html_node() calls then match only nodes within that subset.

Since product prices should be numeric values, we use gsub() to remove $ from the raw prices and convert the results to a numeric vector. The call of gsub() in the pipeline is somewhat special because the previous result (represented by .) is passed as the third argument instead of the first one.
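The behavior of the . placeholder can be seen in isolation with a minimal sketch (using magrittr directly, which is where rvest gets %>% from; the sample prices are invented):

```r
library(magrittr)

# When "." appears among the arguments, %>% does not also insert the
# previous result as the first argument; it goes only where "." is.
c("$199.95", "$99.95") %>%
  gsub("$", "", ., fixed = TRUE) %>%
  as.numeric()
## [1] 199.95  99.95
```

Here fixed = TRUE makes gsub() treat $ as a literal character rather than the end-of-string anchor it means in regular expressions.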

In this case, .product-list li .name can be reduced to .name, and the same applies to .product-list li .price. In practice, however, a CSS class may be used extensively across a page, and such a general selector may match many elements that are not desired. Therefore, it is better to use a more descriptive and sufficiently strict selector to match the nodes of interest.
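To see how a bare class selector can overmatch, consider a hypothetical page where another element also carries the name class (the extra <p> element below is invented for illustration):

```r
library(rvest)

page <- read_html(
  '<div class="product-list">
     <ul><li><span class="name">Product-A</span></li></ul>
   </div>
   <p class="name">Not a product</p>')

# The bare selector picks up the unrelated element too
page %>% html_nodes(".name") %>% html_text()
## [1] "Product-A"     "Not a product"

# The stricter selector matches only the product names
page %>% html_nodes(".product-list li .name") %>% html_text()
## [1] "Product-A"
```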
