R provides a platform with easy access to statistical computing and data analysis. Given a data set, it is handy to perform data transformation and apply analytic models and numeric methods with either flexible data structures or high performance, as discussed in previous chapters.
However, the input data set is not always as readily available as the tables provided by well-organized commercial databases. Sometimes, we have to collect the data ourselves. Web content is an important source of data for a wide range of research fields. To collect (scrape or harvest) data from the Internet, we need appropriate techniques and tools. In this chapter, we'll introduce the basic knowledge and tools of web scraping.
Web pages are made to present information. The following screenshot shows a simple web page, located at data/simple-page.html, that has a heading and a paragraph:
All modern web browsers support such web pages. If you open data/simple-page.html with any text editor, it will show the code behind the web page as follows:
<!DOCTYPE html>
<html>
<head>
  <title>Simple page</title>
</head>
<body>
  <h1>Heading 1</h1>
  <p>This is a paragraph.</p>
</body>
</html>
The preceding code is an example of HTML (HyperText Markup Language), the standard markup language for web pages. Unlike a programming language, which is ultimately translated into computer instructions, HTML describes the layout and content of a web page, and web browsers are designed to render the code into a web page according to web standards.
Modern web browsers use the first line of the HTML, the <!DOCTYPE> declaration, to determine which standard is used to render the web page. In this case, <!DOCTYPE html> selects the latest standard, HTML5.
If you read through the code, you'll probably notice that HTML is nothing but a nested structure of tags such as <html>, <title>, <body>, <h1>, and <p>. Each tag begins with <tag> and is closed with </tag>.
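The nesting described above can be made visible with a few lines of code. The following is a minimal sketch that uses Python's standard-library html.parser purely for illustration; the class name TagOutline and the variable SIMPLE_PAGE are hypothetical names chosen for this example, and the sketch assumes every opening tag has a matching closing tag, as in the simple page.

```python
from html.parser import HTMLParser

# The simple page from above, embedded as a string for a self-contained example.
SIMPLE_PAGE = """<!DOCTYPE html>
<html>
<head>
  <title>Simple page</title>
</head>
<body>
  <h1>Heading 1</h1>
  <p>This is a paragraph.</p>
</body>
</html>"""


class TagOutline(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1  # everything until the closing tag is nested deeper

    def handle_endtag(self, tag):
        self.depth -= 1


parser = TagOutline()
parser.feed(SIMPLE_PAGE)
parser.close()

# Print one tag per line, indented by its nesting depth.
for depth, tag in parser.outline:
    print("  " * depth + "<" + tag + ">")
```

The indentation of the printed outline mirrors how <head> and <body> sit inside <html>, and how <title>, <h1>, and <p> sit one level deeper.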
In fact, these tags are not arbitrarily named, nor are they allowed to contain other arbitrary tags. Each has a specific meaning to the web browser and is only allowed to contain a subset of tags, or even none.
The <html> tag is the root element of every HTML document. It most commonly contains <head> and <body>. The <head> tag usually contains <title>, which is shown on the title bar and browser tabs, along with other metadata of the web page, while <body> plays the main role in determining the layout and contents of the web page.
In the <body> tag, tags can be nested more freely. The simple page only contains a level-1 heading (<h1>) and a paragraph (<p>), while the following web page contains a table with two rows and two columns:
The HTML code behind the web page is stored in data/single-table.html:
<!DOCTYPE html>
<html>
<head>
  <title>Single table</title>
</head>
<body>
  <p>The following is a table</p>
  <table id="table1" border="1">
    <thead>
      <tr>
        <th>Name</th>
        <th>Age</th>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Jenny</td>
        <td>18</td>
      </tr>
      <tr>
        <td>James</td>
        <td>19</td>
      </tr>
    </tbody>
  </table>
</body>
</html>
Note that a <table> tag is structured row by row: <tr> represents a table row, <th> a table header cell, and <td> a table cell.
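Because the table is organized row by row, its contents can be recovered by starting a new row at each <tr> and a new cell at each <th> or <td>. The following is a minimal sketch using Python's standard-library html.parser for illustration only; TableRows and TABLE_HTML are hypothetical names, and the sketch assumes a single, well-formed table without nested tables.

```python
from html.parser import HTMLParser

# The table from data/single-table.html, embedded as a string.
TABLE_HTML = """<table id="table1" border="1">
<thead><tr><th>Name</th><th>Age</th></tr></thead>
<tbody>
<tr><td>Jenny</td><td>18</td></tr>
<tr><td>James</td><td>19</td></tr>
</tbody>
</table>"""


class TableRows(HTMLParser):
    """Collect the text of every cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell texts per <tr>
        self.in_cell = False  # True while inside a <th> or <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.in_cell = False

    def handle_data(self, data):
        # Text outside cells (such as whitespace between rows) is ignored.
        if self.in_cell:
            self.rows[-1][-1] += data


table = TableRows()
table.feed(TABLE_HTML)
table.close()

header, records = table.rows[0], table.rows[1:]
print(header)
print(records)
```

Every extracted value is a string, so a column such as Age would still need to be converted to a number before analysis.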
Also notice that an HTML element such as <table> may have additional attributes in the form of <table attr1="value1" attr2="value2">. The attributes are not arbitrarily defined. Instead, each has a specific meaning according to the standard. In the preceding code, id is the identifier of the table and border controls its border width.
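Attributes are available to a parser as name and value pairs attached to the opening tag. As a minimal sketch, again using Python's standard-library html.parser only for illustration (AttributeCollector and TABLE_TAG_HTML are hypothetical names), the following collects the attributes of each element that declares any:

```python
from html.parser import HTMLParser

# A shortened version of the table markup, enough to show its attributes.
TABLE_TAG_HTML = '<table id="table1" border="1"><tr><td>Jenny</td></tr></table>'


class AttributeCollector(HTMLParser):
    """Map each tag name to the attributes found on its opening tag."""

    def __init__(self):
        super().__init__()
        self.attributes = {}

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; skip tags without attributes.
        # If a tag name occurs more than once, the last occurrence wins.
        if attrs:
            self.attributes[tag] = dict(attrs)


collector = AttributeCollector()
collector.feed(TABLE_TAG_HTML)
collector.close()
print(collector.attributes)
```

Note that attribute values arrive as text: the border width is the string "1", not the number 1.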
The following page looks different from the previous ones in that it shows some styling of contents:
If you take a look at its source code at data/simple-products.html, you'll find some new tags such as <div> (a section), <ul> (unordered list), <li> (list item), and <span> (also a section, used for applying styles); additionally, many HTML elements have an attribute called style to define their appearance:
<!DOCTYPE html>
<html>
<head>
  <title>Products</title>
</head>
<body>
  <h1 style="color: blue;">Products</h1>
  <p>The following lists some products</p>
  <div id="table1" style="width: 50px;">
    <ul>
      <li>
        <span style="font-weight: bold;">Product-A</span>
        <span style="color: green;">$199.95</span>
      </li>
      <li>
        <span style="font-weight: bold;">Product-B</span>
        <span style="color: green;">$129.95</span>
      </li>
      <li>
        <span style="font-weight: bold;">Product-C</span>
        <span style="color: green;">$99.95</span>
      </li>
    </ul>
  </div>
</body>
</html>
Values in style are written in the form property1: value1; property2: value2;. However, the styles of the list items are a bit redundant because all product names share the same style, and the same is true of all product prices. The following HTML at data/products.html uses CSS (Cascading Style Sheets) instead to avoid redundant styling definitions:
<!DOCTYPE html>
<html>
<head>
  <title>Products</title>
  <style>
    h1 {
      color: darkblue;
    }
    .product-list {
      width: 50px;
    }
    .product-list li.selected .name {
      border: 1px blue solid;
    }
    .product-list .name {
      font-weight: bold;
    }
    .product-list .price {
      color: green;
    }
  </style>
</head>
<body>
  <h1>Products</h1>
  <p>The following lists some products</p>
  <div id="table1" class="product-list">
    <ul>
      <li>
        <span class="name">Product-A</span>
        <span class="price">$199.95</span>
      </li>
      <li class="selected">
        <span class="name">Product-B</span>
        <span class="price">$129.95</span>
      </li>
      <li>
        <span class="name">Product-C</span>
        <span class="price">$99.95</span>
      </li>
    </ul>
  </div>
</body>
</html>
Note that we add <style> in <head> to declare a global stylesheet in the web page. We also switch from style to class for content elements (div, li, and span) to use those predefined styles. The following examples briefly introduce the syntax of CSS.
Match all <h1> elements:
h1 { color: darkblue; }
Match all elements with the product-list class:
.product-list { width: 50px; }
Match all elements with the product-list class, and then match all nested elements with the name class:
.product-list .name { font-weight: bold; }
Match all elements with the product-list class, then match all nested <li> elements with the selected class, and finally match all nested elements with the name class:
.product-list li.selected .name { border: 1px blue solid; }
Note that the style attribute alone cannot express this last rule, because it depends on where an element sits in the page rather than on the element itself. The following screenshot shows the rendered web page:
Each CSS entry consists of a CSS selector (for example, .product-list) to match HTML elements and the styles (for example, color: red;) to apply. CSS selectors are not only used to apply styling, but are also commonly used to extract contents from web pages so the HTML elements of interest are properly matched. This is an underlying technique behind web scraping.
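To make the idea of extracting content with a selector concrete, the following sketch mimics a two-level selector of the form .outer .inner (for example, .product-list .name) on the product list shown earlier. It uses Python's standard-library html.parser for illustration only and is not a CSS engine: DescendantClassMatcher, PRODUCTS_HTML, and select are hypothetical names, only class names and the descendant relationship are handled, and the input is assumed to be well formed with no void elements such as <br>.

```python
from html.parser import HTMLParser

# The product list from data/products.html, embedded as a string.
PRODUCTS_HTML = """<div id="table1" class="product-list">
<ul>
<li><span class="name">Product-A</span> <span class="price">$199.95</span></li>
<li class="selected"><span class="name">Product-B</span> <span class="price">$129.95</span></li>
<li><span class="name">Product-C</span> <span class="price">$99.95</span></li>
</ul>
</div>"""


class DescendantClassMatcher(HTMLParser):
    """Collect the text of elements with class `inner` nested in class `outer`."""

    def __init__(self, outer, inner):
        super().__init__()
        self.outer = outer
        self.inner = inner
        self.stack = []            # class lists of the currently open elements
        self.capture_depth = None  # stack depth of the element being captured
        self.matches = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        # An ancestor (not the element itself) must carry the outer class.
        inside_outer = any(self.outer in opened for opened in self.stack)
        self.stack.append(classes)
        if self.capture_depth is None and inside_outer and self.inner in classes:
            self.capture_depth = len(self.stack)
            self.matches.append("")

    def handle_endtag(self, tag):
        if self.capture_depth == len(self.stack):
            self.capture_depth = None  # the matched element has just closed
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.capture_depth is not None:
            self.matches[-1] += data


def select(html, outer, inner):
    """Return the text of every `.outer .inner` match in `html`."""
    matcher = DescendantClassMatcher(outer, inner)
    matcher.feed(html)
    matcher.close()
    return matcher.matches


names = select(PRODUCTS_HTML, "product-list", "name")
prices = select(PRODUCTS_HTML, "product-list", "price")
print(names)
print(prices)
```

The same pattern that styles every product name therefore also identifies exactly the elements whose text we want to collect. In practice a dedicated library with full selector support would replace this hand-written matcher.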
CSS is much richer than demonstrated in the preceding code. For web scraping, we use the following examples to show the most commonly used CSS selectors:
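Syntax | Match
* | All elements
h1, h2, h3 | <h1>, <h2>, <h3>
#table1 | <* id="table1">
.product-list | <* class="product-list">
div#container | <div id="container">
div a | <div><a>, <div><p><a>
div > a | <div><a>
div > a.new | <div><a class="new">
ul > li:first-child | First <li> in <ul>
ul > li:last-child | Last <li> in <ul>
ul > li:nth-child(3) | 3rd <li> in <ul>
p + * | Next element of <p>
img[title] | <img title="...">
table[border="1"] | <table border="1">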
At each level, a selector of the form tag#id.class[attr] can combine a tag name, an #id, a .class, and an [attr] condition, each of which is optional. For more information on CSS selectors, visit https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors. To learn more about HTML tags, visit http://www.w3schools.com/tags/.