Working with online data and services

With growing amounts of data available from web-based sources, it is increasingly important for machine learning projects to be able to access and interact with online services. R is able to read data from online sources natively, with some caveats. First, by default, R cannot access secure websites (those using the https:// rather than the http:// protocol). Second, most web pages do not provide data in a form that R can understand; the data must be parsed, or broken apart and rebuilt into a structured form, before it can be useful. We'll discuss the workarounds shortly.

However, if neither of these caveats applies, that is, if the data are already available on a non-secure website and in a tabular form, such as CSV, that R can understand natively, then R's read.csv() and read.table() functions can access them from the web just as if they were on your local machine. Simply supply the full Uniform Resource Locator (URL) for the dataset as follows:

> mydata <- read.csv("http://www.mysite.com/mydata.csv")

R also provides functionality for downloading other files from the web, even if R cannot use them directly. For a text file, try the readLines() function as follows:

> mytext <- readLines("http://www.mysite.com/myfile.txt")

For other types of files, the download.file() function can be used. To download a file to R's current working directory, simply supply the URL and destination filename as follows:

> download.file("http://www.mysite.com/myfile.zip", "myfile.zip")

Beyond this base functionality, there are numerous packages that extend R's capabilities for working with online data, the most basic of which will be covered in the sections that follow. Because the web is massive and ever-changing, these sections are far from a comprehensive survey of all the ways R can connect to online data; there are literally hundreds of packages for everything from niche tasks to massive projects.

Note

For the most complete and up-to-date list of packages, refer to the regularly updated CRAN Web Technologies and Services task view at http://cran.r-project.org/web/views/WebTechnologies.html.

Downloading the complete text of web pages

The RCurl package by Duncan Temple Lang provides a more robust way of accessing web pages by providing an R interface to the curl (client for URLs) utility, a command-line tool for transferring data over networks. The curl program is a widely used tool that acts much like a programmable web browser; given a set of commands, it can access and download the content of nearly anything available on the web. And unlike R, it can access secure websites, as well as post instructions to online forms. It is an incredibly powerful utility.

Note

Precisely because it is so powerful, a complete curl tutorial is outside the scope of this chapter. Instead, refer to the online RCurl documentation at http://www.omegahat.net/RCurl/.

After installing and loading the RCurl package, downloading a page is as simple as typing:

> packt_page <- getURL("https://www.packtpub.com/")

This will save the full text of the Packt Publishing homepage (including all web markup) into the R character object named packt_page. As shown in the following lines, this is not very useful as-is:

> str(packt_page, nchar.max = 200)
 chr "<!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
    <head>
        <title>Packt Publishing | Technology Books, eBooks & Videos</title>"| __truncated__

The reason that the first 200 characters of the page look like nonsense is that websites are written using Hypertext Markup Language (HTML), which combines the page text with special tags that tell web browsers how to display it. The <title> and </title> tags here surround the page's title, telling the browser that this is the Packt Publishing homepage. Similar tags are used to denote other portions of the page.
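As noted previously, curl can also post instructions to online forms. A minimal sketch using RCurl's postForm() function might look like the following, where the URL and form field names are hypothetical placeholders:

> library(RCurl)
> search_results <- postForm("http://www.mysite.com/search",
                             query = "machine learning",
                             page = "1")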

Though curl is the cross-platform standard for accessing online content, if you work with web data frequently in R, the httr package by Hadley Wickham builds upon this foundation to make accessing web data more convenient and R-like. Rather than relying on RCurl, the httr package uses the newer curl package behind the scenes to retrieve the website data. We can see some of the differences immediately by attempting to download the Packt Publishing homepage using the httr package's GET() function:

> library(httr)
> packt_page <- GET("https://www.packtpub.com")
> str(packt_page, max.level = 1)
List of 10
 $ url        : chr "https://www.packtpub.com/"
 $ status_code: int 200
 $ headers    :List of 11
  ..- attr(*, "class")= chr [1:2] "insensitive" "list"
 $ all_headers:List of 1
 $ cookies    :'data.frame':    0 obs. of  7 variables:
 $ content    : raw [1:162392] 3c 21 44 4f ...
 $ date       : POSIXct[1:1], format: "2019-02-24 23:41:59"
 $ times      : Named num [1:6] 0 0.00372 0.16185 0.45156...
  ..- attr(*, "names")= chr [1:6] "redirect" "namelookup" "connect" "pretransfer" ...
 $ request    :List of 7
  ..- attr(*, "class")= chr "request"
 $ handle     :Class 'curl_handle' <externalptr>
 - attr(*, "class")= chr "response"

Where the getURL() function in RCurl downloaded only the HTML, the httr package's GET() function returns a list with properties of the query in addition to the HTML. To access the page content itself, we need to use the content() function:

> str(content(packt_page, type = "text"), nchar.max = 200)
 chr "<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
	<head>
		<title>Packt Publishing | Technology Books, eBooks & Videos</title>
		<script>
			data"| __truncated__

In order to use this data in an R program, it is necessary to process the HTML data to structure it in a format such as a list or data frame. Functions for doing so are discussed in the sections that follow.

Note

For detailed httr documentation and tutorials, visit the project homepage at https://httr.r-lib.org. The quick start guide is particularly helpful for learning the base functionality.

Parsing the data within web pages

Because there is a consistent structure to the HTML tags of many web pages, it is possible to write programs that look for desired sections of the page and extract them for compilation into a dataset. This practice of harvesting data from websites and transforming it into a structured form is known as web scraping.

Tip

Though frequently used, web scraping should be considered a last resort for getting data from the web. This is because any changes to the underlying HTML structure may break your code, requiring effort to fix or, worse, introducing unnoticed errors into your data. Additionally, many websites' terms of use agreements explicitly forbid automated data extraction, not to mention the fact that your program's traffic may overload their servers. Always check the site's terms of use before beginning your project; you may even find that the site offers its data freely via a developer agreement. You should also look for a file named robots.txt, a web standard that describes which parts of a site bots are allowed to crawl.
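For instance, a site's robots.txt file can be inspected directly from R. The following is a minimal check using the httr package introduced earlier; it assumes the site publishes the file at the standard location:

> library(httr)
> robots <- GET("https://www.packtpub.com/robots.txt")
> cat(content(robots, as = "text"))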

The rvest package (a pun on the term "harvest") by Hadley Wickham makes web scraping a largely effortless process, assuming the data you want can be found in a consistent place within the HTML.

Let's start with a simple example using the Packt Publishing homepage. We begin by downloading the page as before, using the read_html() function in the rvest package. Note that this function, when supplied with a URL, simply calls the GET() function in Hadley Wickham's httr package:

> library(rvest)
> packt_page <- read_html("https://www.packtpub.com")

Suppose we'd like to scrape the page title; looking at the previous HTML code, we know that there is only one title per page, wrapped within <title> and </title> tags. To pull the title, we supply the tag name to the html_node() function and then pipe the result into the html_text() function, which translates it to plain text:

> html_node(packt_page, "title") %>% html_text()
[1] "Packt Publishing | Technology Books, eBooks & Videos"

Notice the use of the %>% pipe operator. Just as with the dplyr package, pipes allow the creation of powerful chains of functions for processing HTML data with rvest.
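For instance, the title extraction shown previously can be written entirely as a chain of piped functions:

> packt_page %>% html_node("title") %>% html_text()
[1] "Packt Publishing | Technology Books, eBooks & Videos"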

Let's try a slightly more interesting example. Suppose we'd like to scrape a list of all packages on the CRAN machine learning task view. We begin as before by downloading the HTML page using the read_html() function:

> library(rvest)
> cran_ml <- read_html("http://cran.r-project.org/web/views/MachineLearning.html")

If we view the source of the website in a web browser, one section appears to have the data we're interested in. Note that only a subset of the output is shown here:

  <h3>CRAN packages:</h3>
  <ul>
    <li><a href="../packages/ahaz/index.html">ahaz</a></li>
    <li><a href="../packages/arules/index.html">arules</a></li>
    <li><a href="../packages/bigrf/index.html">bigrf</a></li>
    <li><a href="../packages/bigRR/index.html">bigRR</a></li>
    <li><a href="../packages/bmrm/index.html">bmrm</a></li>
    <li><a href="../packages/Boruta/index.html">Boruta</a></li>
    <li><a href="../packages/bst/index.html">bst</a></li>
    <li><a href="../packages/C50/index.html">C50</a></li>
    <li><a href="../packages/caret/index.html">caret</a></li>

The <h3> tag indicates a heading of level 3, while the <ul> and <li> tags create an unordered list (that is, bulleted rather than ordered/numbered) and its list items, respectively. The data elements we want are surrounded by <a> tags, which are hyperlink anchors linking to the CRAN page for each package.

Tip

Because the CRAN page is actively maintained and may be changed at any time, do not be surprised if your results differ from those shown here.

With this knowledge in hand, we can scrape the links much like we did previously. The one difference is that because we expect to find more than one result, we need to use the html_nodes() function to return a vector of results rather than html_node(), which returns only a single item. The following function call returns the <a> tags nested within <li> tags:

> ml_packages <- html_nodes(cran_ml, "li a")

Let's peek at the result using the head() function:

> head(ml_packages, n = 5)
{xml_nodeset (5)}
[1] <a href="../packages/nnet/index.html">nnet</a>
[2] <a href="../packages/RSNNS/index.html">RSNNS</a>
[3] <a href="../packages/rnn/index.html">rnn</a>
[4] <a href="../packages/deepnet/index.html">deepnet</a>
[5] <a href="../packages/RcppDL/index.html">RcppDL</a>

The result includes the <a> HTML output. To eliminate this, simply pipe the result into the html_text() function. The result is a vector containing the names of all packages listed in the CRAN machine learning task view, here piped into the head() function to display only its first few values:

> ml_packages %>% html_text() %>% head()
[1] "nnet"    "RSNNS"   "rnn"     "deepnet" "RcppDL"  "h2o"

These are simple examples that merely scratch the surface of what is possible with the rvest package. Using the pipe functionality, it is possible to look for tags nested within tags or for specific classes of HTML tags, as sketched below; for more complex examples, refer to the package's documentation.
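For instance, the html_nodes() function accepts CSS selectors, which can match tags nested within other tags or tags carrying a particular class; the class name in the final line is a hypothetical placeholder:

> html_nodes(cran_ml, "ul li a")      # <a> tags nested within list items
> html_nodes(cran_ml, "h3")           # all level-3 headings on the page
> html_nodes(cran_ml, "a.external")   # <a> tags with the class "external"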

Tip

In general, web scraping is an iterative process of refinement as you identify more specific criteria to include or exclude cases. The most difficult cases may even require a human eye to achieve 100 percent accuracy.

Parsing XML documents

XML is a plaintext, human-readable, but structured markup language upon which many document formats have been based. It employs a tagging structure in some ways similar to HTML, but is far stricter about formatting. For this reason, it is a popular online format for storing structured datasets.

The XML package by Duncan Temple Lang provides a suite of R functionality based on the popular C-based libxml2 parser for reading and writing XML documents. It is the grandfather of XML parsing packages in R and is still widely used.

Note

Information on the XML package, including simple examples to get you started quickly, can be found at the project's website at http://www.omegahat.net/RSXML/.

More recently, the xml2 package by Hadley Wickham has emerged as an easier and more R-like interface to the libxml2 library. The rvest package, which was covered earlier in this chapter, uses xml2 behind the scenes to parse HTML; consequently, rvest can also be used to parse XML.

Note

The xml2 homepage is found at http://xml2.r-lib.org.

Because parsing XML is so closely related to parsing HTML, the exact syntax is not covered in depth here; a brief sketch follows, and these packages' documentation provides fuller examples.
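As a minimal illustration of the xml2 workflow, the following sketch parses a small XML document defined inline; the document contents and node names are hypothetical:

> library(xml2)
> books_xml <- read_xml(
    '<books>
       <book><title>Machine Learning with R</title></book>
       <book><title>Another R Title</title></book>
     </books>')
> xml_text(xml_find_all(books_xml, ".//title"))
[1] "Machine Learning with R" "Another R Title"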

Parsing JSON from web APIs

Online applications communicate with one another using web-accessible functions known as application programming interfaces (APIs). These interfaces act much like a typical website; they receive a request from a client at a particular URL and return a response. The difference is that where a normal website returns HTML meant for display in a web browser, an API typically returns data in a structured form meant for processing by a machine.

Though it is not uncommon to find XML-based APIs, perhaps the most common API data structure today is JavaScript Object Notation (JSON). Like XML, this is a standard, plaintext format, most often used for data structures and objects on the web. The format has become popular recently due to its roots in browser-based JavaScript applications, but despite the pedigree, its utility is not limited to the web. The ease with which JSON data structures can be understood by humans and parsed by machines makes it an appealing data structure for many types of projects.

JSON is based on a simple {key: value} format. The { } brackets denote a JSON object, and the key and value denote a property of the object and that property's value. An object can have any number of properties, and the properties themselves may be objects. For example, a JSON object for this book might look something like this:

{
  "title": "Machine Learning with R",
  "author": "Brett Lantz",
  "publisher": {
     "name": "Packt Publishing",
     "url": "https://www.packtpub.com"
  },
  "topics": ["R", "machine learning", "data mining"],
  "MSRP": 54.99
}

This example illustrates the data types available to JSON: numeric, character, array (surrounded by the [ and ] characters), and object. Not shown are the null and Boolean (true or false) values, which are sketched below. The transmission of these types of objects from application to application, and from application to web browser, is what powers many of the most popular websites.
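For instance, Boolean and null values would be written as follows; the field names here are hypothetical:

{
  "in_print": true,
  "next_edition": null
}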

Note

For details about the JSON format, visit http://www.json.org/.

There are a number of packages that can convert to and from JSON data. The jsonlite package by Jeroen Ooms quickly gained prominence because it creates data structures that are more consistent and R-like than other packages, especially when using data from web APIs. For detailed information on how to use this package, visit its GitHub page at https://github.com/jeroen/jsonlite.

After installing the jsonlite package, to convert from an R object to a JSON string, we use the toJSON() function. Notice that in the output, the quote characters have been escaped using the \" notation:

> library(jsonlite)
> ml_book <- list(book_title = "Machine Learning with R",
                  author = "Brett Lantz")
> toJSON(ml_book)
{"book_title":["Machine Learning with R"],
 "author":["Brett Lantz"]}

To convert a JSON string to an R object, use the fromJSON() function. Quotes in the string need to be escaped as shown:

> ml_book_json <- "{
  \"title\": \"Machine Learning with R\",
  \"author\": \"Brett Lantz\",
  \"publisher\": {
     \"name\": \"Packt Publishing\",
     \"url\": \"https://www.packtpub.com\"
  },
  \"topics\": [\"R\", \"machine learning\", \"data mining\"],
  \"MSRP\": 54.99
}"

> ml_book_r <- fromJSON(ml_book_json)

This results in a list structure in a form much like the JSON:

> str(ml_book_r)
List of 5
 $ title    : chr "Machine Learning with R"
 $ author   : chr "Brett Lantz"
 $ publisher:List of 2
  ..$ name: chr "Packt Publishing"
  ..$ url : chr "https://www.packtpub.com"
 $ topics   : chr [1:3] "R" "machine learning" "data mining"
 $ MSRP     : num 55

Note

For more information on the jsonlite package, see The jsonlite Package: A Practical and Consistent Mapping Between JSON Data and R Objects, Ooms, J, 2014. Available at http://arxiv.org/abs/1403.2805.

Public-facing APIs allow programs like R to systematically query websites to retrieve results in JSON format using packages like RCurl and httr. Nearly all websites that provide interesting data offer APIs for querying, although some charge fees and require an access key for doing so. Popular examples include the APIs for Twitter, Google Maps, and Facebook. Though a full tutorial on using web APIs is worthy of a separate book, the basic process relies on only a couple of steps; it's the details that are tricky.

Suppose we want to query the Apple iTunes API to find the albums released by The Beatles. We first need to review the iTunes API documentation at https://affiliate.itunes.apple.com/resources/documentation/itunes-store-web-service-search-api/ to determine the URL and parameters needed to make this query. We then supply this information to the httr package's GET() function, adding a list of query parameters to be sent to the search URL:

> library(httr)
> music_search <- GET("https://itunes.apple.com/search",
                      query = list(term = "Beatles",
                                   media = "music",
                                   entity = "album",
                                   limit = 10))

By typing the name of the resulting object, we can see some details about the request:

> music_search
Response [https://itunes.apple.com/search?term=Beatles&media=music&entity=album&limit=10]
  Date: 2019-02-25 00:33
  Status: 200
  Content-Type: text/javascript; charset=utf-8
  Size: 9.75 kB
{
 "resultCount":10,
 "results": [
{"wrapperType":"collection", "collectionType":"Album", "artistId":136975, "collectionId":402060584, "amgArtistId":3644, "artistName":"The B...

To access the resulting JSON, we use the content() function, which we can then convert to an R object with the jsonlite package's fromJSON() function:

> library(jsonlite)
> music_results <- fromJSON(content(music_search))

The music_results object is a list containing the data returned from the iTunes API. Although the results are too large to print here, the str(music_results) command will display the structure of this object, which shows that the interesting data is stored as sub-objects within the results object. For example, the vector of album titles can be found as the collectionName sub-object:

> music_results$results$collectionName
 [1] "The Beatles Box Set"                   
 [2] "Abbey Road"                            
 [3] "The Beatles (White Album)"             
 [4] "The Beatles 1967-1970 (The Blue Album)"
 [5] "1 (2015 Version)"                      
 [6] "Sgt. Pepper's Lonely Hearts Club Band" 
 [7] "The Beatles 1962-1966 (The Red Album)" 
 [8] "Revolver"                              
 [9] "Rubber Soul"                           
[10] "Love"

These data elements could then be used in an R program as desired.
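For instance, a few of the returned fields can be gathered into a data frame for further analysis. The following is a minimal sketch; it assumes the artistName and collectionName fields seen in the earlier output are present in the parsed response:

> album_info <- data.frame(
    artist = music_results$results$artistName,
    album = music_results$results$collectionName,
    stringsAsFactors = FALSE
  )
> head(album_info, n = 3)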

Tip

Because the Apple iTunes API may be updated in the future, if you find that your results differ from those shown here, please check the Packt Publishing support page for this book for updated code.
