Chapter 12. Getting Data

Data can come from many sources. R comes with many datasets built in, and there is more data in many of the add-on packages. R can read data from a wide variety of sources and in a wide variety of formats. This chapter covers importing data from text files (including spreadsheet-like data in comma- or tab-delimited format, XML, and JSON), binary files (Excel spreadsheets and data from other analysis software), websites, and databases.

Chapter Goals

After reading this chapter, you should:

  • Be able to access datasets provided with R packages
  • Be able to import data from text files
  • Be able to import data from binary files
  • Be able to download data from websites
  • Be able to import data from a database

Built-in Datasets

One of the packages in the base R distribution is called datasets, and it is entirely filled with example datasets. While you’ll be lucky if any of them are suited to your particular area of research, they are ideal for testing your code and for exploring new techniques. Many other packages also contain datasets. You can see all the datasets that are available in the packages that you have loaded using the data function:

data()

For a more complete list, including data from all packages that have been installed, use this invocation:

data(package = .packages(TRUE))

To access the data in any of these datasets, call the data function, this time passing the name of the dataset and the package where the data is found (if the package has been loaded, then you can omit the package argument):

data("kidney", package = "survival")

Now the kidney data frame can be used just like your own variables:

head(kidney)
##   id time status age sex disease frail
## 1  1    8      1  28   1   Other   2.3
## 2  1   16      1  28   1   Other   2.3
## 3  2   23      1  48   2      GN   1.9
## 4  2   13      0  48   2      GN   1.9
## 5  3   22      1  32   1   Other   1.2
## 6  3   28      1  32   1   Other   1.2

Reading Text Files

There are many, many formats and standards for storing data in text documents. Common formats include delimiter-separated values (CSV or tab-delimited), eXtensible Markup Language (XML), JavaScript Object Notation (JSON), and YAML (which recursively stands for YAML Ain’t Markup Language). Other sources of text data are less well structured; a book, for example, contains text without any formal (that is, standardized and machine-parsable) structure.

The main advantage of storing data in text files is that they can be read by more or less all other data analysis software and by humans. This makes your data more widely reusable by others.

CSV and Tab-Delimited Files

Rectangular (spreadsheet-like) data is commonly stored in delimited-value files, particularly comma-separated values (CSV) and tab-delimited values files. The read.table function reads these delimited files and stores the results in a data frame. In its simplest form, it just takes the path to a text file and imports the contents.

RedDeerEndocranialVolume.dlm is a whitespace-delimited file containing measurements of the endocranial volume of some red deer, measured using different techniques. (For those of you with an interest in deer skulls, the methods are computer tomography; filling the skull with glass beads; measuring the length, width, and height with calipers; and using Finarelli’s equation. A second measurement was taken in some cases to get an idea of the accuracy of the techniques. I’ve been assured that the deer were already long dead before they had their skulls filled with beads!) The data file can be found inside the extdata folder in the learningr package. The first few rows are shown in Table 12-1.

Table 12-1. Sample data from RedDeerEndocranialVolume.dlm

SkullID   VolCT  VolBead  VolLWH  VolFinarelli  VolCT2  VolBead2  VolLWH2
DIC44       389      375    1484           337
B11         389      370    1722           377
DIC90       352      345    1495           328
DIC83       388      370    1683           377
DIC787      375      355    1458           328
DIC1573     325      320    1363           291
C120        346      335    1250           289     346       330     1264
C25         302      295    1011           250     303       295     1009
F7          379      360    1621           347     375       365     1647
The data has a header row, so we need to pass the argument header = TRUE to read.table. Since a second measurement wasn’t always taken, not all the lines are complete. Passing the argument fill = TRUE makes read.table substitute NA values for the missing fields. The system.file function in the following example is used to locate files that are inside a package (in this case, the RedDeerEndocranialVolume.dlm file in the extdata folder of the package learningr).

library(learningr)
deer_file <- system.file(
  "extdata",
  "RedDeerEndocranialVolume.dlm",
  package = "learningr"
)
deer_data <- read.table(deer_file, header = TRUE, fill = TRUE)
str(deer_data, vec.len = 1)      #vec.len alters the amount of output
## 'data.frame':        33 obs. of  8 variables:
##  $ SkullID     : Factor w/ 33 levels "A4","B11","B12",..: 14  ...
##  $ VolCT       : int  389 389  ...
##  $ VolBead     : int  375 370  ...
##  $ VolLWH      : int  1484 1722  ...
##  $ VolFinarelli: int  337 377  ...
##  $ VolCT2      : int  NA NA  ...
##  $ VolBead2    : int  NA NA  ...
##  $ VolLWH2     : int  NA NA  ...
head(deer_data)
##   SkullID VolCT VolBead VolLWH VolFinarelli VolCT2 VolBead2 VolLWH2
## 1   DIC44   389     375   1484          337     NA       NA      NA
## 2     B11   389     370   1722          377     NA       NA      NA
## 3   DIC90   352     345   1495          328     NA       NA      NA
## 4   DIC83   388     370   1683          377     NA       NA      NA
## 5  DIC787   375     355   1458          328     NA       NA      NA
## 6 DIC1573   325     320   1363          291     NA       NA      NA

Notice that the class of each column has been automatically determined, and row and column names have been automatically assigned. The column names are (by default) forced to be valid variable names (via make.names), and if row names aren’t provided the rows are simply numbered 1, 2, 3, and so on.

There are lots of arguments to specify how the file will be read; perhaps the most important is sep, which determines the character to use as a separator between fields. You can also specify how many lines of data to read via nrow, and how many lines at the start of the file to skip. More advanced options include the ability to override the default row names, column names, and classes, and to specify the character encoding of the input file and how string input columns should be declared.
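For instance, here is a sketch that rereads the first few rows of the deer file with some of these arguments spelled out (the column classes are assumptions chosen purely for illustration):

deer_top <- read.table(
  deer_file,
  header     = TRUE,
  fill       = TRUE,
  sep        = "",     #"" means "any whitespace", which is the default
  skip       = 0,      #number of lines to ignore at the top of the file
  nrow       = 5,      #read only five rows of data (full argument name: nrows)
  colClasses = c("character", rep("integer", 7))
)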

There are several convenience wrapper functions to read.table. read.csv sets the default separator to a comma, and assumes that the data has a header row. read.csv2 is its European cousin, using a comma for decimal places and a semicolon as a separator. Likewise read.delim and read.delim2 import tab-delimited files with full stops[37] or commas for decimal places, respectively.
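To make the relationship concrete, here is a sketch (with a hypothetical filename) showing that a read.csv2 call is just read.table with European defaults:

#read.csv2 assumes header = TRUE, sep = ";", and dec = ","
euro_data      <- read.csv2("european data.csv")
same_euro_data <- read.table(
  "european data.csv",
  header = TRUE,
  sep    = ";",
  dec    = ","
)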

Back in August 2008, scientists from the Centre for Environment, Fisheries, and Aquaculture Science (CEFAS) in Lowestoft, UK, attached a tag with a pressure sensor and a temperature sensor to a brown crab and dropped it into the North Sea. The crab then spent just over a year doing crabby things[38] before being caught by fishermen, who returned the tag to CEFAS.

The data from this tag is stored in a CSV file, along with some metadata. The first few rows of the file look like this:

Comment :- clock reset to download data

The following data are the ID block contents
Firmware Version No          2
Firmware Build Level         70

The following data are the Tag notebook contents
Mission Day                  405
Last Deployment Date         08/08/2008 09:55:00
Deployed by Host Version     5.2.0
Downloaded by Host Version   6.0.0
Last Clock Set Date          05/01/2010 10:34:00

The following data are the Lifetime notebook contents
Tag ID                       A03401
Pressure Range               10
No of sensors                2
In this case, we can’t just read everything with a single call to read.csv, since different pieces of data have different numbers of fields, and indeed different fields. We need to use the skip and nrow arguments of read.csv to specify which bits of the file to read:

crab_file <- system.file(
  "extdata",
  "crabtag.csv",
  package = "learningr"
)
(crab_id_block <- read.csv(
  crab_file,
  header = FALSE,
  skip = 3,
  nrow = 2
))
##                     V1 V2
## 1  Firmware Version No  2
## 2 Firmware Build Level 70
(crab_tag_notebook <- read.csv(
  crab_file,
  header = FALSE,
  skip = 8,
  nrow = 5
))
##                           V1                  V2
## 1                Mission Day                 405
## 2       Last Deployment Date 08/08/2008 09:55:00
## 3   Deployed by Host Version               5.2.0
## 4 Downloaded by Host Version               6.0.0
## 5        Last Clock Set Date 05/01/2010 10:34:00
(crab_lifetime_notebook <- read.csv(
  crab_file,
  header = FALSE,
  skip = 15,
  nrow = 3
))
##                V1     V2
## 1          Tag ID A03401
## 2 Pressure Range      10
## 3  No of sensors       2

Tip

The colbycol and sqldf packages contain functions that allow you to read part of a CSV file into R. This can provide a useful speed-up if you don’t need all the columns or all the rows.
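For example, the read.csv.sql function in sqldf lets you filter rows and columns with a SQL statement as the file is read; in this sketch the filename and the column names are hypothetical:

#read only two columns and a subset of rows from a (hypothetical) CSV file;
#inside the SQL statement, the file is referred to as "file"
library(sqldf)
tall_deer <- read.csv.sql(
  "deer measurements.csv",
  sql = "select SkullID, VolCT from file where VolCT > 350"
)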

For really low-level control when importing this sort of file, you can use the scan function, on which read.table is based. Ordinarily, you should never have to resort to scan, but it can be useful for malformed or nonstandard files.
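As a minimal sketch (with a hypothetical filename), scan can pull whitespace-separated numbers into a single vector while skipping a preamble:

#read whitespace-separated numbers, ignoring a two-line preamble
raw_values <- scan(
  "malformed numbers.txt",
  what  = numeric(),
  skip  = 2,
  quiet = TRUE
)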

Tip

If your data has been exported from another program, you may need to pass the na.strings argument to read.table. For data exported from SQL, use na.strings = "NULL". For data exported from SAS or Stata, use na.strings = ".". For data exported from Excel, use na.strings = c("", "#N/A", "#DIV/0!", "#NUM!").
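For example, a sketch of importing a hypothetical file exported from Excel might look like this:

#treat Excel's error codes and empty cells as missing values
excel_export <- read.csv(
  "exported from excel.csv",
  na.strings = c("", "#N/A", "#DIV/0!", "#NUM!")
)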

The opposite task, writing files, is generally simpler than reading files, since you don’t need to worry about oddities in the file—you usually want to create something standard. read.table and read.csv have the obviously named counterparts write.table and write.csv.

Both functions take a data frame and a file path to write to. They also provide a few options to customize the output (whether or not to include row names and what the character encoding of the output file should be, for example):

write.csv(
  crab_lifetime_notebook,
  "Data/Cleaned/crab lifetime data.csv",
  row.names    = FALSE,
  fileEncoding = "utf8"
)

Unstructured Text Files

Not all text files have a well-defined structure like delimited files do. If the structure of the file is weak, it is often easier to read in the file as lines of text and then parse or manipulate the contents afterward. readLines (notice the capital “L”) provides such a facility. It accepts a path to a file (or a file connection) and, optionally, a maximum number of lines to read. Here we import the Project Gutenberg version of Shakespeare’s The Tempest:

text_file <- system.file(
  "extdata",
  "Shakespeare's The Tempest, from Project Gutenberg pg2235.txt",
  package = "learningr"
)
the_tempest <- readLines(text_file)
the_tempest[1926:1927]
## [1] "   Ste. Foure legges and two voyces; a most delicate"
## [2] "Monster: his forward voyce now is to speake well of"

writeLines performs the reverse operation to readLines. It takes a character vector and a file to write to:

writeLines(
  rev(the_tempest),   #rev reverses vectors
  "Shakespeare's The Tempest, backwards.txt"
)

XML and HTML Files

XML files are widely used for storing nested data. Many standard file types and protocols are based upon it, such as RSS (Really Simple Syndication) for news feeds, SOAP (Simple Object Access Protocol) for passing structured data across computer networks, and the XHTML flavor of web pages.

Base R doesn’t ship with the capability to read XML files, but the XML package is developed by an R Core member. Install it now!

install.packages("XML")

When you import an XML file, the XML package gives you a choice of storing the result using internal nodes (that is, objects are stored with C code, the default) or R nodes. Usually you want to store things with internal nodes, because it allows you to query the node tree using XPath (more on this in a moment).

There are several functions for importing XML data, such as xmlParse and some other wrapper functions that use slightly different defaults:

library(XML)
xml_file <- system.file("extdata", "options.xml", package = "learningr")
r_options <- xmlParse(xml_file)

One of the problems with using internal nodes is that summary functions like str and head don’t work with them. To use R-level nodes, set useInternalNodes = FALSE (or use xmlTreeParse, which sets this as the default):

xmlParse(xml_file, useInternalNodes = FALSE)
xmlTreeParse(xml_file)      #the same

XPath is a language for interrogating XML documents, letting you find nodes that correspond to some filter. In this next example we look anywhere in the document (//) for a node named variable where ([]) the name attribute (@) contains the string warn.

xpathSApply(r_options, "//variable[contains(@name, 'warn')]")
## [[1]]
## <variable name="nwarnings" type="numeric">
##   <value>50</value>
## </variable>
##
## [[2]]
## <variable name="warn" type="numeric">
##   <value>0</value>
## </variable>
##
## [[3]]
## <variable name="warning_length" type="numeric">
##   <value>1000</value>
## </variable>

This sort of querying is very useful for extracting data from web pages. The equivalent functions for importing HTML pages are named as you might expect, htmlParse and htmlTreeParse, and they behave in the same way.
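As a hedged sketch, the same XPath machinery works on an ordinary web page once it has been parsed; the URL and the XPath query here are placeholders:

#parse a page and extract the text of every paragraph node
library(XML)
page_doc   <- htmlParse("http://example.com")
paragraphs <- xpathSApply(page_doc, "//p", xmlValue)
head(paragraphs)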

XML is also very useful for serializing (a.k.a. “saving”) objects in a format that can be read by most other pieces of software. The XML package doesn’t provide serialization functionality, but it is available via the makexml function in the Runiversal package. The options.xml file was created with this code:

library(Runiversal)
ops <- as.list(options())
cat(makexml(ops), file = "options.xml")

JSON and YAML Files

The main problems with XML are that it is very verbose, and you need to explicitly specify the type of the data (it can’t tell the difference between a string and a number by default), which makes it even more verbose. When file sizes are important (such as when you are transferring big datasets across a network), this verbosity becomes problematic.

YAML and its subset JSON were invented to solve these problems. They are much better suited to transporting many datasets—particularly numeric data and arrays—over networks. JSON is the de facto standard for web applications to pass data between themselves.

There are two packages for dealing with JSON data: RJSONIO and rjson. For a long time rjson had performance problems, so the only package that could be recommended was RJSONIO. The performance issues have now been fixed, so it’s a much closer call. For most cases, it now doesn’t matter which package you use. The differences occur when you encounter malformed or nonstandard JSON.

RJSONIO is generally more forgiving than rjson when reading incorrect JSON. Whether this is a good thing or not depends upon your use case. If you think it is better to import JSON data with minimal fuss, then RJSONIO is best. If you want to be alerted to problems with the JSON data (perhaps it was generated by a colleague—I’m sure you would never generate malformed JSON), then rjson is best.

Fortunately, both packages have identically named functions for reading and writing JSON data, so it is easy to swap between them. In the following example, the double colons, ::, are used to distinguish which package each function should be taken from (if you only load one of the two packages, then you don’t need the double colons):

library(RJSONIO)
library(rjson)
jamaican_city_file <- system.file(
  "extdata",
  "Jamaican Cities.json",
  package = "learningr"
)
(jamaican_cities_RJSONIO <- RJSONIO::fromJSON(jamaican_city_file))
## $Kingston
## $Kingston$population
## [1] 587798
##
## $Kingston$coordinates
## longitude  latitude
##     17.98     76.80
##
##
## $`Montego Bay`
## $`Montego Bay`$population
## [1] 96488
##
## $`Montego Bay`$coordinates
## longitude  latitude
##     18.47     77.92
(jamaican_cities_rjson <- rjson::fromJSON(file = jamaican_city_file))
## $Kingston
## $Kingston$population
## [1] 587798
##
## $Kingston$coordinates
## $Kingston$coordinates$longitude
## [1] 17.98
##
## $Kingston$coordinates$latitude
## [1] 76.8
##
##
##
## $`Montego Bay`
## $`Montego Bay`$population
## [1] 96488
##
## $`Montego Bay`$coordinates
## $`Montego Bay`$coordinates$longitude
## [1] 18.47
##
## $`Montego Bay`$coordinates$latitude
## [1] 77.92

Notice that RJSONIO simplifies the coordinates for each city to be a vector. This behavior can be turned off with simplify = FALSE, resulting in exactly the same object as the one generated by rjson.
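For example, reusing jamaican_city_file from above:

#turn off simplification to get the same nested-list structure as rjson
RJSONIO::fromJSON(jamaican_city_file, simplify = FALSE)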

Annoyingly, the JSON spec doesn’t allow infinite or NaN values, and it’s a little fuzzy on what a missing number should look like. The two packages deal with these values differently—RJSONIO maps NaN and NA to JSON’s null value but preserves positive and negative infinity, while rjson converts all these values to strings:

special_numbers <- c(NaN, NA, Inf, -Inf)
RJSONIO::toJSON(special_numbers)
## [1] "[ null, null,    Inf,   -Inf ]"
rjson::toJSON(special_numbers)
## [1] "[\"NaN\",\"NA\",\"Inf\",\"-Inf\"]"

Since both these methods are hacks to deal with JSON’s limited spec, if you find yourself dealing with these special number types a lot (or want to write comments in your data object), then you are better off using YAML. The yaml package has two functions for importing YAML data. yaml.load accepts a string of YAML and converts it to an R object, and yaml.load_file does the same, but treats its string input as a path to a file containing YAML:

library(yaml)
yaml.load_file(jamaican_city_file)
## $Kingston
## $Kingston$population
## [1] 587798
##
## $Kingston$coordinates
## $Kingston$coordinates$longitude
## [1] 17.98
##
## $Kingston$coordinates$latitude
## [1] 76.8
##
##
##
## $`Montego Bay`
## $`Montego Bay`$population
## [1] 96488
##
## $`Montego Bay`$coordinates
## $`Montego Bay`$coordinates$longitude
## [1] 18.47
##
## $`Montego Bay`$coordinates$latitude
## [1] 77.92

as.yaml performs the opposite task, converting R objects to YAML strings.
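Here is a quick sketch of a round trip, using a made-up list:

#convert an R list to a YAML string, then parse it back again
balloon      <- list(color = "red", volume = 3.5, inflated = TRUE)
balloon_yaml <- as.yaml(balloon)
cat(balloon_yaml)
yaml.load(balloon_yaml)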

Reading Binary Files

Many pieces of software store their data in binary formats (some proprietary, some conforming to publicly defined standards). The binary formats are often smaller than their text equivalents, so using a binary format can improve performance, at the expense of human readability.

Many binary file formats are proprietary, which goes against free software principles. If you have the option, it is usually best to avoid such formats in order to stop your data being locked into a platform over which you lack control.

Reading Excel Files

Microsoft Excel is the world’s most popular spreadsheet package, and very possibly the world’s most popular data analysis tool. Unfortunately, its document formats, XLS and XLSX, don’t play very nicely with other software, especially outside of a Windows environment. This means that some experimenting may be required to find a setup that works for your choice of operating system and the particular type of Excel file.

That said, the xlsx package is Java-based and cross-platform, so at least in theory it can read any Excel file on any system. It provides a choice of functions for reading Excel files: spreadsheets can be imported with read.xlsx and read.xlsx2, which do more processing in R and in Java, respectively. The choice of two engines is rather superfluous; you want read.xlsx2, since it’s faster and the underlying Java library is more mature than the R code.

The next example features the fastest times from the Alpe d’Huez mountain stage of the Tour de France bike race, along with whether or not each cyclist has been found guilty of using banned performance-enhancing drugs. The colClasses argument determines what class each column should have in the resulting data frame. It isn’t compulsory, but it saves you having to manipulate the resulting data afterward:

library(xlsx)
bike_file <- system.file(
  "extdata",
  "Alpe d'Huez.xls",
  package = "learningr"
)
bike_data <- read.xlsx2(
  bike_file,
  sheetIndex = 1,
  startRow   = 2,
  endRow     = 38,
  colIndex   = 2:8,
  colClasses = c(
    "character", "numeric", "character", "integer",
    "character", "character", "character"
  )
)
head(bike_data)
##      Time NumericTime            Name Year   Nationality DrugUse
## 1 37' 35"       37.58   Marco Pantani 1997         Italy       Y
## 2 37' 36"       37.60 Lance Armstrong 2004 United States       Y
## 3 38' 00"       38.00   Marco Pantani 1994         Italy       Y
## 4 38' 01"       38.02 Lance Armstrong 2001 United States       Y
## 5 38' 04"       38.07   Marco Pantani 1995         Italy       Y
## 6 38' 23"       38.38     Jan Ullrich 1997       Germany       Y
##                                                    Allegations
## 1 Alleged drug use during 1997 due to high haematocrit levels.
## 2         2004 Tour de France title stripped by USADA in 2012.
## 3 Alleged drug use during 1994 due to high haematocrit levels.
## 4         2001 Tour de France title stripped by USADA in 2012.
## 5 Alleged drug use during 1995 due to high haematocrit levels.
## 6                 Found guilty of a doping offense by the CAS.

The xlsReadWrite package provides an alternative to xlsx, but it currently only works with 32-bit R installations, and only on Windows. There are also some other packages designed to work with Excel. RExcel and excel.link use COM connections to control Excel from R, and WriteXLS uses Perl to write to Excel files. The gnumeric package provides functions for reading Gnumeric spreadsheets.

The counterpart to read.xlsx2 is (you guessed it) write.xlsx2. It works the same way as write.csv, taking a data frame and a filename. Unless you really need to use Excel spreadsheets, you are better off saving your data in a text format for portability, so use this with caution.
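For example, a sketch of writing the bike data back out to a new spreadsheet (the output path is arbitrary):

#write the bike data frame to an Excel file, dropping the row names
write.xlsx2(
  bike_data,
  "Data/Cleaned/alpe dhuez fastest times.xlsx",
  row.names = FALSE
)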

Reading SAS, Stata, SPSS, and MATLAB Files

If you are collaborating with statisticians in other organizations, they may try to send you files from another statistical package. The foreign package contains methods to read SAS permanent datasets[39] (SAS7BDAT files) using read.ssd, Stata DTA files with read.dta, and SPSS data files with read.spss. Each of these files can be written with write.foreign.

MATLAB binary data files (Level 4 and Level 5) can be read and written using readMat and writeMat in the R.matlab package.
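As a hedged sketch, with hypothetical file paths:

#import Stata and SPSS files with the foreign package
library(foreign)
stata_data <- read.dta("path/to/a stata file.dta")
spss_data  <- read.spss("path/to/an spss file.sav", to.data.frame = TRUE)

#import a MATLAB binary data file with R.matlab
library(R.matlab)
matlab_data <- readMat("path/to/a matlab file.mat")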

Reading Other File Types

R can read data from many other types of files.

It can read Hierarchical Data Format v5 (HDF5) files via the h5r package (and the rhdf5 package on Bioconductor), and Network Common Data Format (NetCDF) files via the ncdf package.

It can read ESRI ArcGIS spatial data files via the maptools and shapefiles packages (and older ArcInfo files via the RArcInfo package).

It can read raster picture formats via the jpeg, png, tiff, rtiff, and readbitmap packages.

It can read a variety of genomics data formats using packages on Bioconductor. Most notably, it can read microarray data in GenePix GPR files (a.k.a. axon text files) via the RPPanalyzer package; Variant Call Format (VCF) files for gene sequence variants via the vcf2geno package; binary sequence alignment data via rbamtools (which provides an interface to SAMtools); and Luminex bead array assay files via the lxb package.

Finally, there are loads of miscellaneous formats scattered in other packages. A nonexhaustive list includes the 4dfp and tractor.base packages for MRI images, the IgorR package for WaveMetrics Igor binary format files, the GENEAread package for GENEActiv watch accelerometer data, the emme2 package for Inro Software EMME v2 databank files, the SEER2R package for SEER cancer datasets, the rprotobuf package for Google’s protocol buffers, the readBrukerFlexData package for mass spectrometry data in Bruker flex format, the M3 package for community multiscale air quality models in Models-3 files, and the Read.isi package for ISI codebooks from the World Fertility Survey.

While most of these packages are completely useless to most people, the fact that R can access so many niches in so many different areas is rather astounding.

Web Data

The Internet contains a wealth of data, but it’s hard work (and not very scalable) to manually visit a website, download a data file, and then read it into R from your hard drive.

Fortunately, R has a variety of ways to import data from web sources; retrieving the data programmatically makes it possible to collect much more of it with much less effort.

Sites with an API

Several packages exist that download data directly into R using a website’s application programming interface (API). For example, the World Bank makes its World Development Indicators data publicly available, and the WDI package lets you easily import the data without leaving R. To run the next example, you first need to install the WDI package:

install.packages("WDI")
library(WDI)
#list all available datasets
wdi_datasets <- WDIsearch()
head(wdi_datasets)
##      indicator
## [1,] "BG.GSR.NFSV.GD.ZS"
## [2,] "BM.KLT.DINV.GD.ZS"
## [3,] "BN.CAB.XOKA.GD.ZS"
## [4,] "BN.CUR.GDPM.ZS"
## [5,] "BN.GSR.FCTY.CD.ZS"
## [6,] "BN.KLT.DINV.CD.ZS"
##      name
## [1,] "Trade in services (% of GDP)"
## [2,] "Foreign direct investment, net outflows (% of GDP)"
## [3,] "Current account balance (% of GDP)"
## [4,] "Current account balance excluding net official capital grants (% of GDP)"
## [5,] "Net income (% of GDP)"
## [6,] "Foreign direct investment (% of GDP)"
#retrieve one of them
wdi_trade_in_services <- WDI(
  indicator = "BG.GSR.NFSV.GD.ZS"
)
str(wdi_trade_in_services)
## 'data.frame':  984 obs. of  4 variables:
##  $ iso2c            : chr  "1A" "1A" "1A" ...
##  $ country          : chr  "Arab World" "Arab World" "Arab World" ...
##  $ BG.GSR.NFSV.GD.ZS: num  17.5 NA NA NA ...
##  $ year             : num  2005 2004 2003 2002 ...

The SmarterPoland package provides a similar wrapper to Polish government data. quantmod provides access to stock tickers (Yahoo!’s by default, though several other sources are available):

library(quantmod)
#If you are using a version before 0.5.0 then set this option
#or pass auto.assign = FALSE to getSymbols.
options(getSymbols.auto.assign = FALSE)
microsoft <- getSymbols("MSFT")
head(microsoft)
##            MSFT.Open MSFT.High MSFT.Low MSFT.Close MSFT.Volume
## 2007-01-03     29.91     30.25    29.40      29.86    76935100
## 2007-01-04     29.70     29.97    29.44      29.81    45774500
## 2007-01-05     29.63     29.75    29.45      29.64    44607200
## 2007-01-08     29.65     30.10    29.53      29.93    50220200
## 2007-01-09     30.00     30.18    29.73      29.96    44636600
## 2007-01-10     29.80     29.89    29.43      29.66    55017400
##            MSFT.Adjusted
## 2007-01-03         25.83
## 2007-01-04         25.79
## 2007-01-05         25.64
## 2007-01-08         25.89
## 2007-01-09         25.92
## 2007-01-10         25.66

The twitteR package provides access to Twitter’s users and their tweets. There’s a little bit of setup involved (due to Twitter’s API requiring you to create an application and register it using OAuth; read the vignette for the package for setup instructions), but after that the package makes it easy to import Twitter data for network analysis, or simply look at tweets while pretending to work.

Scraping Web Pages

R has its own web server built in, so some functions for reading data are Internet-enabled by default. read.table (and its derivatives, like read.csv) can accept a URL rather than a local file, and will download a copy to a temp file before importing the data. For example, economic researcher Justin Rao’s website has NBA basketball salary data for 2002 through 2008:

salary_url <- "http://www.justinmrao.com/salary_data.csv"
salary_data <- read.csv(salary_url)
str(salary_data)

Since accessing a large file over the Internet can be slow, if the file is going to be used often, a better strategy is to download the file using download.file to create a local copy, and then import that:

salary_url <- "http://www.justinmrao.com/salary_data.csv"
local_copy <- "my local copy.csv"
download.file(salary_url, local_copy)
salary_data <- read.csv(local_copy)

More advanced access to web pages can be achieved through the RCurl package, which provides access to the libcurl network client interface library. This is particularly useful if your data is contained inside an HTML or XML page, rather than a standard data format (like CSV) that just happens to be on the Web.

The next example retrieves the current date and time in several time zones from the United States Naval Observatory Time Service Department web page. The function getURL retrieves the page as a character string:

library(RCurl)
time_url <- "http://tycho.usno.navy.mil/cgi-bin/timer.pl"
time_page <- getURL(time_url)
cat(time_page)
## <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
## <html>
## <body>
## <TITLE>What time is it?</TITLE>
## <H2> US Naval Observatory Master Clock Time</H2> <H3><PRE>
## <BR>Jul. 17, 20:43:37 UTC            Universal Time
## <BR>Jul. 17, 04:43:37 PM EDT Eastern Time
## <BR>Jul. 17, 03:43:37 PM CDT Central Time
## <BR>Jul. 17, 02:43:37 PM MDT Mountain Time
## <BR>Jul. 17, 01:43:37 PM PDT Pacific Time
## <BR>Jul. 17, 12:43:37 PM AKDT        Alaska Time
## <BR>Jul. 17, 10:43:37 AM HAST        Hawaii-Aleutian Time
## </PRE></H3><P><A HREF="http://www.usno.navy.mil"> US Naval Observatory</A>
##
## </body></html>

The next step is almost always to parse the page using htmlParse (or a related function) from the XML package. That allows you to extract the useful nodes. In the next example, splitting on "\n" (newlines) retrieves each time line, and splitting on "\t" (tabs) retrieves time/time zone pairs:

library(XML)
time_doc <- htmlParse(time_page)
pre <- xpathSApply(time_doc, "//pre")[[1]]
values <- strsplit(xmlValue(pre), "\n")[[1]][-1]
strsplit(values, "\t+")
## [[1]]
## [1] "Jul. 17, 20:43:37 UTC" "Universal Time"
##
## [[2]]
## [1] "Jul. 17, 04:43:37 PM EDT" "Eastern Time"
##
## [[3]]
## [1] "Jul. 17, 03:43:37 PM CDT" "Central Time"
##
## [[4]]
## [1] "Jul. 17, 02:43:37 PM MDT" "Mountain Time"
##
## [[5]]
## [1] "Jul. 17, 01:43:37 PM PDT" "Pacific Time"
##
## [[6]]
## [1] "Jul. 17, 12:43:37 PM AKDT" "Alaska Time"
##
## [[7]]
## [1] "Jul. 17, 10:43:37 AM HAST" "Hawaii-Aleutian Time"

The httr package is based on RCurl and provides syntactic sugar to make some tasks go down easier. The httr equivalent of RCurl’s getURL is GET, and the content function retrieves the page’s content, parsing it in the process. In the next example, we pass useInternalNodes = TRUE to mimic the behavior of htmlParse and repeat the action of the previous example:

library(httr)
time_page <- GET(time_url)
time_doc <- content(time_page, useInternalNodes = TRUE)

Accessing Databases

Where data has to be accessed by many people, it is often best stored in a relational database. There are many database management systems (DBMSs) for working with relational databases, and R can connect to all the common ones. The DBI package provides a unified syntax for accessing several DBMSs—currently SQLite, MySQL/MariaDB, PostgreSQL, and Oracle are supported, as well as a wrapper to the Java Database Connectivity (JDBC) API. (Connections to SQL Server use a different system, as we’ll see below.)

To connect to an SQLite database, you must first install and load the DBI package and the backend package RSQLite:

library(DBI)
library(RSQLite)

Then you define the database driver to be of type “SQLite” and set up a connection to the database, in this case by naming the file:

driver <- dbDriver("SQLite")
db_file <- system.file(
  "extdata",
  "crabtag.sqlite",
  package = "learningr"
)
conn <- dbConnect(driver, db_file)

The equivalent for a MySQL database would be to load the RMySQL package and set the driver type to be “MySQL”:

driver <- dbDriver("MySQL")
db_file <- "path/to/MySQL/database"
conn <- dbConnect(driver, db_file)

The PostgreSQL, Oracle, and JDBC equivalents require the RPostgreSQL, ROracle, and RJDBC packages, respectively, and their driver names match the database names, just as with SQLite and MySQL.

To retrieve data from the databases you write a query, which is just a string containing SQL commands, and send it to the database with dbGetQuery. In this next example, SELECT * FROM IdBlock means “get every column of the table named IdBlock”:[40]

query <- "SELECT * FROM IdBlock"
(id_block <- dbGetQuery(conn, query))
##   Tag ID Firmware Version No Firmware Build Level
## 1 A03401                   2                   70

Then, after you’ve finished manipulating the database, you need to clean up by disconnecting and unloading the driver:

dbDisconnect(conn)
## [1] TRUE
dbUnloadDriver(driver)
## [1] TRUE

It is very easy to accidentally leave connections open, especially if an error occurs while you are connected. One way to avoid this is to wrap your database code into a function, and use on.exit to make sure that the cleanup code always runs. on.exit runs R code whenever its parent function exits, whether it finishes correctly or throws an error. We can rewrite the previous example with safer code as follows:

query_crab_tag_db <- function(query)
{
  driver <- dbDriver("SQLite")
  db_file <- system.file(
    "extdata",
    "crabtag.sqlite",
    package = "learningr"
  )
  conn <- dbConnect(driver, db_file)
  on.exit(
    {
      #this code block runs at the end of the function,
      #even if an error is thrown
      dbDisconnect(conn)
      dbUnloadDriver(driver)
    }
  )
  dbGetQuery(conn, query)
}

We can pass any SQL code to the function to query the crab tag database:

query_crab_tag_db("SELECT * FROM IdBlock")
##   Tag ID Firmware Version No Firmware Build Level
## 1 A03401                   2                   70

In this case, the DBI package provides a convenience function that saves us having to write our own SQL code. dbReadTable does as you might expect: it reads a table from the connected database (use dbListTables(conn) if you can’t remember the name of the table that you want):

dbReadTable(conn, "idblock")
##   Tag.ID Firmware.Version.No Firmware.Build.Level
## 1 A03401                   2                   70
dbDisconnect(conn)
## [1] TRUE
dbUnloadDriver(driver)
## [1] TRUE

If your database isn’t one of the types listed here, the RODBC package is an alternative that uses ODBC database connections; this is particularly useful for connecting to SQL Server or Access databases. The functions have different names than their DBI counterparts, but the principles are very similar. You need to set up an ODBC data source on your machine (via the Control Panel under Windows; search for “ODBC” in the Start Menu) before R can connect. Figure 12-1 shows the ODBC registration wizard on Windows 7.

Figure 12-1. Registering an ODBC data source with the ODBC Data Source Administrator

Then call odbcConnect to connect, sqlQuery to run a query, and odbcClose to clean up afterward:

library(RODBC)
conn <- odbcConnect("my data source name")
id_block <- sqlQuery(conn, "SELECT * FROM IdBlock")
odbcClose(conn)

Methods for accessing NoSQL databases (short for “Not only SQL”; lightweight data stores that are more easily scalable than traditional SQL relational databases) from R are less mature. MongoDB can be accessed via the RMongo or rmongodb packages, Cassandra can be accessed from the RCassandra package, and CouchDB can be accessed from the R4CouchDB package (not yet on CRAN, but available on GitHub).

Summary

  • Datasets supplied with R or packages can be made available with the data function.
  • You can import data into R from a very wide range of external sources.
  • read.table and its variants read rectangular data.
  • readLines reads text files with nonstandard structures.
  • You can read HTML and XML data with the XML package.
  • The RJSONIO, rjson, and yaml packages read JSON/YAML.
  • There are lots of packages for reading Excel files, including xlsx.
  • The foreign package reads data from other statistics software.
  • There are lots of packages for manipulating databases, including DBI and RODBC.

Test Your Knowledge: Quiz

Question 12-1
How do you find all the datasets built into R and the packages on your machine?
Question 12-2
What is the difference between the read.csv and read.csv2 functions?
Question 12-3
How would you import data from an Excel spreadsheet into R?
Question 12-4
How would you import data from a CSV file found on the Internet?
Question 12-5
The DBI package provides a consistent interface to several database management systems. Which ones are supported?

Test Your Knowledge: Exercises

Exercise 12-1
In the extdata folder of the learningr package, there is a file named hafu.csv, containing data on mixed-race manga characters. Import the data into a data frame. [5]
Exercise 12-2
Also in the extdata folder of learningr is an Excel file named multi-drug-resistant gonorrhoea infection.xls. Import the data from the first (and only) sheet into a data frame. Hint: this is a little easier if you view the file in a spreadsheet program first. LibreOffice is free and accomplishes this task easily. [10]
Exercise 12-3
From the crab tag SQLite database described in this chapter, import the contents of the Daylog table into a data frame. [10]


[37] Or periods, if you are American.

[38] Specifically, migrating from the eastern North Sea near Germany to the western North Sea near the UK.

[39] There isn’t currently a method of importing the older SAS SD2 files into R. The easiest way of dealing with a file in this format is to open it with the free SAS Universal Viewer and resave it as CSV.

[40] Any database administrators worth their salt will tell you that writing SELECT * is lazy, and you should jolly well explicitly name your columns before that code goes anywhere near their servers, thank you very much.
