Data can come from many sources. R comes with many datasets built in, and there is more data in many of the add-on packages. R can read data from a wide variety of sources and in a wide variety of formats. This chapter covers importing data from text files (including spreadsheet-like data in comma- or tab-delimited format, XML, and JSON), binary files (Excel spreadsheets and data from other analysis software), websites, and databases.
After reading this chapter, you should be able to import data into R from many common sources and formats.
One of the packages in the base R distribution is called datasets, and it is entirely filled with example datasets. While you’ll be lucky if any of them are suited to your particular area of research, they are ideal for testing your code and for exploring new techniques. Many other packages also contain datasets. You can see all the datasets that are available in the packages that you have loaded using the data function:
data()
For a more complete list, including data from all packages that have been installed, use this invocation:
data(package = .packages(TRUE))
To access the data in any of these datasets, call the data function, this time passing the name of the dataset and the package where the data is found (if the package has been loaded, then you can omit the package argument):
data("kidney", package = "survival")
Now the kidney data frame can be used just like your own variables:

head(kidney)
##   id time status age sex disease frail
## 1  1    8      1  28   1   Other   2.3
## 2  1   16      1  28   1   Other   2.3
## 3  2   23      1  48   2      GN   1.9
## 4  2   13      0  48   2      GN   1.9
## 5  3   22      1  32   1   Other   1.2
## 6  3   28      1  32   1   Other   1.2
There are many, many formats and standards of text documents for storing data. Common formats for storing data are delimiter-separated values (CSV or tab-delimited), eXtensible Markup Language (XML), JavaScript Object Notation (JSON), and YAML (which recursively stands for YAML Ain’t Markup Language). Other sources of text data are less well-structured—a book, for example, contains text data without any formal (that is, standardized and machine parsable) structure.
The main advantage of storing data in text files is that they can be read by more or less all other data analysis software and by humans. This makes your data more widely reusable by others.
Rectangular (spreadsheet-like) data is commonly stored in delimited-value files, particularly comma-separated values (CSV) and tab-delimited values files. The read.table function reads these delimited files and stores the results in a data frame. In its simplest form, it just takes the path to a text file and imports the contents.
RedDeerEndocranialVolume.dlm is a whitespace-delimited file containing measurements of the endocranial volume of some red deer, measured using different techniques. (For those of you with an interest in deer skulls, the methods are computer tomography; filling the skull with glass beads; measuring the length, width, and height with calipers; and using Finarelli’s equation. A second measurement was taken in some cases to get an idea of the accuracy of the techniques. I’ve been assured that the deer were already long dead before they had their skulls filled with beads!) The data file can be found inside the extdata folder in the learningr package. The first few rows are shown in Table 12-1.
Table 12-1. Sample data from RedDeerEndocranialVolume.dlm
SkullID | VolCT | VolBead | VolLWH | VolFinarelli | VolCT2 | VolBead2 | VolLWH2
--------|-------|---------|--------|--------------|--------|----------|--------
DIC44   | 389   | 375     | 1484   | 337          |        |          |
B11     | 389   | 370     | 1722   | 377          |        |          |
DIC90   | 352   | 345     | 1495   | 328          |        |          |
DIC83   | 388   | 370     | 1683   | 377          |        |          |
DIC787  | 375   | 355     | 1458   | 328          |        |          |
DIC1573 | 325   | 320     | 1363   | 291          |        |          |
C120    | 346   | 335     | 1250   | 289          | 346    | 330      | 1264
C25     | 302   | 295     | 1011   | 250          | 303    | 295      | 1009
F7      | 379   | 360     | 1621   | 347          | 375    | 365      | 1647
The data has a header row, so we need to pass the argument header = TRUE to read.table. Since a second measurement wasn’t always taken, not all the lines are complete. Passing the argument fill = TRUE makes read.table substitute NA values for the missing fields. The system.file function in the following example is used to locate files that are inside a package (in this case, the RedDeerEndocranialVolume.dlm file in the extdata folder of the learningr package).
library(learningr)
deer_file <- system.file("extdata", "RedDeerEndocranialVolume.dlm",
                         package = "learningr")
deer_data <- read.table(deer_file, header = TRUE, fill = TRUE)
str(deer_data, vec.len = 1)  #vec.len alters the amount of output
## 'data.frame':    33 obs. of  8 variables:
##  $ SkullID     : Factor w/ 33 levels "A4","B11","B12",..: 14 ...
##  $ VolCT       : int  389 389 ...
##  $ VolBead     : int  375 370 ...
##  $ VolLWH      : int  1484 1722 ...
##  $ VolFinarelli: int  337 377 ...
##  $ VolCT2      : int  NA NA ...
##  $ VolBead2    : int  NA NA ...
##  $ VolLWH2     : int  NA NA ...
head(deer_data)
##   SkullID VolCT VolBead VolLWH VolFinarelli VolCT2 VolBead2 VolLWH2
## 1   DIC44   389     375   1484          337     NA       NA      NA
## 2     B11   389     370   1722          377     NA       NA      NA
## 3   DIC90   352     345   1495          328     NA       NA      NA
## 4   DIC83   388     370   1683          377     NA       NA      NA
## 5  DIC787   375     355   1458          328     NA       NA      NA
## 6 DIC1573   325     320   1363          291     NA       NA      NA
Notice that the class of each column has been automatically determined, and row and column names have been automatically assigned. The column names are (by default) forced to be valid variable names (via make.names), and if row names aren’t provided the rows are simply numbered 1, 2, 3, and so on.
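To see what that name cleaning does, here is a tiny sketch (the header strings are made up, not from the deer data):

```r
# make.names is what read.table uses (by default) to turn arbitrary
# header strings into valid R variable names: illegal characters
# become dots, and names starting with a digit gain an "X" prefix.
make.names(c("Skull ID", "Vol(CT)", "2ndMeasurement"))
## [1] "Skull.ID"        "Vol.CT."         "X2ndMeasurement"
```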
There are lots of arguments to specify how the file will be read; perhaps the most important is sep, which determines the character to use as a separator between fields. You can also specify how many lines of data to read via nrow, and how many lines at the start of the file to skip. More advanced options include the ability to override the default row names, column names, and classes, and to specify the character encoding of the input file and how string input columns should be declared.
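As a quick sketch of those arguments (using a made-up inline dataset passed via read.table's text argument, rather than a file):

```r
# skip = 1 ignores the title line, sep = ";" sets the field separator,
# and nrow = 2 stops reading after two data rows.
sample_data <- "my title line
a;b;c
1;2;3
4;5;6
7;8;9"
read.table(text = sample_data, sep = ";", skip = 1, header = TRUE, nrow = 2)
##   a b c
## 1 1 2 3
## 2 4 5 6
```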
There are several convenience wrapper functions to read.table. read.csv sets the default separator to a comma and assumes that the data has a header row. read.csv2 is its European cousin, using a comma for decimal places and a semicolon as a separator. Likewise, read.delim and read.delim2 import tab-delimited files with full stops[37] or commas for decimal places, respectively.
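To illustrate the difference, here is a minimal made-up example (again using the text argument):

```r
# read.csv2 expects semicolon separators and commas as decimal marks,
# as is conventional in much of continental Europe.
euro_data <- "x;y
1,5;2,25
3,5;4,75"
read.csv2(text = euro_data)
##     x    y
## 1 1.5 2.25
## 2 3.5 4.75
```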
Back in August 2008, scientists from the Centre for Environment, Fisheries, and Aquaculture Science (CEFAS) in Lowestoft, UK, attached a tag with a pressure sensor and a temperature sensor to a brown crab and dropped it into the North Sea. The crab then spent just over a year doing crabby things[38] before being caught by fishermen, who returned the tag to CEFAS.
The data from this tag is stored in a CSV file, along with some metadata. The first few rows of the file look like this:
Comment :- clock reset to download data               |
The following data are the ID block contents          |
Firmware Version No                                   | 2
Firmware Build Level                                  | 70
The following data are the Tag notebook contents      |
Mission Day                                           | 405
Last Deployment Date                                  | 08/08/2008 09:55:00
Deployed by Host Version                              | 5.2.0
Downloaded by Host Version                            | 6.0.0
Last Clock Set Date                                   | 05/01/2010 10:34:00
The following data are the Lifetime notebook contents |
Tag ID                                                | A03401
Pressure Range                                        | 10
No of sensors                                         | 2
In this case, we can’t just read everything with a single call to read.csv, since different pieces of data have different numbers of fields, and indeed different fields. We need to use the skip and nrow arguments of read.csv to specify which bits of the file to read:
crab_file <- system.file("extdata", "crabtag.csv", package = "learningr")
(crab_id_block <- read.csv(crab_file, header = FALSE, skip = 3, nrow = 2))
##                     V1 V2
## 1  Firmware Version No  2
## 2 Firmware Build Level 70
(crab_tag_notebook <- read.csv(crab_file, header = FALSE, skip = 8, nrow = 5))
##                           V1                  V2
## 1                Mission Day                 405
## 2       Last Deployment Date 08/08/2008 09:55:00
## 3   Deployed by Host Version               5.2.0
## 4 Downloaded by Host Version               6.0.0
## 5        Last Clock Set Date 05/01/2010 10:34:00
(crab_lifetime_notebook <- read.csv(crab_file, header = FALSE, skip = 15, nrow = 3))
##               V1     V2
## 1         Tag ID A03401
## 2 Pressure Range     10
## 3  No of sensors      2
The colbycol and sqldf packages contain functions that allow you to read part of a CSV file into R. This can provide a useful speed-up if you don’t need all the columns or all the rows.
For really low-level control when importing this sort of file, you can use the scan function, on which read.table is based. Ordinarily, you should never have to resort to scan, but it can be useful for malformed or nonstandard files.
If your data has been exported from another language, you may need to pass the na.strings argument to read.table.

For data exported from SQL, use na.strings = "NULL".
For data exported from SAS or Stata, use na.strings = ".".
For data exported from Excel, use na.strings = c("", "#N/A", "#DIV/0!", "#NUM!").
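For example, here is a sketch with some made-up Excel-flavored input:

```r
# Without na.strings, the error strings would be read as text,
# forcing the whole column to be non-numeric.
excel_style <- "score
42
#N/A
#DIV/0!"
read.csv(text = excel_style, na.strings = c("", "#N/A", "#DIV/0!", "#NUM!"))
##   score
## 1    42
## 2    NA
## 3    NA
```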
The opposite task, writing files, is generally simpler than reading files, since you don’t need to worry about oddities in the file—you usually want to create something standard. read.table and read.csv have the obviously named counterparts write.table and write.csv.
Both functions take a data frame and a file path to write to. They also provide a few options to customize the output (whether or not to include row names and what the character encoding of the output file should be, for example):
write.csv(
  crab_lifetime_notebook,
  "Data/Cleaned/crab lifetime data.csv",
  row.names = FALSE,
  fileEncoding = "utf8"
)
Not all text files have a well-defined structure like delimited files do. If the structure of the file is weak, it is often easier to read in the file as lines of text and then parse or manipulate the contents afterward. readLines (notice the capital “L”) provides such a facility. It accepts a path to a file (or a file connection) and, optionally, a maximum number of lines to read. Here we import the Project Gutenberg version of Shakespeare’s The Tempest:
text_file <- system.file(
  "extdata",
  "Shakespeare's The Tempest, from Project Gutenberg pg2235.txt",
  package = "learningr"
)
the_tempest <- readLines(text_file)
the_tempest[1926:1927]
## [1] "  Ste. Foure legges and two voyces; a most delicate"
## [2] "Monster: his forward voyce now is to speake well of"
writeLines performs the reverse operation to readLines. It takes a character vector and a file to write to:
writeLines(
  rev(the_tempest),  #rev reverses vectors
  "Shakespeare's The Tempest, backwards.txt"
)
XML files are widely used for storing nested data. Many standard file types and protocols are based upon it, such as RSS (Really Simple Syndication) for news feeds, SOAP (Simple Object Access Protocol) for passing structured data across computer networks, and the XHTML flavor of web pages.
Base R doesn’t ship with the capability to read XML files, but the XML package is developed by an R Core member. Install it now!

install.packages("XML")
When you import an XML file, the XML package gives you a choice of storing the result using internal nodes (that is, objects are stored with C code, the default) or R nodes. Usually you want to store things with internal nodes, because it allows you to query the node tree using XPath (more on this in a moment).
There are several functions for importing XML data, such as xmlParse and some other wrapper functions that use slightly different defaults:
library(XML)
xml_file <- system.file("extdata", "options.xml", package = "learningr")
r_options <- xmlParse(xml_file)
One of the problems with using internal nodes is that summary functions like str and head don’t work with them. To use R-level nodes, set useInternalNodes = FALSE (or use xmlTreeParse, which sets this as the default):
xmlParse(xml_file, useInternalNodes = FALSE)
xmlTreeParse(xml_file)  #the same
XPath is a language for interrogating XML documents, letting you find nodes that correspond to some filter. In this next example we look anywhere in the document (//) for a node named variable where ([]) the name attribute (@) contains the string warn.
xpathSApply(r_options, "//variable[contains(@name, 'warn')]")
## [[1]]
## <variable name="nwarnings" type="numeric">
##   <value>50</value>
## </variable>
##
## [[2]]
## <variable name="warn" type="numeric">
##   <value>0</value>
## </variable>
##
## [[3]]
## <variable name="warning_length" type="numeric">
##   <value>1000</value>
## </variable>
This sort of querying is very useful for extracting data from web pages. The equivalent functions for importing HTML pages are named as you might expect, htmlParse and htmlTreeParse, and they behave in the same way.
XML is also very useful for serializing (a.k.a. “saving”) objects in a format that can be read by most other pieces of software. The XML package doesn’t provide serialization functionality, but it is available via the makexml function in the Runiversal package. The options.xml file was created with this code:
library(Runiversal)
ops <- as.list(options())
cat(makexml(ops), file = "options.xml")
The main problems with XML are that it is very verbose, and you need to explicitly specify the type of the data (it can’t tell the difference between a string and a number by default), which makes it even more verbose. When file sizes are important (such as when you are transferring big datasets across a network), this verbosity becomes problematic.
YAML and its subset JSON were invented to solve these problems. They are much better suited to transporting many datasets—particularly numeric data and arrays—over networks. JSON is the de facto standard for web applications to pass data between themselves.
There are two packages for dealing with JSON data: RJSONIO and rjson. For a long time rjson had performance problems, so the only package that could be recommended was RJSONIO. The performance issues have now been fixed, so it’s a much closer call. For most cases, it now doesn’t matter which package you use. The differences occur when you encounter malformed or nonstandard JSON.
RJSONIO is generally more forgiving than rjson when reading incorrect JSON. Whether this is a good thing or not depends upon your use case. If you think it is better to import JSON data with minimal fuss, then RJSONIO is best. If you want to be alerted to problems with the JSON data (perhaps it was generated by a colleague—I’m sure you would never generate malformed JSON), then rjson is best.
Fortunately, both packages have identically named functions for reading and writing JSON data, so it is easy to swap between them. In the following example, the double colons, ::, are used to distinguish which package each function should be taken from (if you only load one of the two packages, then you don’t need the double colons):
library(RJSONIO)
library(rjson)
jamaican_city_file <- system.file("extdata", "Jamaican Cities.json",
                                  package = "learningr")
(jamaican_cities_RJSONIO <- RJSONIO::fromJSON(jamaican_city_file))
## $Kingston
## $Kingston$population
## [1] 587798
##
## $Kingston$coordinates
## longitude  latitude
##     17.98     76.80
##
##
## $`Montego Bay`
## $`Montego Bay`$population
## [1] 96488
##
## $`Montego Bay`$coordinates
## longitude  latitude
##     18.47     77.92
(jamaican_cities_rjson <- rjson::fromJSON(file = jamaican_city_file))
## $Kingston
## $Kingston$population
## [1] 587798
##
## $Kingston$coordinates
## $Kingston$coordinates$longitude
## [1] 17.98
##
## $Kingston$coordinates$latitude
## [1] 76.8
##
##
## $`Montego Bay`
## $`Montego Bay`$population
## [1] 96488
##
## $`Montego Bay`$coordinates
## $`Montego Bay`$coordinates$longitude
## [1] 18.47
##
## $`Montego Bay`$coordinates$latitude
## [1] 77.92
Notice that RJSONIO simplifies the coordinates for each city to be a vector. This behavior can be turned off with simplify = FALSE, resulting in exactly the same object as the one generated by rjson.
Annoyingly, the JSON spec doesn’t allow infinite or NaN values, and it’s a little fuzzy on what a missing number should look like. The two packages deal with these values differently—RJSONIO maps NaN and NA to JSON’s null value but preserves positive and negative infinity, while rjson converts all these values to strings:
special_numbers <- c(NaN, NA, Inf, -Inf)
RJSONIO::toJSON(special_numbers)
## [1] "[ null, null, Inf, -Inf ]"
rjson::toJSON(special_numbers)
## [1] "[\"NaN\",\"NA\",\"Inf\",\"-Inf\"]"
Since both these methods are hacks to deal with JSON’s limited spec, if you find yourself dealing with these special number types a lot (or want to write comments in your data object), then you are better off using YAML. The yaml package has two functions for importing YAML data. yaml.load accepts a string of YAML and converts it to an R object, and yaml.load_file does the same, but treats its string input as a path to a file containing YAML:
library(yaml)
yaml.load_file(jamaican_city_file)
## $Kingston
## $Kingston$population
## [1] 587798
##
## $Kingston$coordinates
## $Kingston$coordinates$longitude
## [1] 17.98
##
## $Kingston$coordinates$latitude
## [1] 76.8
##
##
## $`Montego Bay`
## $`Montego Bay`$population
## [1] 96488
##
## $`Montego Bay`$coordinates
## $`Montego Bay`$coordinates$longitude
## [1] 18.47
##
## $`Montego Bay`$coordinates$latitude
## [1] 77.92
as.yaml performs the opposite task, converting R objects to YAML strings.
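For example, a small round trip (the list contents here are invented for illustration):

```r
library(yaml)
balloon <- list(color = "red", volume = 3.5, knotted = TRUE)
balloon_yaml <- as.yaml(balloon)
cat(balloon_yaml)                #print the YAML representation
yaml.load(balloon_yaml)          #and convert it back to an R list
```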
Many pieces of software store their data in binary formats (some proprietary, some conforming to publicly defined standards). Binary formats are often smaller than their text equivalents, so performance gains are usually possible by using a binary format, at the expense of human readability.
Many binary file formats are proprietary, which goes against free software principles. If you have the option, it is usually best to avoid such formats in order to stop your data being locked into a platform over which you lack control.
Microsoft Excel is the world’s most popular spreadsheet package, and very possibly the world’s most popular data analysis tool. Unfortunately, its document formats, XLS and XLSX, don’t play very nicely with other software, especially outside of a Windows environment. This means that some experimenting may be required to find a setup that works for your choice of operating system and the particular type of Excel file.
That said, the xlsx package is Java-based and cross-platform, so at least in theory it can read any Excel file on any system. It provides a choice of functions for reading Excel files: spreadsheets can be imported with read.xlsx and read.xlsx2, which do more processing in R and in Java, respectively. The choice of two engines is rather superfluous; you want read.xlsx2, since it’s faster and the underlying Java library is more mature than the R code.
The next example features the fastest times from the Alpe d’Huez mountain stage of the Tour de France bike race, along with whether or not each cyclist has been found guilty of using banned performance-enhancing drugs. The colClasses argument determines what class each column should have in the resulting data frame. It isn’t compulsory, but it saves you having to manipulate the resulting data afterward:
library(xlsx)
bike_file <- system.file("extdata", "Alpe d'Huez.xls", package = "learningr")
bike_data <- read.xlsx2(
  bike_file,
  sheetIndex = 1,
  startRow   = 2,
  endRow     = 38,
  colIndex   = 2:8,
  colClasses = c(
    "character", "numeric", "character", "integer",
    "character", "character", "character"
  )
)
head(bike_data)
##      Time NumericTime            Name Year   Nationality DrugUse
## 1 37' 35"       37.58   Marco Pantani 1997         Italy       Y
## 2 37' 36"       37.60 Lance Armstrong 2004 United States       Y
## 3 38' 00"       38.00   Marco Pantani 1994         Italy       Y
## 4 38' 01"       38.02 Lance Armstrong 2001 United States       Y
## 5 38' 04"       38.07   Marco Pantani 1995         Italy       Y
## 6 38' 23"       38.38     Jan Ullrich 1997       Germany       Y
##                                                    Allegations
## 1 Alleged drug use during 1997 due to high haematocrit levels.
## 2         2004 Tour de France title stripped by USADA in 2012.
## 3 Alleged drug use during 1994 due to high haematocrit levels.
## 4         2001 Tour de France title stripped by USADA in 2012.
## 5 Alleged drug use during 1995 due to high haematocrit levels.
## 6                 Found guilty of a doping offense by the CAS.
The xlsReadWrite package provides an alternative to xlsx, but it currently only works with 32-bit R installations, and only on Windows. There are also some other packages designed to work with Excel. RExcel and excel.link use COM connections to control Excel from R, and WriteXLS uses Perl to write to Excel files. The gnumeric package provides functions for reading Gnumeric spreadsheets.
The counterpart to read.xlsx2 is (you guessed it) write.xlsx2. It works the same way as write.csv, taking a data frame and a filename. Unless you really need to use Excel spreadsheets, you are better off saving your data in a text format for portability, so use this with caution.
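For example (a sketch; this assumes the xlsx package is loaded and that bike_data exists from the earlier example, and the output filename is invented):

```r
# Write the cycling data back out to a new spreadsheet,
# dropping the row names as we did with write.csv.
write.xlsx2(bike_data, "alpe_dhuez_copy.xlsx", row.names = FALSE)
```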
If you are collaborating with statisticians in other organizations, they may try to send you files from another statistical package. The foreign package contains methods to read SAS permanent datasets[39] (SAS7BDAT files) using read.ssd, Stata DTA files with read.dta, and SPSS data files with read.spss. Each of these file types can be written with write.foreign.
MATLAB binary data files (Level 4 and Level 5) can be read and written using readMat and writeMat in the R.matlab package.
R can read data from many other types of files.
It can read Hierarchical Data Format v5 (HDF5) files via the h5r package (and the rhdf5 package on Bioconductor), and Network Common Data Format (NetCDF) files via the ncdf package.
It can read ESRI ArcGIS spatial data files via the maptools and shapefiles packages (and older ArcInfo files via the RArcInfo package).
It can read raster picture formats via the jpeg, png, tiff, rtiff, and readbitmap packages.
It can read a variety of genomics data formats using packages on Bioconductor. Most notably, it can read microarray data in GenePix GPR files (a.k.a. axon text files) via the RPPanalyzer package; Variant Call Format (VCF) files for gene sequence variants via the vcf2geno package; binary sequence alignment data via rbamtools (which provides an interface to SAMtools); and Luminex bead array assay files via the lxb package.
Finally, there are loads of miscellaneous formats scattered in other packages. A nonexhaustive list includes the 4dfp and tractor.base packages for MRI images, the IgorR package for WaveMetrics Igor binary format files, the GENEAread package for GENEActiv watch accelerometer data, the emme2 package for Inro Software EMME v2 databank files, the SEER2R package for SEER cancer datasets, the rprotobuf package for Google’s protocol buffers, the readBrukerFlexData package for mass spectrometry data in Bruker flex format, the M3 package for community multiscale air quality models in Models-3 files, and the Read.isi package for ISI codebooks from the World Fertility Survey.
While most of these packages are completely useless to most people, the fact that R can access so many niches in so many different areas is rather astounding.
The Internet contains a wealth of data, but it’s hard work (and not very scalable) to manually visit a website, download a data file, and then read it into R from your hard drive.
Fortunately, R has a variety of ways to import data from web sources; retrieving the data programmatically makes it possible to collect much more of it with much less effort.
Several packages exist that download data directly into R using a website’s application programming interface (API). For example, the World Bank makes its World Development Indicators data publicly available, and the WDI package lets you easily import the data without leaving R. To run the next example, you first need to install the WDI package:
install.packages("WDI")
library(WDI)
#list all available datasets
wdi_datasets <- WDIsearch()
head(wdi_datasets)
##      indicator
## [1,] "BG.GSR.NFSV.GD.ZS"
## [2,] "BM.KLT.DINV.GD.ZS"
## [3,] "BN.CAB.XOKA.GD.ZS"
## [4,] "BN.CUR.GDPM.ZS"
## [5,] "BN.GSR.FCTY.CD.ZS"
## [6,] "BN.KLT.DINV.CD.ZS"
##      name
## [1,] "Trade in services (% of GDP)"
## [2,] "Foreign direct investment, net outflows (% of GDP)"
## [3,] "Current account balance (% of GDP)"
## [4,] "Current account balance excluding net official capital grants (% of GDP)"
## [5,] "Net income (% of GDP)"
## [6,] "Foreign direct investment (% of GDP)"
#retrieve one of them
wdi_trade_in_services <- WDI(indicator = "BG.GSR.NFSV.GD.ZS")
str(wdi_trade_in_services)
## 'data.frame':    984 obs. of  4 variables:
##  $ iso2c            : chr  "1A" "1A" "1A" ...
##  $ country          : chr  "Arab World" "Arab World" "Arab World" ...
##  $ BG.GSR.NFSV.GD.ZS: num  17.5 NA NA NA ...
##  $ year             : num  2005 2004 2003 2002 ...
The SmarterPoland package provides a similar wrapper to Polish government data. quantmod provides access to stock tickers (Yahoo!’s by default, though several other sources are available):
library(quantmod)
#If you are using a version before 0.5.0 then set this option
#or pass auto.assign = FALSE to getSymbols.
options(getSymbols.auto.assign = FALSE)
microsoft <- getSymbols("MSFT")
head(microsoft)
##            MSFT.Open MSFT.High MSFT.Low MSFT.Close MSFT.Volume
## 2007-01-03     29.91     30.25    29.40      29.86    76935100
## 2007-01-04     29.70     29.97    29.44      29.81    45774500
## 2007-01-05     29.63     29.75    29.45      29.64    44607200
## 2007-01-08     29.65     30.10    29.53      29.93    50220200
## 2007-01-09     30.00     30.18    29.73      29.96    44636600
## 2007-01-10     29.80     29.89    29.43      29.66    55017400
##            MSFT.Adjusted
## 2007-01-03         25.83
## 2007-01-04         25.79
## 2007-01-05         25.64
## 2007-01-08         25.89
## 2007-01-09         25.92
## 2007-01-10         25.66
The twitteR package provides access to Twitter’s users and their tweets. There’s a little bit of setup involved (due to Twitter’s API requiring you to create an application and register it using OAuth; read the vignette for the package for setup instructions), but after that the package makes it easy to import Twitter data for network analysis, or simply look at tweets while pretending to work.
R has its own web server built in, so some functions for reading data are Internet-enabled by default. read.table (and its derivatives, like read.csv) can accept a URL rather than a local file, and will download a copy to a temp file before importing the data. For example, economic researcher Justin Rao’s website has NBA basketball salary data for 2002 through 2008:
salary_url <- "http://www.justinmrao.com/salary_data.csv"
salary_data <- read.csv(salary_url)
str(salary_data)
Since accessing a large file over the Internet can be slow, if the file is going to be used often, a better strategy is to download the file using download.file to create a local copy, and then import that:
salary_url <- "http://www.justinmrao.com/salary_data.csv"
local_copy <- "my local copy.csv"
download.file(salary_url, local_copy)
salary_data <- read.csv(local_copy)
More advanced access to web pages can be achieved through the RCurl package, which provides access to the libcurl network client interface library. This is particularly useful if your data is contained inside an HTML or XML page, rather than a standard data format (like CSV) that just happens to be on the Web.
The next example retrieves the current date and time in several time zones from the United States Naval Observatory Time Service Department web page. The function getURL retrieves the page as a character string:
library(RCurl)
time_url <- "http://tycho.usno.navy.mil/cgi-bin/timer.pl"
time_page <- getURL(time_url)
cat(time_page)
## <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final"//EN>
## <html>
## <body>
## <TITLE>What time is it?</TITLE>
## <H2> US Naval Observatory Master Clock Time</H2> <H3><PRE>
## <BR>Jul. 17, 20:43:37 UTC   Universal Time
## <BR>Jul. 17, 04:43:37 PM EDT   Eastern Time
## <BR>Jul. 17, 03:43:37 PM CDT   Central Time
## <BR>Jul. 17, 02:43:37 PM MDT   Mountain Time
## <BR>Jul. 17, 01:43:37 PM PDT   Pacific Time
## <BR>Jul. 17, 12:43:37 PM AKDT   Alaska Time
## <BR>Jul. 17, 10:43:37 AM HAST   Hawaii-Aleutian Time
## </PRE></H3><P><A HREF="http://www.usno.navy.mil"> US Naval Observatory</A>
##
## </body></html>
The next step is almost always to parse the page using htmlParse (or a related function) from the XML package. That allows you to extract the useful nodes. In the next example, splitting on "\n" (newlines) retrieves each time line, and splitting on "\t" (tabs) retrieves time/time zone pairs:
library(XML)
time_doc <- htmlParse(time_page)
pre <- xpathSApply(time_doc, "//pre")[[1]]
values <- strsplit(xmlValue(pre), "\n")[[1]][-1]
strsplit(values, "\t+")
## [[1]]
## [1] "Jul. 17, 20:43:37 UTC"  "Universal Time"
##
## [[2]]
## [1] "Jul. 17, 04:43:37 PM EDT"  "Eastern Time"
##
## [[3]]
## [1] "Jul. 17, 03:43:37 PM CDT"  "Central Time"
##
## [[4]]
## [1] "Jul. 17, 02:43:37 PM MDT"  "Mountain Time"
##
## [[5]]
## [1] "Jul. 17, 01:43:37 PM PDT"  "Pacific Time"
##
## [[6]]
## [1] "Jul. 17, 12:43:37 PM AKDT"  "Alaska Time"
##
## [[7]]
## [1] "Jul. 17, 10:43:37 AM HAST"  "Hawaii-Aleutian Time"
The httr package is based on RCurl and provides syntactic sugar to make some tasks go down easier. The httr equivalent of RCurl’s getURL is GET, and the content function retrieves the page’s content, parsing it in the process. In the next example, we pass useInternalNodes = TRUE to mimic the behavior of htmlParse and repeat the action of the previous example:
library(httr)
time_page <- GET(time_url)
time_doc <- content(time_page, useInternalNodes = TRUE)
Where data has to be accessed by many people, it is often best stored in a relational database. There are many database management systems (DBMSs) for working with relational databases, and R can connect to all the common ones. The DBI package provides a unified syntax for accessing several DBMSs—currently SQLite, MySQL/MariaDB, PostgreSQL, and Oracle are supported, as well as a wrapper to the Java Database Connectivity (JDBC) API. (Connections to SQL Server use a different system, as we’ll see below.)
To connect to an SQLite database, you must first install and load the DBI package and the backend package RSQLite:
library(DBI)
library(RSQLite)
Then you define the database driver to be of type “SQLite” and set up a connection to the database, in this case by naming the file:
driver <- dbDriver("SQLite")
db_file <- system.file("extdata", "crabtag.sqlite", package = "learningr")
conn <- dbConnect(driver, db_file)
The equivalent for a MySQL database would be to load the RMySQL package and set the driver type to be “MySQL”:
driver <- dbDriver("MySQL")
db_file <- "path/to/MySQL/database"
conn <- dbConnect(driver, db_file)
The PostgreSQL, Oracle, and JDBC equivalents require the RPostgreSQL, ROracle, and RJDBC packages, respectively, and their database names are the driver names, just as with SQLite and MySQL.
To retrieve data from the databases you write a query, which is just a string containing SQL commands, and send it to the database with dbGetQuery. In this next example, SELECT * FROM IdBlock means “get every column of the table named IdBlock”:[40]
query <- "SELECT * FROM IdBlock"
(id_block <- dbGetQuery(conn, query))
##   Tag ID Firmware Version No Firmware Build Level
## 1 A03401                   2                   70
Then, after you’ve finished manipulating the database, you need to clean up by disconnecting and unloading the driver:
dbDisconnect(conn)
## [1] TRUE
dbUnloadDriver(driver)
## [1] TRUE
It is very easy to accidentally leave connections open, especially if an error occurs while you are connected. One way to avoid this is to wrap your database code into a function, and use on.exit to make sure that the cleanup code always runs. on.exit runs R code whenever its parent function exits, whether it finishes correctly or throws an error. We can rewrite the previous example with safer code as follows:
query_crab_tag_db <- function(query)
{
  driver <- dbDriver("SQLite")
  db_file <- system.file("extdata", "crabtag.sqlite", package = "learningr")
  conn <- dbConnect(driver, db_file)
  on.exit(
    {
      #this code block runs at the end of the function,
      #even if an error is thrown
      dbDisconnect(conn)
      dbUnloadDriver(driver)
    }
  )
  dbGetQuery(conn, query)
}
We can pass any SQL code to the function to query the crab tag database:
query_crab_tag_db("SELECT * FROM IdBlock")
##   Tag ID Firmware Version No Firmware Build Level
## 1 A03401                   2                   70
In this case, the DBI package provides a convenience function that saves us having to write our own SQL code. dbReadTable does as you might expect: it reads a table from the connected database (use dbListTables(conn) if you can’t remember the name of the table that you want):
dbReadTable(conn, "idblock")
##   Tag.ID Firmware.Version.No Firmware.Build.Level
## 1 A03401                   2                   70
If your database isn’t one of the types listed here, the RODBC package is an alternative that uses ODBC database connections—this is particularly useful for connecting to SQL Server or Access databases. The functions have different names than their DBI counterparts, but the principles are very similar. You need to set up an ODBC data source on your machine (via the Control Panel under Windows; search for “ODBC” in the Start Menu) before R can connect. Figure 12-1 shows the ODBC registration wizard from Windows 7.
Then call odbcConnect to connect, sqlQuery to run a query, and odbcClose to clean up afterward:
library(RODBC)
conn <- odbcConnect("my data source name")
id_block <- sqlQuery(conn, "SELECT * FROM IdBlock")
odbcClose(conn)
Methods for accessing NoSQL databases (short for “Not only SQL”; lightweight data stores that are more easily scalable than traditional SQL relational databases) from R are less mature. MongoDB can be accessed via the RMongo or rmongodb packages, Cassandra can be accessed from the RCassandra package, and CouchDB can be accessed from the R4CouchDB package (not yet on CRAN but available on GitHub).
Built-in and package datasets can be loaded with the data function.
read.table and its variants read rectangular data.
readLines reads text files with nonstandard structures.
XML and HTML files can be imported with the XML package.
The RJSONIO, rjson, and yaml packages read JSON/YAML.
Excel files can be imported with the xlsx package.
The foreign package reads data from other statistics software.
Databases can be accessed with DBI and RODBC.
What is the difference between the read.csv and read.csv2 functions?
The DBI package provides a consistent interface to several database management systems. Which ones are supported?
In the extdata folder of the learningr package, there is a file named hafu.csv, containing data on mixed-race manga characters. Import the data into a data frame. [5]
Also in the extdata folder of learningr is an Excel file named multi-drug-resistant gonorrhoea infection.xls. Import the data from the first (and only) sheet into a data frame. Hint: this is a little easier if you view the file in a spreadsheet program first. LibreOffice is free and accomplishes this task easily. [10]
From the crab tag SQLite database described in this chapter, import the contents of the Daylog table into a data frame. [10]
[37] Or periods, if you are American.
[38] Specifically, migrating from the eastern North Sea near Germany to the western North Sea near the UK.
[39] There isn’t currently a method of importing the older SAS SD2 files into R. The easiest way of dealing with a file in this format is to open it with the free SAS Universal Viewer and resave it as CSV.
[40] Any database administrators worth their salt will tell you that writing SELECT * is lazy, and you should jolly well explicitly name your columns before that code goes anywhere near their servers, thank you very much.