There are many websites from which you can download datasets to analyze with R. A few sources are presented in the list that follows, as examples of the vast universe of shared data. In many cases, it is necessary to register to use the datasets and/or agree to terms of use. Carefully read the requirements of any provider from whom you plan to acquire data. Datasets from the following sources are frequently offered in Excel or CSV format, which have already been discussed in the section “Reading from an External File”; some examples in other formats follow:
> install.packages("Quandl")
> library(Quandl)
Browse the Quandl web page until you find a file that you want. For example, suppose that you chose the FBI “Crimes by State” file for Pennsylvania at http://www.quandl.com/FBI_UCR/USCRIME_STATE_PENNSYLVANIA. You can load it into an R data frame, penn.crime, with one command:
> penn.crime = Quandl("FBI_UCR/USCRIME_STATE_PENNSYLVANIA")
R can read data in many different formats. Importing data from some of the most important ones is discussed in this section.
Our first example is a simple CSV file from the National Science Foundation. Note that it looks very much like the example in the section “Reading from an External File”; however, because this file is not in a working directory on your computer, you must include the entire URL in quotes—identifying the web page from which it comes—as shown here:
> nsf2011 = read.csv( "http://www.nsf.gov/statistics/ffrdc/data/exp2011.csv", header=TRUE)
I found an interesting dataset at the Association of Religion Data Archives (http://www.thearda.com). After reading about ARDA, click Data Archive on the Menu bar at the top of the page to see what datasets are available. Datasets come in many different formats. As an example, you can download “The Gravestone Index,” collected by Wilbur Zelinsky, at http://www.thearda.com/Archive/Files/Downloads/CEMFILE_DL.asp in any of three versions. Two formats, SPSS and Stata, were designed for rival statistical software packages. An R package called foreign can translate either of these formats into an R data frame. Here’s how to install and load the package:
> install.packages("foreign", dependencies=TRUE)
> library(foreign)
After downloading the SPSS file to your working directory, you can read it into an R data frame named stone by using the following commands:
> stone = read.spss("The Gravestone Index.SAV", to.data.frame=TRUE)
> fix(stone) # look at the data in the editor
The foreign package can also read other data formats, such as Minitab, SAS, Octave, and Systat, and it can write a few formats as well. You can learn more about the foreign package by using this command:
> library(help=foreign)
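The Stata version of a file can be read in much the same way with read.dta(), which is also part of the foreign package. The following is only a minimal sketch, not the Gravestone data itself: it writes a tiny made-up data frame (toy, with invented values and hypothetical column names) to a temporary Stata file with write.dta() and reads it back, so that it runs without any download.

```r
library(foreign)  # normally installed alongside R as a recommended package

# A small made-up data frame standing in for a downloaded dataset
toy = data.frame(ybirth  = c(1862, 1868, 1875),
                 cemname = c(8, 8, 8))

# Write it in Stata format to a temporary file, then read it back
dta.file = tempfile(fileext = ".dta")
write.dta(toy, dta.file)
toy.back = read.dta(dta.file)

str(toy.back)  # a data frame with the same two variables
```

With a real download, you would replace dta.file with the name of the Stata file in your working directory.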
The Gravestone file is also available as an ASCII file with fixed-width format. This means that the data falls into fixed positions on a line, without a space or other separator between data points. The first few lines and last few lines look like this:
11862 8 1182000000000000000000000000000000000
11868 8 1182000000000010000000000000000000000
11875 8 1182000000000000000000000000000000000
11910 8 1182000000000000000000000000000000000
11885 8 1182000000000000000000000000000000000
11861 8 1182000000000010000000000000000000000
11864 8 1182000000000010000000000000000000000
52003 18 64120000 0 110 0 00000001 0 0000000000 0
52003 18 64120000 0 1 0 0 00000000 0 0000000000 0
52007 18 64120000 0 0 0 0 0000000010 0010000000 0
52003 18 64120000 0 0 0 0 00000000 0 0000000000 0
51990 18 64120000 0 0 0 0 00000000 0 0000000000 0
The first part of the codebook looks like this:
1) BOOKNUM: 1
2) YBIRTH: 2-5
3) CEMNAME: 6-8
4) YEAREST: 9-10
5) CITYCEM: 11-12
6) COUNTRY: 13
7) COLLYEAR: 14
8) GOTHICW: 15
9) MARRIAG: 16
10) HEART: 17
11) HEARTSS: 18
12) MILITAR: 19
13) SECMESS: 20
14) OCCUPAT: 21
You can read the data by using the read.fwf() (read fixed-width format) command. Notice that there is no header information. Including the header=TRUE argument would mislead R, which would try to assign variable names according to the numbers in the first row. This would result in an error message. You must include the widths argument, followed by a vector giving the column widths of each of the variables, as indicated in the codebook. The first variable, BOOKNUM, is one column; the second variable, YBIRTH, is four columns (2 through 5); the third variable, CEMNAME, is three columns (6 through 8); and so on. The following command reads the ASCII file, which has been copied to the working directory:
gs = read.fwf("The Gravestone Index.DAT",widths = c(1,4,3,2,2,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1))
Alternatively, instead of typing 42 1s, you can use the rep() function to accomplish the same thing:
> gs = read.fwf("The Gravestone Index.DAT", widths = c(1,4,3,2,2,rep(1,42)))
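A quick check of the widths vector can catch counting mistakes before you read the file. This short sketch uses only base R; comparing the results with the codebook is my suggestion rather than a required step.

```r
# Column widths from the codebook: the first five fields, then 42 one-column fields
widths = c(1, 4, 3, 2, 2, rep(1, 42))

length(widths)  # number of variables that read.fwf() will create
sum(widths)     # total number of character positions consumed on each line
```

The length should equal the number of variables listed in the codebook (47 here), and the sum (54 here) is the number of character positions that read.fwf() will take from each line.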
The SPSS data file of this same data had variable names, but the ASCII file comes without names, so the variables are assigned the names V1, V2, V3, and so on. We can give the variables real names by creating a new vector with the names from the codebook:
vars = c("BOOKNUM", "YBIRTH", "CEMNAME", "YEAREST", "CITYCEM", "COUNTRY", "COLLYEAR", "GOTHICW", "MARRIAG", "HEART", "HEARTSS", "MILITAR", "SECMESS", "OCCUPAT", "PICTORIA", "DECEAEL", "PHOTODEC", "RELMES", "SYMBOL", "ANGEL", "SYMBOOK", "SYMDEATH", "DOVE", "FISH", "FINGERS", "SYMDIVIN", "GATES", "HANDSIP", "IHSE", "HANDS", "LAMB", "SYMCROSS", "STATUE", "STAANGEL", "STABOOK", "STADIVIN", "STALAMB", "STACROSS", "EFFIGY", "WEEPWIL", "SECULAR", "SYMCHURC", "HANDSCIP", "CROWN", "STADOVE", "PICCHURC", "STADEATH")
Then we can give the variables of gs the names in the vars vector:
> names(gs) = vars
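The whole fixed-width workflow can be tried on a miniature file. In the sketch below, the three records are made up for illustration (they are not taken from the Gravestone file) and use only the first three codebook fields; toy.lines, toy.file, and toy are hypothetical names.

```r
# Three made-up fixed-width records: BOOKNUM (1 column), YBIRTH (4), CEMNAME (3)
toy.lines = c("11862  8",
              "11868  8",
              "51990 18")
toy.file = tempfile(fileext = ".DAT")
writeLines(toy.lines, toy.file)

# The file has no header row, so header is left at its default of FALSE
toy = read.fwf(toy.file, widths = c(1, 4, 3))
names(toy)   # the default names V1, V2, V3

# Replace the default names with names from the codebook
names(toy) = c("BOOKNUM", "YBIRTH", "CEMNAME")
toy
```

Because the fields are split by position rather than by a separator, the blanks that pad CEMNAME are simply part of that field and are stripped when the value is converted to a number.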
XML (Extensible Markup Language) is a text format used for exchange of data. Because there are so many different formats for data, some of which are proprietary or even secret, it becomes virtually impossible to translate every format to every other one. XML, which is transparent and open, is a common means for sharing data among different computer systems and applications. There is an XML package for R, making it possible for R users to read and create XML documents. You can find the documentation for this package at http://www.omegahat.org/RSXML/. XML files can be considerably more complex than the simple flat files we have looked at so far. You will usually need to explore an XML file to learn its structure before converting it to an R data frame. Following is an example of converting a relatively simple XML file to a data frame. This is the Federal Election Commission 2009–10 Candidate Summary File, which you can find at http://catalog.data.gov/dataset/2009-2010-candidate-summary-file.
You can do the conversion after you install and load the XML package:
> install.packages("XML",dependencies=TRUE)
> library(XML)
> cand = xmlToDataFrame("CandidateSummaryAction.xml")
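Exploring the structure first might look like the following sketch. It assumes the XML package is installed and skips itself otherwise. The file is a tiny made-up example, not the FEC file; the element names candidates, candidate, name, and total are hypothetical.

```r
# Assumes the XML package is installed; the sketch is skipped otherwise
if (requireNamespace("XML", quietly = TRUE)) {
  library(XML)

  # A tiny made-up file: one <candidate> record per row, one child element per column
  xml.file = tempfile(fileext = ".xml")
  writeLines(paste0(
    "<candidates>",
    "<candidate><name>A. Smith</name><total>100</total></candidate>",
    "<candidate><name>B. Jones</name><total>250</total></candidate>",
    "</candidates>"), xml.file)

  doc  = xmlParse(xml.file)
  root = xmlRoot(doc)
  xmlName(root)                   # name of the top-level element
  xmlSize(root)                   # number of records under the root
  names(xmlChildren(root[[1]]))   # fields in the first record

  # A flat structure like this one converts directly to a data frame
  toy.cand = xmlToDataFrame(xml.file)
  toy.cand
}
```

When every record under the root has the same simple fields, as here, xmlToDataFrame() is usually enough. More deeply nested files generally need extra work to pick out the nodes you want.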
You can find the following dataset in the data repositories list on the National Snow and Ice Data Center (http://nsidc.org). I have chosen it to demonstrate another data type, the netCDF (Network Common Data Form) file. This format has become popular for storing large arrays of scientific data, especially geophysical data. Like XML, datasets in this format can be complex. Download the dataset by FTP from http://bit.ly/1jO6Ir9 and save it to your working directory. Install and load the ncdf package to work with this data in R:
> install.packages("ncdf")
> library(ncdf)
This dataset is a rather complex list of objects, each of which is itself a list of objects. In netCDF parlance, each of the main lists is a “variable.” To use the data in R, it is necessary to make a subset of the data that will include the list of items associated with one variable. You can accomplish this as follows:
> ice = open.ncdf("seaice_conc_monthly_nh_f08_198707_v02r00.nc")
> # creates an R object named "ice"
> str(ice) # shows that ice is a list comprised of other lists
> icedata = get.var.ncdf(ice,"seaice_conc_monthly_cdr")
> close.ncdf(ice)
The names of the variables were discovered in the results of the str(ice) command, and seaice_conc_monthly_cdr was selected for the sake of this example. In most cases, you will need to know more about the data in order to select a variable name.
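If the ncdf package is not available for your version of R, the same steps can be sketched with the ncdf4 package, which to my knowledge has taken its place on CRAN. This is an assumption on my part and not the method used above. In ncdf4, nc_open(), ncvar_get(), and nc_close() play the roles of open.ncdf(), get.var.ncdf(), and close.ncdf(). The sketch builds a tiny made-up file with one hypothetical variable, toy_conc, so that it runs without the sea ice download, and it skips itself if ncdf4 is not installed.

```r
# Assumes the ncdf4 package is installed; the sketch is skipped otherwise
if (requireNamespace("ncdf4", quietly = TRUE)) {
  library(ncdf4)

  # Build a tiny made-up netCDF file holding one 2 x 3 variable named "toy_conc"
  x = ncdim_def("x", "count", 1:2)
  y = ncdim_def("y", "count", 1:3)
  v = ncvar_def("toy_conc", "fraction", list(x, y), missval = -1)
  nc.file = tempfile(fileext = ".nc")
  nc = nc_create(nc.file, v)
  ncvar_put(nc, v, matrix(c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6), nrow = 2))
  nc_close(nc)

  # Open the file, list the variable names, and extract one variable
  toy.ice = nc_open(nc.file)
  var.names = names(toy.ice$var)   # a shorter route to the names than reading str() output
  toy.data = ncvar_get(toy.ice, "toy_conc")
  nc_close(toy.ice)

  var.names
  dim(toy.data)
}
```

Listing names(toy.ice$var) is often quicker than scanning the full str() output when all you need is the set of variable names.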
It is also possible to copy data contained within web pages. This is commonly known as web scraping. A thorough discussion of the topic is beyond the scope of this book, but should you have a need to extract web data, a good place to start would be the help files for download.file() and readLines(). There are some packages that might be useful, such as RCurl, XML, and several others.
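As a very small taste of the idea, the following sketch reads a page with readLines() and pulls a table out of it with base R text functions. The HTML is a made-up fragment written to a temporary file so that the example runs offline; with a real page, download.file() would create the local copy. The names html.file, page, rows, and scraped are hypothetical, and real pages are rarely this tidy.

```r
# A made-up HTML fragment saved locally; download.file(url, destfile) would
# normally create such a file from a real web address
html.file = tempfile(fileext = ".html")
writeLines(c("<html><body><table>",
             "<tr><td>Pennsylvania</td><td>1862</td></tr>",
             "<tr><td>Ohio</td><td>1875</td></tr>",
             "</table></body></html>"), html.file)

page = readLines(html.file)

# Keep only the table rows, turn cell boundaries into commas, then strip the tags
rows = grep("<tr>", page, value = TRUE)
rows = gsub("</td><td>", ",", rows)
rows = gsub("<[^>]+>", "", rows)

# Parse the cleaned text as CSV
scraped = read.csv(text = rows, header = FALSE, col.names = c("state", "year"))
scraped
```

Pattern matching like this breaks easily when a page changes its layout, which is one reason the packages mentioned above are worth a look for anything beyond a one-off job.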