Appendix E. Importing Data from Outside of R

Some Useful Internet Data Repositories

There are many websites from which you can download datasets to analyze with R. A few sources are presented in the list that follows, as examples of the vast universe of shared data. In many cases, you must register to use the datasets and/or agree to terms of use, so carefully read the requirements of any provider from whom you plan to acquire data. Datasets from the following sources are frequently offered in Excel or CSV format, both of which have already been discussed in the section “Reading from an External File”; some examples in other formats follow:

Open Access Directory (http://oad.simmons.edu/oadwiki/Main_Page)
This site provides links to downloadable data from many sources on diverse subjects, especially the sciences. Many of the datasets are free; some you must purchase. Scroll down the table of contents to “Data repositories” to see the variety of topics covered; under “Social sciences,” for example, you can choose from a long listing of archives.
FedStats (http://fedstats.sites.usa.gov/)
This is a repository of many kinds of United States federal government data freely available to the public. This page has links to various government agencies sharing data.
DATA.GOV (http://catalog.data.gov/dataset)
This is another repository of federal data.
Quandl (http://www.quandl.com)
This is a repository of more than 10 million datasets that are available for free download in several formats, including R data frames. Compared to many other sources, Quandl is easy to work with. Install and load the Quandl package:
> install.packages("Quandl")
> library(Quandl)

Browse the Quandl web page until you find a file that you want. For example, suppose that you chose the FBI “Crimes by State” file for Pennsylvania at http://www.quandl.com/FBI_UCR/USCRIME_STATE_PENNSYLVANIA. You can load it into an R data frame, penn.crime, with one command:

> penn.crime = Quandl("FBI_UCR/USCRIME_STATE_PENNSYLVANIA")
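Quandl series codes change over time, so treat this as a sketch; once the download succeeds, you can inspect the resulting data frame with the usual tools:

> head(penn.crime)   # the first six rows of the series
> str(penn.crime)    # the variable names and types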

Importing Data of Various Types into R

R can read data in many different formats. Importing data from some of the most important ones is discussed in this section.

CSV

Our first example is a simple CSV file from the National Science Foundation. Note that it looks very much like the example in the section “Reading from an External File”; however, because this file is not in a working directory on your computer, you must include the entire URL in quotes—identifying the web page from which it comes—as shown here:

> nsf2011 = read.csv(
  "http://www.nsf.gov/statistics/ffrdc/data/exp2011.csv",
    header=TRUE)

Statistical Packages (SPSS, SAS, Etc.)

I found an interesting dataset at the Association of Religion Data Archives (http://www.thearda.com). After reading about ARDA, click Data Archive on the Menu bar at the top of the page to see what datasets are available. Datasets come in many different formats. As an example, you can download “The Gravestone Index,” collected by Wilbur Zelinsky, at http://www.thearda.com/Archive/Files/Downloads/CEMFILE_DL.asp in any of three versions. Two formats, SPSS and Stata, were designed for rival statistical software packages. An R package called foreign can translate either of these formats into an R data frame. Here’s how to install and load the package:

> install.packages("foreign", dependencies=TRUE)
> library(foreign)

After downloading the SPSS file to your working directory, you can read it into an R data frame named stone by using the following commands:

> stone = read.spss("The Gravestone Index.SAV", to.data.frame=TRUE)
> fix(stone)  # look at the data in the editor

The foreign package can also read and write other data formats, such as Minitab, SAS, Octave, and Systat. You can learn more about the foreign package by using this command:

> library(help=foreign)
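If you download the Stata version of the file instead, the foreign package reads it with read.dta(); the filename here is an assumption based on the ARDA download, so adjust it to match what you actually saved:

> stone = read.dta("The Gravestone Index.dta")   # Stata version of the same data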

ASCII

The Gravestone file is also available as an ASCII file with fixed-width format. This means that the data falls into fixed positions on a line, without a space or other separator between data points. The first few lines and last few lines look like this:

11862  8  1182000000000000000000000000000000000
11868  8  1182000000000010000000000000000000000
11875  8  1182000000000000000000000000000000000
11910  8  1182000000000000000000000000000000000
11885  8  1182000000000000000000000000000000000
11861  8  1182000000000010000000000000000000000
11864  8  1182000000000010000000000000000000000

52003 18  64120000    0 110  0 00000001 0 0000000000 0
52003 18  64120000    0 1 0  0 00000000 0 0000000000 0
52007 18  64120000    0 0 0  0 0000000010 0010000000 0
52003 18  64120000    0 0 0  0 00000000 0 0000000000 0
51990 18  64120000    0 0 0  0 00000000 0 0000000000 0

The first part of the codebook looks like this:

1) BOOKNUM: 1
2) YBIRTH: 2-5
3) CEMNAME: 6-8
4) YEAREST: 9-10
5) CITYCEM: 11-12
6) COUNTRY: 13
7) COLLYEAR: 14
8) GOTHICW: 15
9) MARRIAG: 16
10) HEART: 17
11) HEARTSS: 18
12) MILITAR: 19
13) SECMESS: 20
14) OCCUPAT: 21

You can read the data by using the read.fwf() (read fixed-width format) command. Notice that there is no header information. Including the header=TRUE argument would give misleading information to R, which would try to assign variable names based on the numbers in the first row, resulting in an error message. It is necessary to include the widths argument, followed by a vector giving the column width of each variable, as indicated in the codebook. The first variable, BOOKNUM, is one column; the second variable, YBIRTH, is four columns (2 through 5); the third variable, CEMNAME, is three columns (6 through 8); and so on. The following command reads the ASCII file, which has been copied to the working directory:

> gs = read.fwf("The Gravestone Index.DAT", widths = c(1,4,3,2,2,1,
1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,1))

Alternatively, instead of typing 42 1s, you can use the rep() function to accomplish the same thing:

> gs = read.fwf("The Gravestone Index.DAT",
    widths = c(1,4,3,2,2,rep(1,42)))

The SPSS version of this dataset included variable names, but the ASCII file does not, so read.fwf() assigns the default names V1, V2, V3, and so on. We can give the variables real names by creating a new vector with the names from the codebook:

> vars = c("BOOKNUM", "YBIRTH", "CEMNAME", "YEAREST", "CITYCEM",
"COUNTRY", "COLLYEAR", "GOTHICW", "MARRIAG", "HEART", "HEARTSS",
"MILITAR", "SECMESS", "OCCUPAT", "PICTORIA", "DECEAEL", 
"PHOTODEC", "RELMES", "SYMBOL", "ANGEL", "SYMBOOK", "SYMDEATH", 
"DOVE", "FISH", "FINGERS", "SYMDIVIN", "GATES", "HANDSIP", 
"IHSE", "HANDS", "LAMB", "SYMCROSS", "STATUE", "STAANGEL",
"STABOOK", "STADIVIN", "STALAMB", "STACROSS", "EFFIGY", 
"WEEPWIL", "SECULAR", "SYMCHURC", "HANDSCIP", "CROWN", 
"STADOVE", "PICCHURC", "STADEATH")

Then we can give the variables of gs the names in the vars vector:

> names(gs) = vars
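A quick sanity check, assuming the read.fwf() and naming steps above ran without error, confirms that every column received a name:

> names(gs)[1:5]             # "BOOKNUM" "YBIRTH" "CEMNAME" "YEAREST" "CITYCEM"
> length(vars) == ncol(gs)   # TRUE: one name for each of the 47 columns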

XML

XML (Extensible Markup Language) is a text format used for exchange of data. Because there are so many different formats for data, some of which are proprietary or even secret, it becomes virtually impossible to translate every format to every other one. XML, which is transparent and open, is a common means for sharing data among different computer systems and applications. There is an XML package for R, making it possible for R users to read and create XML documents. You can find the documentation for this package at http://www.omegahat.org/RSXML/. XML files can be considerably more complex than the simple flat files we have looked at so far. There will usually be some exploration of the XML file required—to learn its structure—before converting it to an R data frame. Following is an example of converting a relatively simple XML file to a data frame. This is the Federal Election Commission 2009–10 Candidate Summary File, which you can find at http://catalog.data.gov/dataset/2009-2010-candidate-summary-file.

You can do the conversion after you install and load the XML package:

> install.packages("XML",dependencies=TRUE)
> library(XML)
> cand = xmlToDataFrame("CandidateSummaryAction.xml")
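Before converting, you may want to confirm that the file has the simple one-record-per-node structure that xmlToDataFrame() expects. A brief exploration, assuming the file has been saved to your working directory, might look like this:

> doc = xmlParse("CandidateSummaryAction.xml")
> top = xmlRoot(doc)   # the root node of the document
> xmlName(top)         # the name of the root element
> xmlSize(top)         # the number of child nodes (one per candidate)
> top[[1]]             # examine the fields of the first record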

netCDF

You can find the following dataset in the data repositories list on the National Snow and Ice Data Center (http://nsidc.org). I have chosen it to demonstrate another data type, the netCDF (Network Common Data Form) file. This format has become popular for storing large arrays of scientific data, especially geophysical data. Like XML, datasets in this format can be complex. Download the dataset by FTP from http://bit.ly/1jO6Ir9 and save it to your working directory. Install and load the ncdf package to work with this data in R:

> install.packages("ncdf")
> library(ncdf)

This dataset is a rather complex list of objects, each of which is itself a list of objects. In netCDF parlance, each of the main lists is a “variable.” To use the data in R, it is necessary to make a subset of the data that includes the list of items associated with one variable. You can accomplish this as follows:

> ice = open.ncdf("seaice_conc_monthly_nh_f08_198707_v02r00.nc")
> # creates an R object named "ice"
> str(ice) # shows that ice is a list comprised of other lists
> icedata = get.var.ncdf(ice,"seaice_conc_monthly_cdr")
> close.ncdf(ice)

The names of the variables were discovered in the results of the str(ice) command, and seaice_conc_monthly_cdr was selected for the sake of this example. In most cases, you will need to know more about the data in order to select a variable name.
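Rather than searching the full str() output, you can list the variable names directly; in the ncdf package, the object returned by open.ncdf() stores its variables in a named list (this sketch assumes the file opened successfully):

> ice = open.ncdf("seaice_conc_monthly_nh_f08_198707_v02r00.nc")
> names(ice$var)   # the names of all variables in the file
> close.ncdf(ice)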

Web Scraping

It is also possible to copy data contained within web pages. This is commonly known as web scraping. A thorough discussion of the topic is beyond the scope of this book, but should you have a need to extract web data, a good place to start would be the help files for download.file() and readLines(). There are some packages that might be useful, such as RCurl, XML, and several others.
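As a minimal illustration of the readLines() approach, the following sketch fetches a page’s HTML source as a character vector, one element per line, and picks out the lines containing a tag of interest; the URL is only a placeholder:

> page = readLines("http://www.example.com")   # download the raw HTML
> grep("<title>", page, value=TRUE)            # lines containing the <title> tag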
