Chapter 6. Reading Data into R

Now that we have seen some of R’s basic functionality, it is time to load in data. As with everything in R, there are numerous ways to get data; the most common is probably reading comma separated values (CSV) files. Of course there are many other options that we will cover as well.

6.1. Reading CSVs

The best way to read data from a CSV file is to use read.table. It might be tempting to use read.csv but that is more trouble than it is worth, and all it does is call read.table with some arguments preset. The result of using read.table is a data.frame.

The first argument to read.table is the full path of the file to be loaded. The file can be sitting on disk or even on the Web. For the purposes of this book we will read from the Web.

Any CSV will work but we have posted an incredibly simple CSV at http://www.jaredlander.com/data/Tomato%20First.csv. Let’s read that into R using read.table.

> theUrl <- "http://www.jaredlander.com/data/Tomato%20First.csv"
> tomato <- read.table(file = theUrl, header = TRUE, sep = ",")

This can now be seen using head.

> head(tomato)

  Round             Tomato Price      Source Sweet Acid Color Texture
1     1         Simpson SM  3.99 Whole Foods   2.8  2.8   3.7     3.4
2     1  Tuttorosso (blue)  2.99     Pioneer   3.3  2.8   3.4     3.0
3     1 Tuttorosso (green)  0.99     Pioneer   2.8  2.6   3.3     2.8
4     1     La Fede SM DOP  3.99   Shop Rite   2.6  2.8   3.0     2.3
5     2       Cento SM DOP  5.49  D Agostino   3.3  3.1   2.9     2.8
6     2      Cento Organic  4.99  D Agostino   3.2  2.9   2.9     3.1
  Overall Avg.of.Totals Total.of.Avg
1     3.4          16.1         16.1
2     2.9          15.3         15.3
3     2.9          14.3         14.3
4     2.8          13.4         13.4
5     3.1          14.4         15.2
6     2.9          15.5         15.1

As mentioned before, the first argument is the file name in quotes (or as a character variable). Notice how we explicitly used the argument names file, header and sep. As discussed in Section 4.5, function arguments can be specified positionally, without their names, but naming the arguments is good practice. The second argument, header, indicates that the first row of data holds the column names. The third argument, sep, gives the delimiter separating data cells. Changing this to other values such as "\t" (tab delimited) or ";" (semicolon delimited) allows it to read other types of files.
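For instance, a tab-delimited file would be read the same way with only sep changed. A minimal sketch, assuming a hypothetical file called theData.tsv:

> # hypothetical tab-delimited file; only the sep argument changes
> theTSV <- read.table(file = "theData.tsv", header = TRUE, sep = "\t")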

One helpful argument that often goes unnoticed is stringsAsFactors. Setting it to FALSE (the default is TRUE) prevents character columns from being converted to factor columns. This both saves computation time, which can be dramatic for a large dataset with many character columns containing many unique values, and keeps the columns as character data, which are easier to work with.
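For example, the tomato data can be reread with its character columns left as characters.

> tomato <- read.table(file = theUrl, header = TRUE, sep = ",",
+                      stringsAsFactors = FALSE)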

Although we do not mention this argument in Section 5.1, stringsAsFactors can be used in data.frame. Re-creating that first bit of code results in an easier-to-use “Sport” column.

> x <- 10:1
> y <- -4:5
> q <- c("Hockey", "Football", "Baseball", "Curling", "Rugby",
+        "Lacrosse", "Basketball", "Tennis", "Cricket", "Soccer")
> theDF <- data.frame(First=x, Second=y, Sport=q, stringsAsFactors=FALSE)
> theDF$Sport

 [1] "Hockey"      "Football"   "Baseball"     "Curling"      "Rugby"
 [6] "Lacrosse"    "Basketball" "Tennis"       "Cricket"      "Soccer"

There are numerous other arguments to read.table, the most useful being quote and colClasses, specifying the character used for enclosing cells and the data type for each column, respectively.
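As a sketch of both arguments, here is the tomato file again with the quoting character made explicit and each column's type spelled out. The classes listed are our reading of the data, not something declared in the file.

> tomato <- read.table(file = theUrl, header = TRUE, sep = ",",
+                      quote = "\"",
+                      colClasses = c("numeric", "character", "numeric",
+                                     "character", rep("numeric", 7)))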

Sometimes a CSV uses the comma inside cells as well, most commonly as the decimal mark in European-style files; such files use the semicolon as the cell separator instead. In this case read.csv2 (or read.delim2 for the tab-delimited equivalent) should be used; like read.csv, these simply call read.table with the appropriate defaults preset. A sketch follows.
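A minimal sketch, assuming a hypothetical semicolon-delimited file with comma decimal marks:

> # hypothetical file; read.csv2 presets sep=";" and dec=","
> euroData <- read.csv2(file = "theData.csv")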

6.2. Excel Data

While Excel may be the world’s most popular data analysis tool, it is unfortunately difficult to read Excel data into R. The simplest method would be to use Excel (or another spreadsheet program) to convert the Excel file to a CSV file. That might sound like a cop-out but it is the easiest method to use. The R community abounds with hacks to get data from Excel into R, such as using the Clipboard to copy and paste, but those are inelegant at best and can fail with large amounts of data.

A number of packages exist to tackle this problem, such as gdata, XLConnect, xlsReadWrite and others, but they all have some onerous requirement such as Java, Perl or 32-bit R, which may be undesirable or unavailable. The RODBC package has a function, odbcConnectExcel2007, that reads Excel files but requires a DSN¹ connection, which is not a feasible everyday strategy.

1. A DSN is a data source connection used to describe communication to a data source, often a database.

Excel 2007 (.xlsx) files are essentially zipped XML files. This means they could theoretically be parsed using the XML package, but we have not seen this done as of yet.

6.3. Reading from Databases

Databases arguably store the vast majority of the world’s data. Most of these, whether they be Microsoft SQL Server, DB2, MySQL or Microsoft Access, provide an ODBC connection. Accordingly, R makes use of ODBC through the aptly named RODBC package. Like any other package, it must be installed and loaded before use.

> require(RODBC)

The first step to reading from a database is to create a DSN. This differs by operating system but should result in a string name for that connection. This is used in odbcConnect to create a connection for R to use. Optional, but common, arguments are uid and pwd for the database username and password, respectively.

> db <- odbcConnect("QV Training")
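If the database requires credentials, they can be passed along as well; the username and password here are placeholders, not real values.

> db <- odbcConnect("QV Training", uid = "username", pwd = "password")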

At this point we are ready to run a query on that database using sqlQuery. This can be any valid SQL query of arbitrary complexity. sqlQuery returns a data.frame just like any other. Fortunately, sqlQuery has the stringsAsFactors argument first seen in Section 6.1. Again, setting this to FALSE is usually a good idea, as it will save processing time and keep the character columns easier to work with.

> # simple SELECT * query from one table
> ordersTable <- sqlQuery(db, "SELECT * FROM Orders",
+                         stringsAsFactors=FALSE)
> # simple SELECT * query from another table
> detailsTable <- sqlQuery(db, "SELECT * FROM [Order Details]",
+                          stringsAsFactors=FALSE)
> # do a join between the two tables
> longQuery <- "SELECT * FROM Orders, [Order Details]
+           WHERE Orders.OrderID = [Order Details].OrderID"
> detailsJoin <- sqlQuery(db, longQuery, stringsAsFactors=FALSE)

We can easily check the results of these queries by viewing the resulting data.frames.

> head(ordersTable)

  OrderID  OrderDate CustomerID EmployeeID ShipperID Freight
1   10248 2008-06-29          4          2         2   43.48
2   10249 2007-06-29         79          7         2   29.20
3   10250 2008-07-03         34          2         2   79.17
4   10251 2007-12-02          1          7         2   43.41
5   10252 2008-04-04         76          5         1   23.20
6   10253 2008-07-05         34          3         2   66.54

> head(detailsTable)

  OrderID LineNo ProductID Quantity UnitPrice Discount
1   10402      2        63       65     18.94     0.00
2   10403      1        48       70     31.83     0.15
3   10403      2        16       21     10.15     0.15
4   10404      1        42       40     13.37     0.05
5   10404      2        49       30     19.82     0.05
6   10404      3        26       30     33.93     0.05

> head(detailsJoin)

  OrderID  OrderDate CustomerID EmployeeID ShipperID Freight
1   10402 2006-04-28         20          4         1   46.63
2   10403 2006-09-28         20          4         1   26.43
3   10403 2006-09-28         20          4         1   26.43
4   10404 2006-04-19         49          6         1   72.73
5   10404 2006-04-19         49          6         1   72.73
6   10404 2006-04-19         49          6         1   72.73
  OrderID.1 LineNo ProductID Quantity UnitPrice Discount
1     10402      2        63       65     18.94     0.00
2     10403      1        48       70     31.83     0.15
3     10403      2        16       21     10.15     0.15
4     10404      1        42       40     13.37     0.05
5     10404      2        49       30     19.82     0.05
6     10404      3        26       30     33.93     0.05

While it is not necessary, it is good practice to close the ODBC connection using odbcClose once we are done with it, rather than relying on the connection being cleaned up when R exits.
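Closing the connection is a single call.

> odbcClose(db)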

6.4. Data from Other Statistical Tools

In an ideal world no tool other than R would ever be needed, but in reality data are sometimes locked in a proprietary format such as those from SAS, SPSS or Octave. The foreign package provides a number of functions similar to read.table for reading in data from other tools.

A partial list of functions to read data from commonly used statistical tools is in Table 6.1. The arguments for these functions are generally similar to read.table. These functions usually return the data as a data.frame but do not always succeed.

  Function       Format
  read.spss      SPSS
  read.dta       Stata
  read.ssd       SAS
  read.octave    Octave
  read.mtp       Minitab
  read.systat    Systat

Table 6.1 Functions for Reading Data from Some Commonly Used Statistical Tools

While read.ssd can read SAS data, it requires a valid SAS license. This can be sidestepped by using Revolution R from Revolution Analytics with their special RxSasData function in their RevoScaleR package.
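As a sketch, reading an SPSS file looks much like calling read.table; the file name here is hypothetical, and setting to.data.frame=TRUE asks for a data.frame rather than the default list.

> require(foreign)
> # hypothetical SPSS file
> spssData <- read.spss(file = "theData.sav", to.data.frame = TRUE)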

6.5. R Binary Files

When working with other R programmers, a good way to pass around data—or any R objects like variables and functions—is to use RData files. These are binary files that represent R objects of any kind. They can store a single object or multiple objects and can be passed among Windows, Mac and Linux without a problem.

First, let’s create an RData file, remove the object that created it, and then read it back into R.

> # save the tomato data.frame to disk
> save(tomato, file = "data/tomato.rdata")
> # remove tomato from memory
> rm(tomato)
> # check if it still exists
> head(tomato)

Error: object 'tomato' not found

> # read it from the rdata file
> load("data/tomato.rdata")
> # check if it exists now
> head(tomato)

  Round             Tomato Price      Source Sweet Acid Color Texture
1     1         Simpson SM  3.99 Whole Foods   2.8  2.8   3.7     3.4
2     1  Tuttorosso (blue)  2.99     Pioneer   3.3  2.8   3.4     3.0
3     1 Tuttorosso (green)  0.99     Pioneer   2.8  2.6   3.3     2.8
4     1     La Fede SM DOP  3.99   Shop Rite   2.6  2.8   3.0     2.3
5     2       Cento SM DOP  5.49  D Agostino   3.3  3.1   2.9     2.8
6     2      Cento Organic  4.99  D Agostino   3.2  2.9   2.9     3.1
  Overall Avg.of.Totals Total.of.Avg
1     3.4          16.1         16.1
2     2.9          15.3         15.3
3     2.9          14.3         14.3
4     2.8          13.4         13.4
5     3.1          14.4         15.2
6     2.9          15.5         15.1

Now let’s create a few objects to store in a single RData file, remove them and then load them again.

> # create some objects
> n <- 20
> r <- 1:10
> w <- data.frame(n, r)
> # check them out
> n

[1] 20

> r

 [1] 1 2 3 4 5 6 7 8 9 10

> w

    n  r
1  20  1
2  20  2
3  20  3
4  20  4
5  20  5
6  20  6
7  20  7
8  20  8
9  20  9
10 20 10

> # save them
> save(n, r, w, file = "data/multiple.rdata")
> # delete them
> rm(n, r, w)
> # are they gone?
> n

Error: object 'n' not found

> r

Error: object 'r' not found

> w

Error: object 'w' not found

> # load them back
> load("data/multiple.rdata")
> # check them out again
> n

[1] 20

> r

 [1] 1 2 3 4 5 6 7 8 9 10

> w

    n  r
1  20  1
2  20  2
3  20  3
4  20  4
5  20  5
6  20  6
7  20  7
8  20  8
9  20  9
10 20 10

6.6. Data Included with R

R and some packages come with datasets included, so we always have data at hand to experiment with. Accessing these data is simple as long as we know what to look for. ggplot2, for instance, comes with a dataset about diamonds. It can be loaded using the data function.

> require(ggplot2)
> data(diamonds)
> head(diamonds)

  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

To find a list of available data, simply type data() into the console.

6.7. Extract Data from Web Sites

These days a lot of data are displayed on Web pages. If we are lucky, it is stored neatly in an HTML table. If we are not so lucky, we might need to parse the text of the page.

6.7.1. Simple HTML Tables

If the data are stored neatly in an HTML table, we can use readHTMLTable in the XML package to extract them easily. On my site there is a post about a Super Bowl pool I was asked to analyze at http://www.jaredlander.com/2012/02/another-kind-of-super-bowl-pool. In that post there is a table with three columns that we wish to extract. It is fairly simple to do with the following code.

> require(XML)
> theURL <- "http://www.jaredlander.com/2012/02/another-kind-of-
+        super-bowl-pool/"
> bowlPool <- readHTMLTable(theURL, which = 1, header = FALSE,
+          stringsAsFactors = FALSE)
> bowlPool

               V1      V2        V3
1   Participant 1 Giant A Patriot Q
2   Participant 2 Giant B Patriot R
3   Participant 3 Giant C Patriot S
4   Participant 4 Giant D Patriot T
5   Participant 5 Giant E Patriot U
6   Participant 6 Giant F Patriot V
7   Participant 7 Giant G Patriot W
8   Participant 8 Giant H Patriot X
9   Participant 9 Giant I Patriot Y
10 Participant 10 Giant J Patriot Z

Here the first argument was the URL but it could have also been a file on disk. The which argument allows us to choose which table to read if there are multiple tables. For this example, there was only one table but it could have easily been the second or third or fourth. We set header to FALSE to indicate that no header was in the table. Last, we used stringsAsFactors=FALSE so that the character columns would not be converted to factors.

6.7.2. Scraping Web Data

If the data are not so neatly stored, it is possible to scrape them off the page, although this is a very involved process. It requires good pattern matching and regular expressions, which are covered in Section 13.14. The idea is to figure out what common pattern surrounds different pieces of data, and this requires at least a basic knowledge of HTML.
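As a minimal sketch of the idea using only base R, we can pull the raw HTML of the Super Bowl pool page from before and extract every piece of text matching a regular expression; the pattern here is purely illustrative.

> # read the raw HTML, one line per element
> pageLines <- readLines(theURL)
> # find every occurrence of "Participant" followed by a number
> matches <- regmatches(pageLines, gregexpr("Participant [0-9]+", pageLines))
> # flatten the per-line lists into one character vector
> unlist(matches)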

6.8. Conclusion

Reading data is the first step in any analysis; without data there is nothing to do. The most common way to read data into R is from a CSV using read.table. RODBC provides an excellent method for reading from any database with a DSN. Data trapped in HTML tables can be read easily using the XML package. R also has a special binary file format, RData, for the quick storage, loading and transfer of R objects.
