Hour 10. Importing and Exporting


What You’ll Learn in This Hour:

Storage of data in R

Working with flat files

Connecting to databases

Working with Microsoft Excel


In Hours 3 through 6, we looked at the various mechanisms for storing data in R and some useful functions for manipulating modes of R data. In this hour, you are introduced to the common methods for importing and exporting data. By the end of the hour you will have seen how R can be used to read and write flat files and connect to database management systems (DBMSs) as well as Microsoft Excel.

Working with Text Files

Everyday R users tend to prefer importing and exporting Comma Separated Value (CSV) and other text-based (“flat file”) formats. Text files are, of course, completely open and can easily be generated from any analysis tool. Reading flat files into R (and exporting them) is very straightforward.


Tip: File Navigation

The file.choose function allows us to browse and select a file to import using our operating system’s standard file browsing interface.


Perhaps the easiest way to import a text file in RStudio is via the menu system. The Import Wizard can be started by navigating to Tools > Import Dataset > From Text File and then browsing to the file you wish to import. The wizard looks at the file and tries to detect whether your dataset has headers and which character separates columns. In most cases the defaults are correct, and you simply need to click the Import button when you are ready to import your data.

Reading in Text Files

The RStudio import feature is, of course, unique to RStudio. However, if you try it, you will notice that, like many of the menu features in RStudio, it produces the line of R code required to read in the data, which is great if you work in a heavily regulated industry where reproducible code is a necessity. We will now look at the functions read.table and read.csv used by the Import Wizard.

The read.table function reads tabular information from a text file and returns a data frame. An example of using read.table to read in djiData.csv, embedded within the mangoTraining package, is shown in Listing 10.1. In the example, we assume that the data has been copied to our current working directory for simplified file referencing. Note that when we call read.table, we create a named R object. This is how we will refer to the dataset once we have read it into R. If we don’t do this, R will just print the dataset to the screen and we won’t be able to access it.

LISTING 10.1 Reading in Text Files


 1: > djiData <- read.table("djiData.csv", header = TRUE, sep = ",")
 2: > head(djiData, 3)
 3:         Date DJI.Open DJI.High  DJI.Low DJI.Close DJI.Volume DJI.Adj.Close
 4: 1 12/31/2014 17987.66 18043.22 17820.88  17823.07   82840000      17823.07
 5: 2 12/30/2014 18035.02 18035.02 17959.70  17983.07   47490000      17983.07
 6: 3 12/29/2014 18046.58 18073.04 18021.57  18038.23   53870000      18038.23


The first line in Listing 10.1 only works because we first copy djiData.csv to our working directory. R then uses relative paths to find and import the data. If we instead place the file in a “data” directory within our working directory, then we can import it using the line:

> djiData <- read.table("data/djiData.csv", header = TRUE, sep = ",")

Alternatively, we can provide the full file path to the file; however, this makes our code less transferable, particularly when importing multiple files because we would have to change the file path for each dataset that we import. As we discussed in Hour 2, “The R Environment,” it is important to remember to use forward slashes when referencing file paths.


Tip: Package Data

In Listing 10.1, we copy the data from the mangoTraining package to our working directory in order to read it in. This highlights the ease with which data can be imported from our working directory. We can locate data files that ship inside an R package by passing the package argument to the system.file function:

> system.file(package = "mangoTraining", "extdata/djiData.csv")
[1] "C:/Program Files/R/R-3.1.2/library/mangoTraining/extdata/djiData.csv"

Using system.file in this way allows us to write code that is independent of where packages are installed on a particular machine and is therefore more transferable.



Caution: Case-Sensitivity for File Paths

The import and export functions within R work directly with the operating system. If you use an operating system such as Windows that is not case-sensitive, then there is no need to match case for the file path. In other words, djiData.csv is equivalent to djidata.csv. However, if you use an operating system such as Linux, which is case-sensitive, then this case-sensitivity must be respected in file paths.


The read.table function is a generic function for reading in text data, and it makes several assumptions about your data. The important assumptions (or defaults) are that the dataset does not have a row of column headings (header = FALSE) and that elements are separated by whitespace (sep = ""). There are also function arguments to specify the symbol that represents missing data and the characters used for quoting character data. In addition, we can choose the rows at which to start and stop reading the data, which is particularly useful for text output where the first few lines are meta-information before the data actually begins.
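
As a minimal sketch of these arguments in action, assume a hypothetical tab-separated file, output.txt, whose first three lines are meta-information and where a period marks missing values:

> rawData <- read.table("output.txt", header = TRUE, sep = "\t",
+                       na.strings = ".",   # symbol representing missing data
+                       skip = 3,           # ignore the first 3 lines
+                       nrows = 100)        # stop after 100 rows of data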


Tip: The Windows Clipboard

In Windows you can copy and paste your data into R by taking advantage of the “clipboard.” Simply set the file argument in read.table to file = "clipboard". Setting sep = "\t" specifies a tab separator and allows you to copy and paste directly from Excel. However, this practice is generally discouraged as it is not reproducible.



Tip: Troublesome Factors

As you saw in Hour 5, “Dates, Times, and Factors,” when R creates a data frame, the default behavior is to convert anything nonnumeric into factors. This means that you have to carefully handle dates and other columns that have been turned into factors, as well as reorder or relabel factor levels for the factors you do want. If this becomes a major part of your workflow, you might consider the stringsAsFactors argument to read.table. Setting stringsAsFactors=FALSE will prevent any columns being turned into factors, giving you more control over how your data is represented in R.
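
As a minimal sketch, continuing the djiData.csv example from Listing 10.1, we can keep the Date column as character data and convert it ourselves:

> djiData <- read.table("djiData.csv", header = TRUE, sep = ",",
+                       stringsAsFactors = FALSE)
> class(djiData$Date)
[1] "character"
> # Convert the character dates ourselves (format taken from Listing 10.1)
> djiData$Date <- as.Date(djiData$Date, format = "%m/%d/%Y")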


Reading in CSV Files

If you work with CSV files, sooner or later you will become tired of typing header=TRUE, sep="," each time you read in a dataset. The read.csv function is simply a wrapper for read.table that assumes your dataset has headers (header = TRUE) and that the separator is a comma (sep = ","). Note that we are still required to provide the “.csv” file extension when specifying the file we want to read in:

> djiData <- read.csv("djiData.csv")


Note: Comma Used as a Decimal Point?

In some European countries and other countries throughout the world, a comma is used as a decimal point instead of a period, and data elements are instead separated by semicolons. If you work with such data or have colleagues that do, then the read.csv2 function is designed specifically for such data.
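
As a sketch, assuming a hypothetical semicolon-separated file djiData_eu.csv, the following two calls are equivalent:

> euData <- read.csv2("djiData_eu.csv")
> euData <- read.table("djiData_eu.csv", header = TRUE, sep = ";", dec = ",")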


Exporting Text Files

We can write data frames to CSV or other simple text formats using the write.csv or write.table function, respectively. As with read.csv and read.table, the write.csv function is simply a wrapper for write.table that reduces the number of required arguments when exporting .CSV files. Both functions expect the data frame that you want to export as the first argument and the name of the file that you want to create as the second.

As with the read.* functions, there are a number of other useful arguments that can assist with writing out data. In particular, the argument row.names = FALSE prevents the row names (which are often numbers) from being written to the output file. We can also control whether quotes are placed around character data as well as the character used to represent missing data. Here, we write out the internal airquality dataset to our working directory:

> write.csv(airquality, "airquality.csv", row.names = FALSE)
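
The same optional arguments apply to write.table. As a minimal sketch, here we write the same dataset as a tab-separated file, suppress quoting, and use a period to represent missing values:

> write.table(airquality, "airquality.txt", sep = "\t",
+             row.names = FALSE, quote = FALSE, na = ".")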

Faster Imports and Exports

The package data.table has a function called fread that is much faster than read.table for large files. The fread function is also generally easier to use than read.table because it guesses the separator and can interpret common column types that are known to cause trouble for R users. We will look closer at data.table and fread in Hour 12, “Efficient Data Handling in R.”

Another alternative for flat files is readr, released to CRAN by popular R package author Hadley Wickham in 2015. As with fread, the aim of the functions within readr is to improve the speed at which (large) CSV and other flat files can be read into R as well as to interpret common column types to save post-processing effort on the part of the user. The package also produces data frames in a “tbl_df” format, ready for use with the dplyr package, which we will look at in Hour 11, “Data Manipulation and Transformation,” and Hour 12. The main function in readr for reading .CSV files into R is the read_csv function.

Neither data.table nor readr are installed as part of the base R distribution and must therefore be installed separately.
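
As a brief sketch, assuming djiData.csv is again in our working directory, both packages reduce an import to a single call:

> library(data.table)
> djiDT <- fread("djiData.csv")      # separator and column types are guessed
> library(readr)
> djiTbl <- read_csv("djiData.csv")  # returns a tbl_df, ready for dplyr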

Efficient Data Storage

As you saw in Hour 2, when we close R (or RStudio) we have the option of saving our workspace. By saving the objects in our workspace, we are moving them from memory to a single “.RData” file stored on disk. When we start a new R session, our workspace is restored to the same state as when we closed R down.


Caution: Restoring Sessions

When we start a new session using an .RData file, it restores all of the objects but it does not reload all of the packages we were using. Clearly this will cause some problems if any of our objects rely on functionality within the packages that were loaded. Be sure to reload any necessary packages when starting a new session from an .RData file.


To avoid errors and ensure reproducible code, it is generally better to work with a clean environment than rely on a saved workspace. The .RData format is exclusive to R and is therefore not a suitable means of transferring data between applications. However, it can be used as an efficient means of storing large interim datasets during an analysis. A similar .rds format can be used for saving individual datasets.
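
One practical difference between the formats: load restores objects under the names they were saved with, whereas readRDS returns a single object that we assign to any name we choose. A minimal sketch using the internal airquality data:

> saveRDS(airquality, "airquality.rds")
> air <- readRDS("airquality.rds")      # we pick the name "air" here
> save(airquality, file = "airquality.RData")
> load("airquality.RData")              # restores the object as "airquality"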

To illustrate the efficiency of the .RData and .rds formats, let’s create a data frame with 10 million rows and write it out to .CSV, .RData, and .rds formats:

> longData <- data.frame(ID = 1:10000000, Value = rnorm(10000000))
> write.csv(longData, "longData.csv", row.names = FALSE)
> save(longData, file = "longData.RData")
> saveRDS(longData, file = "longData.rds")

We start by deleting the longData object from our session. Now let’s read in the .CSV file and time the operation with a function called system.time:

> rm(longData)
> system.time(longData <- read.csv("longData.csv"))
  user  system elapsed
118.04    1.03  119.31

I’m using a decent machine here with 8GB RAM running 64-bit R, so nearly 2 minutes of elapsed time is pretty slow. So how do load and readRDS perform with the .RData and .rds file types?

> rm(longData)
> system.time(load("longData.RData"))
   user  system elapsed
   0.78    0.03    0.81
> rm(longData)
> system.time(longData <- readRDS("longData.rds"))
   user  system elapsed
   0.81    0.03    0.84

Using the R formats, we are down to less than a second, which is a huge difference. Incidentally, the read_csv function from readr and fread from data.table both managed the same .CSV import in less than 10 seconds. We will look more closely at some R packages that can generally improve R’s speed and efficiency when handling large data during Hour 11 and Hour 12. We will also look at code efficiency in Hour 18, “Code Efficiency.”

Proprietary and Other Formats

If you have previously been using another statistical analysis language such as SAS or SPSS, then you will probably find yourself needing to read .SAS7BDAT or .SAV files into R. One solution would be to use SAS or SPSS to write out a CSV file, which can easily be read into R; however, this is not always possible, and you may find yourself needing to read data from SAS, SPSS, Stata, Minitab, and so on directly into R. Such data can (mostly) be read into R using the foreign package, which is a “recommended” R package and is therefore distributed with each new version of R.

The foreign package is a small collection of functions to read and write data in some well-known data formats. The package works well; however, it is constrained by the proprietary nature of those formats. For example, in order to write to SAS, the package actually generates an intermediary text file and a corresponding SAS script that it tries to call from your SAS installation in order to read the text into SAS.
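
As a minimal sketch of reading with foreign, assuming hypothetical files survey.sav and survey.dta:

> library(foreign)
> spssData  <- read.spss("survey.sav", to.data.frame = TRUE)
> stataData <- read.dta("survey.dta")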


Note: SAS Users

If you are a SAS user, you may find the package sas7bdat useful for reading and writing .SAS7BDAT files. However, you should be aware that the package is documented as being experimental in places and does not work in all cases. If you are working with transport files, the SASxport package provides tools for writing SAS transport files from R.


The haven package provides a wrapper for Evan Miller’s ReadStat C library and offers an alternative to foreign. The package is still in its infancy and limited to SAS, SPSS, and Stata, but unlike foreign it can read the proprietary .SAS7BDAT format, and like readr it can correctly interpret some date formats and generate data that is ready for dplyr.
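
A short sketch of the haven equivalents, again with hypothetical file names:

> library(haven)
> sasData   <- read_sas("study.sas7bdat")
> spssData  <- read_sav("survey.sav")
> stataData <- read_dta("survey.dta")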

Relational Databases

Unfortunately there is no “one-size-fits-all” solution to working with relational databases in R. There are a few general-purpose packages for working with databases, but for the best results you are better off looking for the package that has been built specifically for the database that you are using.

The approach that the various database packages take in R is very much the same. There are typically one or more functions to assist with making a connection to the database, plus a number of utility functions that wrap up common tasks that you might perform in SQL. If you are familiar with SQL, though, you may prefer to write SQL directly, which all the main packages allow you to do.

RODBC

The RODBC package is probably the most well-established method for connecting to a database from R. Note that the package is not installed by default; it must first be installed and loaded. As the name suggests, it implements standard ODBC database connectivity. You can therefore use RODBC to connect to all the popular DBMSs: Oracle, MySQL, Microsoft Access, SQL Server, PostgreSQL, and SQLite. You can even use RODBC to connect to Excel spreadsheets!

Let’s look at an example of an RODBC workflow using the well-known training database distributed with Microsoft Access: Northwind.mdb. The database is available online via the book’s website or within the mangoTraining package. To find the file within mangoTraining we can use the following line:

> system.file(package = "mangoTraining", "extdata/Northwind.mdb")

The RODBC package contains a general-purpose odbcConnect function for connecting to any database, though for Access we can use a “convenience wrapper,” odbcConnectAccess. As always, when importing or connecting to external data from R it is important that we name the connection in order to be able to refer to it. If a username and password are required, these can be entered using the arguments uid and pwd. We start by loading the RODBC package and making a connection to the database. In the following example, it is assumed that the database has been placed in our current working directory. We therefore provide the file name only. Alternatively, a full file path can be specified.

> library(RODBC)
> nWind <- odbcConnectAccess("Northwind.mdb")


Caution: Windows Architecture

The odbcConnectAccess function only works with 32-bit versions of the Microsoft drivers. These cannot be used when working in 64-bit R. For Access 2007 and beyond, there is the option to install 64-bit drivers, though the drivers cannot be installed with 32-bit Office. These compatibility issues can make RODBC difficult (but not impossible) to set up in a managed IT environment. If you run into problems, a sensible first step is to check whether you are running 32-bit or 64-bit R using Sys.getenv("R_ARCH").


The RODBC package contains a number of utility functions, such as sqlTables, that can be used to explore the database. The first argument to each of the utility functions is always the name of the connection:

> nwTableData <- sqlTables(nWind)
> nwTableData[1:3, c("TABLE_NAME", "TABLE_TYPE")]    # Preview main information
         TABLE_NAME   TABLE_TYPE
1 MSysAccessObjects SYSTEM TABLE
2          MSysACEs SYSTEM TABLE
3       MSysCmdbars SYSTEM TABLE

Another useful function is sqlColumns, which returns information about the columns within a specific table:

> sqlColumns(nWind, "Orders")

In order to extract data from the database, we can use wrappers such as sqlFetch to import an entire table or subsets of rows, or we can use SQL commands directly via sqlQuery:

> orderQuery <- "SELECT OrderID, EmployeeID, OrderDate, ShipCountry FROM Orders"
> keyOrderInfo <- sqlQuery(nWind, orderQuery)
> head(keyOrderInfo, 3)
  OrderID EmployeeID  OrderDate ShipCountry
1   10248          5 1996-07-04      France
2   10249          6 1996-07-05     Germany
3   10250          4 1996-07-08      Brazil
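
If we simply want a whole table (or the first few rows of one), the sqlFetch wrapper achieves this without writing any SQL. A brief sketch using the same connection:

> allOrders   <- sqlFetch(nWind, "Orders")           # the whole table
> firstOrders <- sqlFetch(nWind, "Orders", max = 50) # first 50 rows only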

Further utility functions exist in order to clear the rows of a table (sqlClear), drop the table entirely (sqlDrop), and add new tables (sqlSave). When we have finished working with the database, it is important to remember to close the connection, like so:

> odbcClose(nWind)

If making multiple connections, we can use the odbcCloseAll function to close all of them in a single command.

DBI

The RODBC package is an extremely popular, well-tested package, but it is certainly not the only option available. Away from RODBC, the vast majority of R packages available for connecting to databases implement a standard database interface (DBI). Packages such as ROracle, RJDBC, RPostgreSQL, RMySQL, RSQLite, and many more use the interface, which is wrapped in an R package, DBI.

The aim of the DBI is to ensure consistency when working with databases. Each of the packages that uses the interface contains a common set of functions that behave in the same way regardless of which package you are using or which database you are connecting to. The only difference is the connection itself. The standard set of functions follows the format db* (for example, dbReadTable). This standardization makes it incredibly easy to switch between packages because, once you’ve learned how to use one, you can essentially use them all. Alternatively, you can use DBI directly via RSQLite, as Listing 10.2 demonstrates. Note how similar the approach is to the RODBC package, despite the fact that RODBC does not follow DBI.

LISTING 10.2 Using DBI Directly


 1: > library(DBI)
 2: > library(RSQLite)    # We create a SQLite DB
 3: > # Create a new SQLite database in-memory
 4: > dbiCon <- dbConnect(SQLite(), dbname = ":memory:")
 5: >
 6: > # Write airquality to the DB as a new table
 7: > dbWriteTable(dbiCon, "airquality", airquality)
 8: [1] TRUE
 9: >
10: > # Check what columns (fields) are in the airquality table
11: > dbListFields(dbiCon, "airquality")
12: [1] "Ozone"   "Solar.R" "Wind"    "Temp"    "Month"   "Day"
13: >
14: > # Send a query and return the result
15: > aQuery <- "SELECT * FROM airquality WHERE Month = 5 AND Wind > 15"
16: > dbiQuery <- dbSendQuery(dbiCon, aQuery)
17: > dbFetch(dbiQuery)
18:   Ozone Solar.R Wind Temp Month Day
19: 1     8      19 20.1   61     5   9
20: 2     6      78 18.4   57     5  18
21: 3    11     320 16.6   73     5  22
22: 4    NA      66 16.6   57     5  25
23: >
24: > dbClearResult(dbiQuery) # Be tidy!
25: [1] TRUE


Working with Microsoft Excel

If you are reading this book, there is an extremely high likelihood that either you or one of your close colleagues has been using Excel for day-to-day analysis. And why not?! So long as it’s not pushed beyond its limits, it’s a fantastic, easy-to-use tool for generating simple summary statistics. It’s also a tool that very few analysts are willing to throw away, even after they have seen what R is capable of. You won’t be surprised to learn, therefore, that there are a million and one R packages available for connecting R and Excel (well, more than 10 anyway). It probably also won’t surprise you that they all do it in a slightly different way.

Connecting to R from Excel

If you want to link R and Excel, you can either call R from Excel or call Excel from R. Those who want to call R from Excel usually do so because they have a large number of colleagues who are extremely comfortable in Excel and want any analysis to start and end in there. This approach is particularly common in the insurance industry, where the underwriters typically consume advanced algorithms that actuaries have developed in R but via an Excel front end.

There are a number of ways of calling R from Excel, depending on the level of sophistication you require. At some point, a Microsoft language such as VBA or C# will be required to call R, either via the command line or using a technology such as Rserve. The focus of this book is on R, however, so we will look at the methods for connecting to Excel from R.

Reading Structured Data from Excel

If you have structured data—that is, data that is neatly laid out such that each tab of your workbook contains just a single table of data, usually stored in the top-left corner of the sheet—then there are some very efficient options available to you for reading in data from Excel. One such package is RODBC, which you have just seen in the context of databases. Using RODBC, we connect to a workbook using the odbcConnectExcel function for .XLS files or odbcConnectExcel2007 for .XLSX files. We then treat the workbook like a mini database, where each tab is a separate table. All of the standard SQL wrappers work in the same way as for other types of database. The RJDBC package can similarly be used with Excel.

An alternative solution designed specifically to work with structured data in Excel is Hadley Wickham’s readxl package. This package was released in 2015 and, in a similar vein to readr, aims to improve the speed at which data can be read from Excel. Likewise, it also produces tbl_df output for use with dplyr.

Let’s start with a simple example using the airquality.xlsx workbook. This workbook can be found in the mangoTraining package. As with other examples in this hour, we can use the following line to locate the file within the package:

> system.file(package = "mangoTraining", "extdata/airquality.xlsx")

The workbook consists of a single sheet named “data” containing a copy of the internal airquality data frame. We start by loading the package and using the excel_sheets function to return the sheet names.

> library(readxl)
>
> # What sheets does the workbook contain?
> excel_sheets("airquality.xlsx")
[1] "data"

Next we use the primary read_excel function to read the airquality.xlsx file. We pass the name of the sheet we want to read as the second argument. As an alternative, we can provide the sheet position, in this case 1. Since 1 is the default sheet position, we could leave out the argument altogether in the following example:

> # Read in the "data" sheet
> air <- read_excel("airquality.xlsx", sheet = "data")
> head(air, 3)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3

The function automatically ignores blank rows and columns until it finds a cell containing data; however, we can skip rows at the top of a sheet using the skip argument and control column naming with the col_names argument. We can use the col_types argument to specify a vector of the types of data contained within each column, including date ("date") columns.
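
As a brief sketch, assume a hypothetical workbook trades.xlsx in which two title rows precede a date column, a text column, and a numeric column:

> trades <- read_excel("trades.xlsx", skip = 2,
+                      col_types = c("date", "text", "numeric"))

The readxl package also works with the older .xls format. It cannot be used to write to Excel workbooks, however. For that we need one of four “all-rounder” packages.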

Connecting to Excel from R

At the time of writing there are four “all-rounder” packages that can both read and write to Excel from R. Two of the four, XLConnect and xlsx, are very similar in their approach and use Java (via the rJava package) underneath to make the connection. The other two are the openxlsx package and the excel.link package, the latter taking the most distinctive approach of the four.

The first three packages each follow a similar workflow idea, albeit implemented in slightly different ways using functions with slightly different names. That workflow involves creating an image of the workbook in R that can be manipulated before saving any changes back to the workbook or to a new file. The excel.link package instead uses the RDCOMClient package to open an Excel workbook and edit it live using R code.

The XLConnect Package

Let’s walk through an example of a typical analysis workflow using XLConnect. In this workflow we will take the following steps:

1. Connect to a workbook.

2. Import data from one of the tabs.

3. Generate some statistical summaries of the key columns.

4. Create a simple plot (using the hist function from the graphics package, which is covered in Hour 13, “Graphics”).

5. Write the summary data and graphic back to new tabs in the workbook.

6. Save the workbook with a new filename.

Making the connection results in a named R object that we must reference when using any of the other functions within the package. Note that, strictly speaking, we are not actually making a connection but a copy of the workbook, which is held in memory: The workbook can still be opened and edited from Excel while we are making changes in R. The loadWorkbook function can also be used to create new workbooks.

> airWB <- loadWorkbook("airquality.xlsx")
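
To create a brand-new workbook rather than open an existing one, we can pass create = TRUE; here, report.xlsx is a hypothetical file name:

> newWB <- loadWorkbook("report.xlsx", create = TRUE)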


Caution: Java Dependency

Loading the XLConnect package is not as straightforward as for other packages due to the reliance on the rJava package, which itself has a reliance on the Java SE Development Kit, better known as JDK. If JDK is not installed, R cannot find JAVA_HOME and the XLConnect package fails to load. In most cases, simply installing the appropriate version of JDK (greater than 1.4 is required for rJava) for your operating system and architecture (that is, the 32-bit or the 64-bit version) and accepting all defaults fixes the issue. Instructions for installing JDK versions and the required executable can be found at http://www.oracle.com.


Once we have made our connection, we can use a function such as getSheets or getDefinedNames to explore the workbook:

> getSheets(airWB)
[1] "data"

Once we’re done exploring, we can use a function such as readWorksheet, readNamedRegion, or readTable to read in data from the workbook. In this case we use readWorksheet. The function automatically ignores blank rows and columns until it finds a cell containing data. Otherwise, we can use the arguments startRow, endRow, startCol, and endCol to specify the exact location of the data within the sheet. Note the use of the sheet name in the second argument. We could also have used the sheet index.

> air <- readWorksheet(airWB, "data")
> head(air)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6


Tip: Indexing Columns

In Excel, rows are referenced numerically, whereas columns are referenced alphabetically. In R, we tend to work with numerical referencing for both, and the XLConnect package is no different. The col2idx function is useful for converting column references such as AA into their equivalent numeric positions:

> col2idx("AA")
[1] 27


Next, we summarize the data using the aggregate function that we discuss in Hour 11 and create a histogram as a PNG file in our working directory using the hist function from the graphics package, which we will explore fully in Hour 13.

> # Summary Data
> averageOzone <- aggregate(data = air, Ozone ~ Month, mean, na.rm = TRUE)
>
> # Graphic as png
> png("Ozone_Levels.png")
> hist(air$Ozone, col = "lightblue",
+      main = "Histogram of Ozone Levels in New York May to September 1973",
+      xlab = "Ozone (ppb)")
> dev.off()

In this next-to-last step, we create a new sheet and load it with the summary data and graphic we just created. Note the use of createName to create a new named region within the workbook, which is then used to place the graphic. Note also the use of the argument originalSize = TRUE. This ensures that the image dimensions are retained and that it is not resized to fit the named region.

> # New tab
> createSheet(airWB, "Summary")
>
> # Write summary data
> writeWorksheet(airWB, averageOzone, "Summary", startRow = 2, startCol = 2)
>
> # Add graphic
> createName(airWB, "PlotGoesHere", "Summary!$E$2")
> addImage(airWB, filename = "Ozone_Levels.png", name = "PlotGoesHere",
+          originalSize = TRUE)

Finally, we set the Summary tab to be the current active tab so that when we next open the workbook this is the tab we see, and then we save the workbook. A screenshot of the final workbook open in Excel is shown in Figure 10.1.

> # Set active sheet and close
> setActiveSheet(airWB, "Summary")
> saveWorkbook(airWB, "air_summary.xlsx")


FIGURE 10.1 Writing data and graphics to Excel from R

XLConnect has many more features, many of which are replicated using similarly named functions in xlsx and openxlsx. Such features include formatting, writing Excel formulas, and merging cells.

In our experience, the biggest restriction of XLConnect (and xlsx) is the large amount of memory required when working with Excel workbooks. There are options for dealing with memory issues, but eventually you will reach a limit and may need to explore one of the other options.

Summary

In this hour, we looked at some of the primary methods for importing data into R for analysis. You saw how to easily read and write text files using functions such as read.csv and write.csv, and that if your data is large you can use faster alternatives from the data.table and readr packages. You also saw how R’s .RData and .rds formats can be used as an efficient means of storing data on disk.

You also saw how to use either the RODBC or DBI syntax to connect to and edit a DBMS from R and how to connect to an Excel spreadsheet using XLConnect. The “Activities” section provides an opportunity for you to try these tools yourself. In the next hour, we will continue with the data workflow and look at manipulation and transformation in R.

Q&A

Q. A colleague of mine is using xlsx to connect to Excel. Should I encourage them to switch to XLConnect?

A. After installation there really is very little difference between the two packages (or openxlsx for that matter). In certain circumstances you may experience limitations with one or the other, but if your colleague is using xlsx and your only experience of XLConnect to date is what you’ve read in this hour, then you may as well begin learning xlsx.

Q. You say that RODBC can be used to read structured data from Excel. Can it be used to write data to Excel as well?

A. Absolutely. For reasons of efficiency this is not the default behavior, but if you specify readOnly = FALSE when calling odbcConnectExcel or odbcConnectExcel2007, you can override the default and write tables back to the spreadsheet.

Workshop

The workshop contains quiz questions and exercises to help you solidify your understanding of the material covered. Try to answer all questions before looking at the “Answers” section that follows.

Quiz

1. What argument prevents row numbers or names being written to a CSV file when using write.csv?

2. In which R packages would you find the functions read.csv, read_csv, and fread?

3. What binary format can you use to store R objects on disk?

4. Is it possible to use a 32-bit ODBC driver to connect to Excel from 64-bit R?

5. Which RMySQL function can be used to read tables from a database?

6. Name three packages that can be used to connect to Excel from R.

Answers

1. To prevent R’s default behavior of writing out what are known as “row names,” you specify row.names = FALSE.

2. The read.csv function is in the utils package that is distributed with R. The read_csv and fread functions can be found in readr and data.table, respectively.

3. You use the .RData format to save any number of objects from your workspace to disk (or the .rds format for a single object). This facilitates easy loading later on.

4. No. You need to ensure that the R architecture matches the ODBC driver architecture to use RODBC.

5. We didn’t explicitly cover RMySQL, but it is a DBI package and therefore the dbReadTable function can be used.

6. In this hour, we mentioned several, including XLConnect, xlsx, openxlsx, excel.link, RODBC, and readxl. There is also the gdata package, which offers general programming tools for data manipulation.

Activities

1. Read the NST-EST2014-01.csv data containing annual estimates of the resident population for the United States, Regions, States, and Puerto Rico from April 1, 2010 to July 1, 2014, taken from the U.S. Census Bureau.

2. Write out R’s internal quakes dataset to a .CSV file. Ensure that row numbers are not written to the file.

3. Simulate a million records of a demographic data frame containing columns ID, Age, Sex, Weight, and Height and then save the data to an .RData file.

4. Make a connection to the Northwind database:

Create data tables from the Order Details and Orders tables.

Merge the two tables based on the Order ID.

Calculate the mean unit price by Customer ID.

Save this data back to the Northwind database.

5. Use the XLConnect package to create an Excel workbook containing R’s internal mtcars data:

Install JDK from http://www.oracle.com.

Install and load the XLConnect package.

Use loadWorkbook to create a new file.

Write the mtcars dataset to this file.

Save the workbook.
