Managing data with R

One of the challenges faced while working with massive datasets involves gathering, preparing, and otherwise managing data from a variety of sources. Although we will cover data preparation, data cleaning, and data management in depth by working on real-world machine learning tasks in later chapters, this section highlights the basic functionality for getting data in and out of R.

Saving, loading, and removing R data structures

When you have spent a lot of time getting a data frame into the desired form, you shouldn't need to recreate your work each time you restart your R session. To save a data structure to a file that can be reloaded later or transferred to another system, use the save() function. The save() function writes one or more R data structures to the location specified by the file parameter. R data files have an .RData extension.

Suppose you had three objects named x, y, and z that you would like to save to a permanent file. Regardless of whether they are vectors, factors, lists, or data frames, they can be saved to a file named mydata.RData using the following command:

> save(x, y, z, file = "mydata.RData")

The load() command can recreate any data structures that have been saved to an .RData file. To load the mydata.RData file created in the preceding code, simply type:

> load("mydata.RData")

This will recreate the x, y, and z data structures in your R environment.
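As a minimal sketch of the full round trip, the following code saves three objects, removes them, and loads them back. The object values are made up for illustration, and a temporary file created with tempfile() stands in for mydata.RData so that nothing in your working directory is touched:

```r
# Illustrative objects; the values are assumptions made up for this sketch
x <- c(1, 2, 3)
y <- factor(c("a", "b", "a"))
z <- list(id = 1, label = "example")

# Save to a temporary file so the working directory is left alone
rdata_file <- tempfile(fileext = ".RData")
save(x, y, z, file = rdata_file)

# Remove the objects, then restore them from the file
rm(x, y, z)
load(rdata_file)

# Check that all three objects are back
exists("x") && exists("y") && exists("z")
```

The restored objects keep their original types, so y is still a factor and z is still a list.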

Tip

Be careful what you are loading! All data structures stored in the file you are importing with the load() command will be added to your workspace, even if they overwrite something else you are working on.
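One way to inspect a file before it touches your workspace is to load it into a separate environment using the envir parameter of load(). The following is a sketch under assumed names: the loaded environment, the temporary file, and the values of x are hypothetical and chosen only for illustration:

```r
# Create a file containing an object named x (illustrative value)
x <- "saved version"
rdata_file <- tempfile(fileext = ".RData")
save(x, file = rdata_file)

# Meanwhile, x in the workspace changes
x <- "current work"

# Load into a separate environment instead of the workspace
loaded <- new.env()
loaded_names <- load(rdata_file, envir = loaded)

loaded_names                # names of the objects stored in the file
x                           # the workspace copy is untouched
get("x", envir = loaded)    # the saved copy, kept apart
```

Because load() returns the names of the objects it restored, you can see what a file contains before deciding whether to copy anything into your workspace.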

If you need to wrap up your R session in a hurry, the save.image() command will write every object in your workspace to a file simply called .RData in the working directory. By default, R will look for this file the next time you start R from that directory, and your objects will be restored just as you had left them.

After you've been working in an R session for some time, you may have accumulated a number of data structures. The listing function ls() returns a vector of the names of all data structures currently in memory. For example, if you've been following along with the code in this chapter, the ls() function returns the following:

> ls()
[1] "blood"        "flu_status"   "gender"       "m"           
[5] "pt_data"      "subject_name" "subject1"     "symptoms"    
[9] "temperature"

R automatically clears all data structures from memory upon quitting the session, but for large objects, you may want to free up the memory sooner. The remove function rm() can be used for this purpose. For example, to eliminate the m and subject1 objects, simply type:

> rm(m, subject1)

The rm() function can also be supplied with a character vector of object names to remove. This works with the ls() function to clear the entire R session:

> rm(list = ls())

Be very careful when executing the preceding code, as you will not be prompted before your objects are removed!
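A more cautious approach, sketched below, is to build the character vector of names first, review it, and only then pass it to rm(). The example uses the pattern parameter of ls(), which takes a regular expression; the objects and their values are made up for illustration:

```r
# Illustrative objects (names echo this chapter; values are made up)
subject_name <- c("John Doe", "Jane Doe")
subject1 <- list(name = "John Doe")
temperature <- c(98.1, 98.6)

# Preview which objects match before removing anything
to_remove <- ls(pattern = "^subject")
to_remove

# Remove only the previewed objects; temperature is left in place
rm(list = to_remove)
```

Printing to_remove before calling rm() gives you the confirmation step that rm() itself does not provide.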

Importing and saving data from CSV files

It is very common for public datasets to be stored in text files. Text files can be read on virtually any computer or operating system, which makes the format nearly universal. They can also be exported from and imported into programs such as Microsoft Excel, providing a quick and easy way to work with spreadsheet data.

A tabular (as in "table") data file is structured in matrix form, such that each line of text reflects one example, and each example has the same number of features. The feature values on each line are separated by a predefined symbol known as a delimiter. Often, the first line of a tabular data file lists the names of the data columns. This is called a header line.

Perhaps the most common tabular text file format is the comma-separated values (CSV) file, which, as the name suggests, uses the comma as a delimiter. CSV files can be imported to and exported from many common applications. A CSV file representing the medical dataset constructed previously could be stored as:

subject_name,temperature,flu_status,gender,blood_type
John Doe,98.1,FALSE,MALE,O
Jane Doe,98.6,FALSE,FEMALE,AB
Steve Graves,101.4,TRUE,MALE,A

Given a patient data file named pt_data.csv located in the R working directory, the read.csv() function can be used as follows to load the file into R:

> pt_data <- read.csv("pt_data.csv", stringsAsFactors = FALSE)

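This will read the CSV file into a data frame named pt_data. Just as we did previously when constructing a data frame, we use the stringsAsFactors = FALSE parameter to prevent R from converting all text variables to factors. This conversion was the default before R 4.0.0; newer versions leave text as character vectors unless told otherwise, but stating the parameter explicitly keeps the code's behavior the same across versions. Unless you are certain that every text column in the CSV file is truly a factor, this step is better left to you, not R, to perform.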
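To try this without a pt_data.csv file on disk, the following sketch writes the example rows to a temporary file, reads them back, and then converts only the columns that are genuinely categorical. The temporary file and the choice of blood type levels are assumptions made for illustration:

```r
# Write the example rows to a temporary file (a stand-in for pt_data.csv)
csv_file <- tempfile(fileext = ".csv")
writeLines(c(
  "subject_name,temperature,flu_status,gender,blood_type",
  "John Doe,98.1,FALSE,MALE,O",
  "Jane Doe,98.6,FALSE,FEMALE,AB",
  "Steve Graves,101.4,TRUE,MALE,A"
), csv_file)

pt_data <- read.csv(csv_file, stringsAsFactors = FALSE)

# Text columns arrive as character vectors; convert only the true categories
pt_data$gender <- factor(pt_data$gender)
pt_data$blood_type <- factor(pt_data$blood_type,
                             levels = c("A", "B", "AB", "O"))

# Inspect the resulting column types
str(pt_data)
```

Here subject_name stays a character vector, since names are identifiers rather than categories, while gender and blood_type become factors by explicit choice.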

Tip

If your dataset resides outside the R working directory, the full path to the CSV file (for example, "/path/to/mydata.csv") can be used when calling the read.csv() function.
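If you are unsure where the working directory is, a short check along the following lines may help. The path shown is the same placeholder used above, not a real location, so file.exists() is used to test it before any attempt to read:

```r
# Show the current working directory
getwd()

# Build a full path from its parts ("/path/to" is a placeholder)
csv_path <- file.path("/path/to", "mydata.csv")
csv_path

# Confirm the file is there before attempting to read it
if (file.exists(csv_path)) {
  mydata <- read.csv(csv_path, stringsAsFactors = FALSE)
}
```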

By default, R assumes that the CSV file includes a header line listing the names of the features in the dataset. If a CSV file does not have a header, specify the option header = FALSE as shown in the following command, and R will assign default feature names by numbering them as V1, V2, and so on:

> mydata <- read.csv("mydata.csv", stringsAsFactors = FALSE,
                       header = FALSE)
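
The default names are rarely informative, so you will usually want to replace them. As a sketch, assuming a headerless file with three columns (the temporary file and the column names are made up for illustration):

```r
# Write a small headerless CSV to a temporary file
csv_file <- tempfile(fileext = ".csv")
writeLines(c(
  "John Doe,98.1,FALSE",
  "Jane Doe,98.6,FALSE"
), csv_file)

mydata <- read.csv(csv_file, stringsAsFactors = FALSE, header = FALSE)

# Default names assigned by R
names(mydata)

# Replace the defaults with descriptive names
names(mydata) <- c("subject_name", "temperature", "flu_status")
```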

The read.csv() function is a special case of the read.table() function, which can read tabular data in many different forms, including other delimited formats such as tab-separated values (TSV). For more detailed information on the read.table() family of functions, refer to the R help page using the ?read.table command.
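For instance, a tab-separated file can be read either by telling read.table() about the delimiter and header, or by using read.delim(), a member of the same family whose defaults suit tab-delimited files. The following sketch uses a temporary file with made-up rows:

```r
# Write a small tab-separated file to a temporary location
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c(
  "subject_name\ttemperature\tflu_status",
  "John Doe\t98.1\tFALSE",
  "Jane Doe\t98.6\tFALSE"
), tsv_file)

# read.table() needs the delimiter and header stated explicitly
pt_tsv <- read.table(tsv_file, sep = "\t", header = TRUE,
                     stringsAsFactors = FALSE)

# read.delim() is a wrapper with tab-delimited defaults
pt_tsv2 <- read.delim(tsv_file, stringsAsFactors = FALSE)

# Compare the two results
all.equal(pt_tsv, pt_tsv2)
```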

To save a data frame to a CSV file, use the write.csv() function. If your data frame is named pt_data, simply enter:

> write.csv(pt_data, file = "pt_data.csv", row.names = FALSE)

This will write a CSV file with the name pt_data.csv to the R working directory. The row.names parameter overrides R's default setting, which is to output row names in the CSV file. Generally, this output is unnecessary and will simply inflate the size of the resulting file.
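To see the effect of row.names, the following sketch writes the same small data frame twice and compares the header lines. The data values and temporary files are assumptions made for illustration:

```r
# A small illustrative data frame (values are made up)
pt_data <- data.frame(
  subject_name = c("John Doe", "Jane Doe"),
  temperature = c(98.1, 98.6),
  stringsAsFactors = FALSE
)

with_rows <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

write.csv(pt_data, file = with_rows)                        # default: row names written
write.csv(pt_data, file = without_rows, row.names = FALSE)  # row names omitted

# Compare the first line of each file
readLines(with_rows, n = 1)
readLines(without_rows, n = 1)

# Reading the first file back yields an extra, unnamed-in-the-file column
ncol(read.csv(with_rows))
ncol(read.csv(without_rows))
```

The extra column is the practical cost of the default: it is written with an empty header, and it comes back as an additional variable when the file is read again.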
