In the previous recipe, we familiarized ourselves with heat map functions in R using built-in data. Now, we will learn how to use external data to create our heat maps.
In real life, we often have no control over the format of the data files that we download, or of the data that a particular program produces. Because we cannot rely on the data arriving in the right format, we will learn in this recipe how to read in data from different file formats and get it into shape.
The following image shows a heat map that we are going to create from the gene_expression.txt data set in this recipe:
Download the script 5644OS_02_01.r and the data sets gene_expression.txt, runners.csv, and apple_stocks.xlsx from your account at http://www.packtpub.com and save them to your hard drive.
I recommend that you download and save the script and data files to the same folder on your hard drive. If you execute the script from a location other than the folder that holds the data files, you have to change the current R working directory accordingly.
You can change the current working directory of your R session by using the setwd() function. If you saved the data files for this recipe under /home/user/Downloads, for example, you would type setwd("/home/user/Downloads") into the R command line. Alternatively, you can uncomment the fourth line of the script and provide the location of the data file directory to the setwd() function in a similar way.
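As a minimal sketch of this workflow, the following lines remember the current working directory, switch to a data folder, and switch back again. The folder used here is a temporary directory that only serves as a stand-in (an assumption for illustration); replace it with the folder where you saved the data files.

```r
# Remember where the R session currently is
old_dir <- getwd()

# Hypothetical data folder: a temporary directory stands in for
# the folder that holds your downloaded data files
data_dir <- tempdir()

# Switch to the data folder and list what R can see there
setwd(data_dir)
print(getwd())
print(list.files())

# Switch back so that later code is not affected
setwd(old_dir)
```

Restoring the previous directory at the end is optional, but it keeps the rest of the R session unaffected.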
For more information on how to view the current working directory of your current R session and an explanation on how to run scripts in R, please read the Getting ready section of the Creating your first heat map in R recipe.
After you have executed the following code from the 5644OS_02_01.r script, take a look at the PDF file readingData.pdf, which was created in the same location where you executed the script:
# if you are running the script from a different location
# than the location of the data sets, uncomment the
# next line and point setwd() to the data set location
# setwd("/home/username/Datasets")

### loading packages
if (!require("gplots")) {
  install.packages("gplots", dependencies = TRUE)
  library(gplots)
}
if (!require("lattice")) {
  install.packages("lattice", dependencies = TRUE)
  library(lattice)
}
if (!require("xlsx")) {
  install.packages("xlsx", dependencies = TRUE)
  library(xlsx)
}

pdf("readingData.pdf")

### loading data and drawing heat maps

# 1) gene_expression.txt
gene_data <- read.table("gene_expression.txt",
  comment.char = "/",
  blank.lines.skip = TRUE,
  header = TRUE,
  sep = "\t",
  nrows = 20)
gene_data <- data.matrix(gene_data)
gene_ratio <- outer(gene_data[,"Treatment"],
  gene_data[,"Control"],
  FUN = "/")
heatmap.2(gene_ratio,
  xlab = "Control",
  ylab = "Treatment",
  trace = "none",
  main = "gene_expression.txt")

# 2) runners.csv
runner_data <- read.csv("runners.csv")
rownames(runner_data) <- runner_data[,1]
runner_data <- data.matrix(runner_data[,2:ncol(runner_data)])
colnames(runner_data) <- c(2003:2012)
# treat times of 0.00 as missing values
runner_data[runner_data == 0.00] <- NA
heatmap.2(runner_data,
  dendrogram = "none",
  Colv = NA,
  Rowv = NA,
  trace = "none",
  na.color = "gray",
  main = "runners.csv",
  margin = c(8,10))

# 3) apple_stocks.xlsx
stocks_table <- read.xlsx("apple_stocks.xlsx",
  sheetIndex = 1,
  rowIndex = c(1:28),
  colIndex = c(1:5,7))
row_names <- stocks_table[,1]
stocks_matrix <- data.matrix(stocks_table[2:ncol(stocks_table)])
rownames(stocks_matrix) <- as.character(row_names)
stocks_data <- t(stocks_matrix)
print(levelplot(stocks_data,
  col.regions = heat.colors,
  margin = c(10,10),
  scales = list(x = list(rot = 90)),
  main = "apple_stocks.xlsx",
  ylab = NULL,
  xlab = NULL))

dev.off()
Generally, R is able to read data from any file that contains data in a proper text format. First, we will see how to use the universal read.table() function to read data from a .txt file. Next, the recipe shows us how to read data from a Comma Separated Values (CSV) file using the read.csv() function. Finally, we take a look at the read.xlsx() function from the xlsx package, which lets us process Microsoft Excel spreadsheet files.
The gene_expression.txt data set contains example gene expression data obtained under two different conditions: control and treatment. The individual values in this data file represent fold-differences of gene expression relative to a housekeeping gene that was used as a reference to normalize the data. The data was saved as a .txt file and consists of 105 lines. The first two lines are comments about the data and begin with a slash (/) as the first character. They are followed by two blank lines and then by a header that labels the two data columns of expression data under the two conditions.
Notice that the data columns are separated by tabs, as shown in the following screenshot of the gene_expression.txt data set:
We use the read.table() function to read this data file into R as a data table:
gene_data <- read.table("gene_expression.txt",
  comment.char = "/",
  blank.lines.skip = TRUE,
  header = TRUE,
  sep = "\t",
  nrows = 20)
When we use the read.table() function, R ignores every line in the data file that starts with a hash mark (#), which is the default comment character. However, the first character of the comment lines in our data file is a forward slash (/). Therefore, we have to pass it to the function as an additional argument for the comment.char parameter so R can interpret the first two lines in gene_expression.txt as comments and skip them. The second argument, blank.lines.skip = TRUE, ensures that R also skips the two blank lines after the comment section.
From here on, R will read every line that follows and will interpret it as data—skipping the comment and blank lines at the beginning.
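To see the effect of these two arguments in isolation, here is a small sketch that does not need the downloaded file. It writes a stand-in file with the same layout (two comment lines starting with a slash, two blank lines, a header, and tab-separated rows) to a temporary location and reads it back. The file contents and the name demo_file are made up for illustration.

```r
# Write a small stand-in file: two comment lines starting with "/",
# two blank lines, a header, and two tab-separated data rows
demo_file <- tempfile(fileext = ".txt")
writeLines(c("/ example comment 1",
             "/ example comment 2",
             "",
             "",
             "Control\tTreatment",
             "Gene1\t1.5\t3.0",
             "Gene2\t2.0\t1.0"),
           demo_file)

# The comment lines and blank lines are skipped; the header has one
# field fewer than the data rows, so the first column becomes row names
demo_data <- read.table(demo_file,
                        comment.char = "/",
                        blank.lines.skip = TRUE,
                        header = TRUE,
                        sep = "\t")
print(demo_data)
```

One caveat of this approach: with comment.char = "/", R treats everything after a slash on any line as a comment, so it only suits files whose data fields contain no slashes.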
There are two data columns in this text file, which are labeled as Control and Treatment. By setting the header parameter to TRUE, R will store these labels as a header for our data table. The header is followed by 100 rows of tab-delimited data (Gene1 to Gene100). We use the argument sep = "\t" to set the field separator character to a tab; the default field separator, sep = "", splits fields on any white space.
The data file is quite large with its 100 entries, and for this example, we use only the first 20 genes to create our heat map. We can tell R to stop reading data after the 20th entry by providing the argument nrows = 20 in the function call.
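The following sketch illustrates how nrows interacts with the header. It generates a stand-in file with 100 tab-separated rows (the values are made up, and the file is not the real gene_expression.txt) and then reads only the first 20 of them.

```r
# Build a stand-in file with a header and 100 tab-separated rows
# (values are made up for illustration)
demo_file <- tempfile(fileext = ".txt")
gene_names <- paste0("Gene", 1:100)
rows <- paste(gene_names, 1:100, 101:200, sep = "\t")
writeLines(c("Control\tTreatment", rows), demo_file)

# Read only the first 20 data rows; the header line is not counted
first_20 <- read.table(demo_file,
                       header = TRUE,
                       sep = "\t",
                       nrows = 20)
print(dim(first_20))
print(tail(rownames(first_20), 1))
```

The header line does not count towards nrows, so the resulting data table ends with Gene20.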
To compare the Treatment column in relation to the Control column, we use the outer() function. This function allows us to create a 20 x 20 matrix by dividing each gene's value in the Treatment column (third column in the data file) by each gene's value in the Control column (second column in the data file):
gene_data <- data.matrix(gene_data)
gene_ratio <- outer(gene_data[,"Treatment"],
  gene_data[,"Control"],
  FUN = "/")
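A tiny example may make the shape of the result clearer. The sketch below applies outer() to two short vectors with made-up values; treatment and control are illustrative names, not objects from the recipe's script.

```r
# Made-up expression values for three genes
treatment <- c(Gene1 = 2, Gene2 = 4, Gene3 = 8)
control   <- c(Gene1 = 1, Gene2 = 2, Gene3 = 4)

# Entry [i, j] of the result is treatment[i] / control[j],
# so three values per vector give a 3 x 3 matrix
ratio <- outer(treatment, control, FUN = "/")
print(ratio)
```

Rows correspond to the first argument (treatment) and columns to the second (control), which matches the ylab and xlab settings used in the heatmap.2() call of the script.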
The runners.csv file contains the fastest personal times of seven popular 100-meter sprinters for the years 2003 to 2012. If we want to read a data file with comma-separated values, it is most convenient to use the read.csv() function:
runner_data <- read.csv("runners.csv")
In R, missing values are denoted by NA, which stands for Not Available. Therefore, if we read data from a file that contains empty fields, missing values will be replaced by NA when R creates the data table.
When we read in runners.csv, we see that our data table contains many 0.00 values, which means that no time was recorded for the runner in those years. However, it would make more sense to have those values denoted as missing data (NA) in our heat map.
So, if we want R to interpret values other than NA or empty fields as missing values, we provide an argument for the na.strings parameter in our read.table() or read.csv() function call. Because na.strings expects character strings, we pass the value in quotes, written as it appears in the file:
runner_data <- read.csv("runners.csv", na.strings = "0.00")
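The sketch below compares two ways of turning 0.00 entries into NA: declaring them as missing-value strings while reading, and replacing them after reading, as the recipe's script does. The CSV content and the runner names are made up for illustration and only mimic the layout of runners.csv.

```r
# Write a small stand-in CSV file (made-up runners and times)
demo_csv <- tempfile(fileext = ".csv")
writeLines(c("Runner,2003,2004",
             "Runner_A,0.00,10.02",
             "Runner_B,9.95,0.00"),
           demo_csv)

# Option 1: declare "0.00" as a missing-value string while reading.
# "NA" is listed as well because na.strings replaces the default.
at_read <- read.csv(demo_csv, na.strings = c("NA", "0.00"))
print(at_read)

# Option 2: read first, convert the time columns to a numeric
# matrix, and replace the zeros afterwards
after_read <- read.csv(demo_csv)
times <- data.matrix(after_read[, -1])
times[times == 0] <- NA
print(times)
```

The first option matches the text of the field, so "0.00" has to be spelled the way it appears in the file; the second option compares numbers and therefore also catches entries such as 0 or 0.0.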
By default, cells with NA values will be left blank and appear as white cells in our heat map. This can be very misleading, since the color palette that we are using converges into a very bright yellow. It would be very hard to distinguish those empty cells from values at the high end of our color key.
To avoid this confusion, we can simply assign a different color to those cells that contain missing values. Here, we are coloring them in gray:
heatmap.2(runner_data,
dendrogram = "none",
Colv = NA,
Rowv = NA,
trace = "none",
na.color = "gray",
main = "runners.csv",
margin = c(8,10))
The following screenshot shows the apple_stocks.xlsx data set opened in Microsoft Excel 2011:
R has no built-in function to read data from this proprietary format. Fortunately, the xlsx package, which is freely available on CRAN, was developed for exactly this case. We can therefore use its read.xlsx() function to conveniently get the data from apple_stocks.xlsx into R:
stocks_table <- read.xlsx("apple_stocks.xlsx", sheetIndex = 1, rowIndex = c(1:28), colIndex = c(1:5,7))
With sheetIndex, we select which sheet we want to read from the Excel spreadsheet file; apple_stocks.xlsx only has data on sheet 1. The spreadsheet consists of 7168 rows of data, but we are interested only in the recent stock data of 2013. So, to read in only those first 28 lines, we included the rowIndex = c(1:28) argument in the read.xlsx() function call above. Furthermore, we are not interested in the Volume column (column 6), so we skip it via the argument colIndex = c(1:5,7).
In most cases, we will use the read.table() function to read our data into R. The following table summarizes the most important options:
It might happen that we obtain our data in the so-called long format, which contains multiple rows for each individual category, individual or item.
To give you a better understanding, an excerpt from the runners.csv data set in long format is shown as follows:
  Year       Runner  Time
1 2007   Usain_Bolt 10.03
2 2008   Usain_Bolt  9.72
3 2009   Usain_Bolt  9.58
4 2010   Usain_Bolt  9.82
5 2011   Usain_Bolt  9.76
6 2012   Usain_Bolt  9.63
7 2004 Asafa_Powell 10.02
8 2005 Asafa_Powell  9.87
...
Do you see the problem here? The Runner column in the middle of the data table consists of character strings, which are incompatible with the numeric matrix format that is required by our heat map functions.
Fortunately, it is quite easy to convert data from long to wide format. We can simply use the cross tabulation function xtabs():
runners_wide <- xtabs(formula = Time ~ Runner + Year, data = runners_long)

              Year
Runner          2004  2005  2007  2008  2009  2010  2011  2012
  Asafa_Powell 10.02  9.87  0.00  0.00  0.00  0.00  0.00  0.00
  Usain_Bolt    0.00  0.00 10.03  9.72  9.58  9.82  9.76  9.63
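The conversion can be tried out without the data file. The sketch below rebuilds a few rows of the long-format excerpt as a data frame, cross-tabulates it, and then turns the result into a plain numeric matrix. The conversion step via as.data.frame.matrix() and the replacement of zeros by NA are one possible follow-up, not something the recipe's script does.

```r
# A few rows of the long-format excerpt, rebuilt as a data frame
runners_long <- data.frame(
  Year   = c(2007, 2008, 2009, 2004, 2005),
  Runner = c("Usain_Bolt", "Usain_Bolt", "Usain_Bolt",
             "Asafa_Powell", "Asafa_Powell"),
  Time   = c(10.03, 9.72, 9.58, 10.02, 9.87)
)

# Cross-tabulate: one row per runner, one column per year
runners_wide <- xtabs(Time ~ Runner + Year, data = runners_long)

# Convert the table to a plain numeric matrix
runners_matrix <- as.matrix(as.data.frame.matrix(runners_wide))

# Runner/year combinations without a recorded time were filled
# with 0 by xtabs(); mark them as missing values instead
runners_matrix[runners_matrix == 0] <- NA
print(runners_matrix)
```

Note that xtabs() sums the values in each cell, so this approach assumes at most one time per runner and year.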
Several countries use a decimal comma instead of a decimal point. Chances are high that you want to analyze a data set that comes from one of those countries. In this case, you just have to provide an additional argument for the dec parameter:
my_data <- read.table("data.txt", dec=",")
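As a short, self-contained illustration, the sketch below writes a stand-in file that uses decimal commas and semicolons as field separators, and reads it back as numeric data. The file contents are made up, and the semicolon separator is an assumption about how such files are commonly laid out.

```r
# Stand-in file with decimal commas and semicolon-separated fields
demo_file <- tempfile(fileext = ".txt")
writeLines(c("Control;Treatment",
             "1,5;3,25",
             "2,0;0,75"),
           demo_file)

# dec = "," tells R to interpret "1,5" as the number 1.5
comma_data <- read.table(demo_file, header = TRUE, sep = ";", dec = ",")
print(comma_data)

# read.csv2() uses sep = ";" and dec = "," by default
same_data <- read.csv2(demo_file)
print(same_data)
```

Without dec = ",", R would read these columns as character strings rather than numbers.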