Figure 19-1: A model of expected sales and revenue.
Chapter 19
Ten Things You Can Do in R That You Would’ve Done in Microsoft Excel
In This Chapter
Performing calculations and other operations on data
Going beyond what you can do in a spreadsheet
The spreadsheet is probably one of the most widely used PC applications — and for good reason: Spreadsheets make it very easy to perform calculations and other operations on tabular data. But spreadsheets pose some risks as well: They’re easy to corrupt and very difficult to debug.
The good news is, you can use R to do many of the same things you used to do in spreadsheets. In R, you use data frames to represent tabular data. R has many functions, operators, and methods that allow you to manipulate and calculate on data frames. This means that you can do just about anything (and more) in R that you would’ve done in Microsoft Excel, LibreOffice Calc, or your favorite spreadsheet application.
In this chapter, we offer some pointers on functions that you can explore in R, most of which are covered earlier in this book. In most cases, we provide some sample code but not the results. Try these examples yourself, and remember to use the R Help documentation to find out more about these functions.
Adding Row and Column Totals
One task that you may frequently do in a spreadsheet is calculating row or column totals. The easiest way to do this is to use the functions rowSums()
and colSums()
. Similarly, use the functions rowMeans()
and colMeans()
to calculate means.
Try it on the built-in dataset iris
. First, remove the fifth column, because it contains text that describes the species of iris:
> iris.num <- iris[, -5]
Then calculate the sum and mean for each column:
> colSums(iris.num)
> colMeans(iris.num)
These two functions are very convenient, but you may want to calculate some other statistic for each column or row. There’s an easy way of traversing rows or columns of an array or data frame: the apply()
function (see Chapter 13). For example, getting the minimum of a column is the same as applying a min()
function to the second dimension of your data:
> apply(iris.num, 2, min)
> apply(iris.num, 2, max)
> sapply(iris.num, min)
> sapply(iris.num, max)
Formatting Numbers
When you produce reports, you want your numbers to appear all nicely formatted. For example, you may want to align numbers on the decimal points or specify the width of the column. Or you may want to print your number with a currency symbol ($100.00) or append a percentage sign to it (35.7%).
You can use format()
to turn your numbers into pretty text, ready for printing. This function takes a number of arguments to control the format of your result. Here are a few:
trim
: A logical value. If FALSE
, it adds spaces to right-justify the result. If TRUE
, it suppresses the leading spaces.
digits
: How many significant digits of numeric values to show.
nsmall
: The minimum number of digits after the decimal point.
In addition, you can control the format of the decimal point with decimal.mark
, the mark between intervals before the decimal point with big.mark
, as well as the mark between intervals after the decimal point with small.mark
.
For example, you can print the number 12345.6789 with a comma as decimal point, spaces as the big mark, and dots as the small mark:
> format(12345.6789, digits=9, decimal.mark=”,”,
+ big.mark=” “,small.mark=”.”, , small.interval=3)
[1] “12 345,678.9”
As a more practical example, to calculate the means of some columns in mtcars
and then print the results with two digits after the decimal point, use the following:
> x <- colMeans(mtcars[, 1:4])
> format(x, digits=2, nsmall=2)
mpg cyl disp hp
“ 20.09” “ 6.19” “230.72” “146.69”
Notice that the result is no longer a number but a text string. So, be careful when you use number formatting — this should be the last step in your reporting workflow.
If you’re familiar with programming in languages similar to C or C++, then you also may find the sprintf()
function useful, because sprintf()
is a wrapper around the C printf()
function. This wrapper allows you to paste your formatted number directly into a string.
Here’s an example of converting numbers into percentages:
> x <- seq(0.5, 0.55, 0.01)
> sprintf(“%.1f %%”, 100*x)
[1] “50.0 %” “51.0 %” “52.0 %” “53.0 %” “54.0 %” “55.0 %”
This bit of magic should be familiar to C programmers, but for the rest of us, this is what it does: The first argument to sprintf()
indicates the format — in this case, “%.1f %%”
. The format argument uses special literals that indicate that the function should replace this literal with a variable and apply some formatting. The literals always start with the %
symbol. So, in this case, %.1f
means to format the first supplied value as a fixed point value with one digit after the decimal point, and %%
is a literal that means print a %
.
To format some numbers as currency — in this case, U.S. dollars — use the following:
> set.seed(1)
> x <- 1000*runif(5)
> sprintf(“$ %3.2f”, x)
[1] “$ 265.51” “$ 372.12” “$ 572.85” “$ 908.21” “$ 201.68”
As you saw earlier, the literal %3.2f
means to format the value as a fixed point value with three digits before the decimal and two digits after the decimal.
The sprintf()
function is a lot more powerful than that: It gives you an alternative way of pasting the value of any variable into a string:
> stuff <- c(“bread”, “cookies”)
> price <- c(2.1, 4)
> sprintf(“%s costed $ %3.2f “, stuff, price)
[1] “bread costed $ 2.10 “ “cookies costed $ 4.00 “
What happens here is that, because you supplied two vectors (each with two elements) to sprintf()
, your result is a vector with two elements. R cycles through the elements and places them into the sprintf()
literals. Thus, %s
(indicating format the value as a string) gets the value “bread”
the first time and “cookies”
the second time.
You can do everything with paste()
and format()
that you can do with sprintf()
, so you don’t really ever need to use it. But when you do, it can simplify your code.
Sorting Data
To sort data in R, you use the sort()
or order()
functions (see Chapter 13).
To sort the data frame mtcars
in increasing or decreasing order of the column hp
, use the following:
> with(mtcars, mtcars[order(hp), ])
> with(mtcars, mtcars[order(hp, decreasing=TRUE), ])
Making Choices with If
Spreadsheets give you the ability to perform all kinds of “What if?” analyses. One way of doing this is to use the if()
function in a spreadsheet.
R also has the if()
function, but it’s mostly used for flow control in your scripts. Because you typically want to perform a calculation on an entire vector in R, it’s usually more appropriate to use the ifelse()
function (see Chapters 9 and 13).
Here’s an example of using ifelse()
to identify cars with high fuel efficiency in the dataset mtcars
:
> mtcars <- within(mtcars,
+ mpgClass <- ifelse(mpg < mean(mpg), “Low”, “High”))
> mtcars[mtcars$mpgClass == “High”, ]
Calculating Conditional Totals
Something else that you probably did a lot in Excel is calculating conditional sums and counts with the functions sumif()
and countif()
.
You can do the same thing in one of two ways in R:
Use ifelse()
(see the preceding section).
Simply calculate a summary statistics on a subset of your data.
Say you want to calculate a conditional mean of fuel efficiency in mtcars
. You do this with the mean()
function. Now, to get the fuel efficiency for cars either side of a threshold of 150 horsepower, try the following:
> with(mtcars, mean(mpg))
[1] 20.09062
> with(mtcars, mean(mpg[hp < 150]))
[1] 24.22353
> with(mtcars, mean(mpg[hp >= 150]))
[1] 15.40667
Counting the number of elements in a vector is identical to it asking about its length. This means that the Excel function countif()
has an R equivalent in length()
:
> with(mtcars, length(mpg[hp > 150]))
[1] 13
Transposing Columns or Rows
Sometimes you need to transpose your data from rows to columns or vice versa. In R, the function to transpose a matrix is t()
:
> x <- matrix(1:12, ncol=3)
> x
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
To get the transpose of a matrix, use t()
:
> t(x)
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
See how the transposition of cars is a character array:
> t(mtcars[1:4, ])
Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
mpg “21.0” “21.0” “22.8” “21.4”
cyl “6” “6” “4” “6”
disp “160” “160” “108” “258”
hp “110” “110” “ 93” “110”
drat “3.90” “3.90” “3.85” “3.08”
wt “2.620” “2.875” “2.320” “3.215”
qsec “16.46” “17.02” “18.61” “19.44”
vs “0” “0” “1” “1”
am “1” “1” “1” “0”
gear “4” “4” “4” “3”
carb “4” “4” “1” “1”
mpgClass “High” “High” “High” “High”
Finding Unique or Duplicated Values
To identify all the unique values in your data, use the unique()
function. Try finding the unique values of number of cylinders in mtcars
:
> unique(mtcars$cyl)
[1] 6 4 8
Sometimes you want to know which values of your data are duplicates. Depending on your situation, those duplicates will be valid, but sometimes duplicate entries may indicate data-entry problems.
The function to identify duplicate entries is duplicated()
. In the built in dataset iris
, there is a duplicated row in line 143. Try it yourself:
> dupes <- duplicated(iris)
> head(dupes)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
> which(dupes)
[1] 143
> iris[dupes, ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
143 5.8 2.7 5.1 1.9 virginica
Because the result of duplicated()
is a logical vector, you can use it as an index to remove rows from your data. To do this, use the negation operator — the exclamation point (as in !dupes
):
> iris[!dupes, ]
> nrow(iris[!dupes, ])
[1] 149
Working with Lookup Tables
In a spreadsheet application like Excel, you can create lookup tables with the functions vlookup
or a combination of index
and match
.
In R, it may be convenient to use merge()
(see Chapter 13) or match()
. The match()
function returns a vector with the positions of elements that match your lookup value.
For example, to find the location of the element “Toyota Corolla”
in the row names of mtcars
, try the following:
> index <- match(“Toyota Corolla”, rownames(mtcars))
> index
[1] 20
> mtcars[index, 1:4]
mpg cyl disp hp
Toyota Corolla 33.9 4 71.1 65
You can see that the index position is 20, and that the 20th row is indeed the row you’re looking for.
Working with Pivot Tables
In Excel, pivot tables are a useful tool for manipulating and analyzing data.
For simple tables in R, you can use the tapply()
function to achieve similar results. Here’s an example of using tapply()
to calculate mean hp
for cars with different numbers of cylinders and gears:
> with(mtcars, tapply(hp, list(cyl, gear), mean))
3 4 5
4 97.0000 76.0 102.0
6 107.5000 116.5 175.0
8 194.1667 NA 299.5
For slightly more complex tables — that is, tables with more than two cross-classifying factors — use the aggregate()
function:
> aggregate(hp~cyl+gear+am, mtcars, mean)
cyl gear am hp
1 4 3 0 97.00000
2 6 3 0 107.50000
3 8 3 0 194.16667
4 4 4 0 78.50000
5 6 4 0 123.00000
6 4 4 1 75.16667
7 6 4 1 110.00000
8 4 5 1 102.00000
9 6 5 1 175.00000
10 8 5 1 299.50000
Using the Goal Seek and Solver
One very powerful feature of Excel is that it has a very easy-to-use solver that allows you to find minimum or maximum values for functions given some constraints.
A very large body of mathematics aims to solve optimization problems of all kinds. In R, the optimize()
function provides one fairly simple mechanism for optimizing functions.
Imagine you’re the sales director of a company and you need to set the best price for your product. In other words, find the price of a product that maximizes revenue.
In economics, a simple model of pricing states that people buy less of a given product when the price increases. Here’s a very simple function that has this behavior:
> sales <- function(price) { 100 - 0.5 * price }
Expected revenue is then simply the product of price and expected sales:
> revenue <- function(price) { price * sales(price) }
You can use the curve()
function to plot continuous functions. This takes a function as input and produces a plot. Try to plot the behavior of sales and revenue using the curve()
function, varying price from $50 to $150:
> par(mfrow=c(1, 2))
> curve(sales, from=50, to=150, xname=”price”, ylab=”Sales”, main=”Sales”)
> curve(revenue, from=50, to=150, xname=”price”, ylab=”Revenue”, main=”Revenue”)
> par(mfrow=c(1, 1))
Your results should look similar to Figure 19-1.
Figure 19-1: A model of expected sales and revenue.
You have a working model of sales and revenue. From the figure, you can see immediately that there is a point of maximum revenue. Next, use the R function optimize()
to find the value of that maximum. To use optimize()
, you need to tell it which function to use (in this case, revenue()
), as well as the interval (in this case, prices between 50 and 150). By default, optimize()
searches for a minimum value, so in this case you have to tell it to search for maximum value:
> optimize(revenue, interval=c(50, 150), maximum=TRUE)
$maximum
[1] 100
$objective
[1] 5000
And there you go. Charge a price of $100, and expect to get $5,000 in revenue.