Chapter 19

Ten Things You Can Do in R That You Would’ve Done in Microsoft Excel

In This Chapter

arrow Performing calculations and other operations on data

arrow Going beyond what you can do in a spreadsheet

The spreadsheet is probably one of the most widely used PC applications — and for good reason: Spreadsheets make it very easy to perform calculations and other operations on tabular data. But spreadsheets pose some risks as well: They’re easy to corrupt and very difficult to debug.

The good news is, you can use R to do many of the same things you used to do in spreadsheets. In R, you use data frames to represent tabular data. R has many functions, operators, and methods that allow you to manipulate and calculate on data frames. This means that you can do just about anything (and more) in R that you would’ve done in Microsoft Excel, LibreOffice Calc, or your favorite spreadsheet application.

In this chapter, we offer some pointers on functions that you can explore in R, most of which are covered earlier in this book. In most cases, we provide some sample code but not the results. Try these examples yourself, and remember to use the R Help documentation to find out more about these functions.

Adding Row and Column Totals

One task that you may frequently do in a spreadsheet is calculating row or column totals. The easiest way to do this is to use the functions rowSums() and colSums(). Similarly, use the functions rowMeans() and colMeans() to calculate means.

Try it on the built-in dataset iris. First, remove the fifth column, because it contains text that describes the species of iris:

> iris.num <- iris[, -5]

Then calculate the sum and mean for each column:

> colSums(iris.num)

> colMeans(iris.num)

These two functions are very convenient, but you may want to calculate some other statistic for each column or row. There’s an easy way of traversing rows or columns of an array or data frame: the apply() function (see Chapter 13). For example, getting the minimum of a column is the same as applying a min() function to the second dimension of your data:

> apply(iris.num, 2, min)

> apply(iris.num, 2, max)

tip.eps The apply() function is ideal when your data is in an array and will apply happily over both rows and columns. For the special case where your data is in a data frame and you want to get column summaries, you’re better off using sapply() rather than apply(). So, to get your iris column summaries, try this instead:

> sapply(iris.num, min)

> sapply(iris.num, max)

Formatting Numbers

When you produce reports, you want your numbers to appear all nicely formatted. For example, you may want to align numbers on the decimal points or specify the width of the column. Or you may want to print your number with a currency symbol ($100.00) or append a percentage sign to it (35.7%).

You can use format() to turn your numbers into pretty text, ready for printing. This function takes a number of arguments to control the format of your result. Here are a few:

check.png trim: A logical value. If FALSE, it adds spaces to right-justify the result. If TRUE, it suppresses the leading spaces.

check.png digits: How many significant digits of numeric values to show.

check.png nsmall: The minimum number of digits after the decimal point.

In addition, you can control the format of the decimal point with decimal.mark, the mark between intervals before the decimal point with big.mark, as well as the mark between intervals after the decimal point with small.mark.

For example, you can print the number 12345.6789 with a comma as decimal point, spaces as the big mark, and dots as the small mark:

> format(12345.6789, digits=9, decimal.mark=”,”,

+     big.mark=” “,small.mark=”.”, , small.interval=3)

[1] “12 345,678.9”

As a more practical example, to calculate the means of some columns in mtcars and then print the results with two digits after the decimal point, use the following:

> x <- colMeans(mtcars[, 1:4])

> format(x, digits=2, nsmall=2)

     mpg      cyl     disp       hp

“ 20.09” “  6.19” “230.72” “146.69”

Notice that the result is no longer a number but a text string. So, be careful when you use number formatting — this should be the last step in your reporting workflow.

If you’re familiar with programming in languages similar to C or C++, then you also may find the sprintf() function useful, because sprintf() is a wrapper around the C printf() function. This wrapper allows you to paste your formatted number directly into a string.

Here’s an example of converting numbers into percentages:

> x <- seq(0.5, 0.55, 0.01)

> sprintf(“%.1f %%”, 100*x)

[1] “50.0 %” “51.0 %” “52.0 %” “53.0 %” “54.0 %” “55.0 %”

This bit of magic should be familiar to C programmers, but for the rest of us, this is what it does: The first argument to sprintf() indicates the format — in this case, “%.1f %%”. The format argument uses special literals that indicate that the function should replace this literal with a variable and apply some formatting. The literals always start with the % symbol. So, in this case, %.1f means to format the first supplied value as a fixed point value with one digit after the decimal point, and %% is a literal that means print a %.

To format some numbers as currency — in this case, U.S. dollars — use the following:

> set.seed(1)

> x <- 1000*runif(5)

> sprintf(“$ %3.2f”, x)

[1] “$ 265.51” “$ 372.12” “$ 572.85” “$ 908.21” “$ 201.68”

As you saw earlier, the literal %3.2f means to format the value as a fixed point value with three digits before the decimal and two digits after the decimal.

The sprintf() function is a lot more powerful than that: It gives you an alternative way of pasting the value of any variable into a string:

> stuff <- c(“bread”, “cookies”)

> price <- c(2.1, 4)

> sprintf(“%s costed $ %3.2f “, stuff, price)

[1] “bread costed $ 2.10 “   “cookies costed $ 4.00 “

What happens here is that, because you supplied two vectors (each with two elements) to sprintf(), your result is a vector with two elements. R cycles through the elements and places them into the sprintf() literals. Thus, %s (indicating format the value as a string) gets the value “bread” the first time and “cookies” the second time.

You can do everything with paste() and format() that you can do with sprintf(), so you don’t really ever need to use it. But when you do, it can simplify your code.

Sorting Data

To sort data in R, you use the sort() or order() functions (see Chapter 13).

To sort the data frame mtcars in increasing or decreasing order of the column hp, use the following:

> with(mtcars, mtcars[order(hp), ])

> with(mtcars, mtcars[order(hp, decreasing=TRUE), ])

Making Choices with If

Spreadsheets give you the ability to perform all kinds of “What if?” analyses. One way of doing this is to use the if() function in a spreadsheet.

R also has the if() function, but it’s mostly used for flow control in your scripts. Because you typically want to perform a calculation on an entire vector in R, it’s usually more appropriate to use the ifelse() function (see Chapters 9 and 13).

Here’s an example of using ifelse() to identify cars with high fuel efficiency in the dataset mtcars:

> mtcars <- within(mtcars,

+     mpgClass <- ifelse(mpg < mean(mpg), “Low”, “High”))

> mtcars[mtcars$mpgClass == “High”, ]

Calculating Conditional Totals

Something else that you probably did a lot in Excel is calculating conditional sums and counts with the functions sumif() and countif().

You can do the same thing in one of two ways in R:

check.png Use ifelse() (see the preceding section).

check.png Simply calculate a summary statistics on a subset of your data.

Say you want to calculate a conditional mean of fuel efficiency in mtcars. You do this with the mean() function. Now, to get the fuel efficiency for cars either side of a threshold of 150 horsepower, try the following:

> with(mtcars, mean(mpg))

[1] 20.09062

> with(mtcars, mean(mpg[hp < 150]))

[1] 24.22353

> with(mtcars, mean(mpg[hp >= 150]))

[1] 15.40667

Counting the number of elements in a vector is identical to it asking about its length. This means that the Excel function countif() has an R equivalent in length():

> with(mtcars, length(mpg[hp > 150]))

[1] 13

Transposing Columns or Rows

Sometimes you need to transpose your data from rows to columns or vice versa. In R, the function to transpose a matrix is t():

> x <- matrix(1:12, ncol=3)

> x

     [,1] [,2] [,3]

[1,]    1    5    9

[2,]    2    6   10

[3,]    3    7   11

[4,]    4    8   12

To get the transpose of a matrix, use t():

> t(x)

     [,1] [,2] [,3] [,4]

[1,]    1    2    3    4

[2,]    5    6    7    8

[3,]    9   10   11   12

warning_bomb.eps You also can use t() to transpose data frames, but be careful when you do this. The result of a transposition is always a matrix (or array). Because arrays always have only one type of variable, such as numeric or character, the variable types of your results may not be what you expect.

See how the transposition of cars is a character array:

> t(mtcars[1:4, ])

         Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive

mpg      “21.0”    “21.0”        “22.8”     “21.4”

cyl      “6”       “6”           “4”        “6”

disp     “160”     “160”         “108”      “258”

hp       “110”     “110”         “ 93”      “110”

drat     “3.90”    “3.90”        “3.85”     “3.08”

wt       “2.620”   “2.875”       “2.320”    “3.215”

qsec     “16.46”   “17.02”       “18.61”    “19.44”

vs       “0”       “0”           “1”        “1”

am       “1”       “1”           “1”        “0”

gear     “4”       “4”           “4”        “3”

carb     “4”       “4”           “1”        “1”

mpgClass “High”    “High”        “High”     “High”

Finding Unique or Duplicated Values

To identify all the unique values in your data, use the unique() function. Try finding the unique values of number of cylinders in mtcars:

> unique(mtcars$cyl)

[1] 6 4 8

Sometimes you want to know which values of your data are duplicates. Depending on your situation, those duplicates will be valid, but sometimes duplicate entries may indicate data-entry problems.

The function to identify duplicate entries is duplicated(). In the built in dataset iris, there is a duplicated row in line 143. Try it yourself:

> dupes <- duplicated(iris)

> head(dupes)

[1] FALSE FALSE FALSE FALSE FALSE FALSE

> which(dupes)

[1] 143

> iris[dupes, ]

    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species

143          5.8         2.7          5.1         1.9 virginica

Because the result of duplicated() is a logical vector, you can use it as an index to remove rows from your data. To do this, use the negation operator — the exclamation point (as in !dupes):

> iris[!dupes, ]

> nrow(iris[!dupes, ])

[1] 149

Working with Lookup Tables

In a spreadsheet application like Excel, you can create lookup tables with the functions vlookup or a combination of index and match.

In R, it may be convenient to use merge() (see Chapter 13) or match(). The match() function returns a vector with the positions of elements that match your lookup value.

For example, to find the location of the element “Toyota Corolla” in the row names of mtcars, try the following:

> index <- match(“Toyota Corolla”, rownames(mtcars))

> index

[1] 20

> mtcars[index, 1:4]

                mpg cyl disp hp

Toyota Corolla 33.9   4 71.1 65

You can see that the index position is 20, and that the 20th row is indeed the row you’re looking for.

Working with Pivot Tables

In Excel, pivot tables are a useful tool for manipulating and analyzing data.

For simple tables in R, you can use the tapply() function to achieve similar results. Here’s an example of using tapply() to calculate mean hp for cars with different numbers of cylinders and gears:

> with(mtcars, tapply(hp, list(cyl, gear), mean))

         3     4     5

4  97.0000  76.0 102.0

6 107.5000 116.5 175.0

8 194.1667    NA 299.5

For slightly more complex tables — that is, tables with more than two cross-classifying factors — use the aggregate() function:

> aggregate(hp~cyl+gear+am, mtcars, mean)

   cyl gear am        hp

1    4    3  0  97.00000

2    6    3  0 107.50000

3    8    3  0 194.16667

4    4    4  0  78.50000

5    6    4  0 123.00000

6    4    4  1  75.16667

7    6    4  1 110.00000

8    4    5  1 102.00000

9    6    5  1 175.00000

10   8    5  1 299.50000

tip.eps If you frequently work with tables in Excel, you should definitely explore the packages plyr and reshape2 that are available on CRAN at http://cran.r-project.org/web/packages/plyr/ and http://cran.r-project.org/web/packages/reshape2/, respectively. These packages provide a number of functions for common data manipulation problems.

Using the Goal Seek and Solver

One very powerful feature of Excel is that it has a very easy-to-use solver that allows you to find minimum or maximum values for functions given some constraints.

A very large body of mathematics aims to solve optimization problems of all kinds. In R, the optimize() function provides one fairly simple mechanism for optimizing functions.

Imagine you’re the sales director of a company and you need to set the best price for your product. In other words, find the price of a product that maximizes revenue.

In economics, a simple model of pricing states that people buy less of a given product when the price increases. Here’s a very simple function that has this behavior:

> sales <- function(price) { 100 - 0.5 * price }

Expected revenue is then simply the product of price and expected sales:

> revenue <- function(price) { price * sales(price) }

You can use the curve() function to plot continuous functions. This takes a function as input and produces a plot. Try to plot the behavior of sales and revenue using the curve() function, varying price from $50 to $150:

> par(mfrow=c(1, 2))

> curve(sales, from=50, to=150, xname=”price”, ylab=”Sales”, main=”Sales”)

> curve(revenue, from=50, to=150, xname=”price”, ylab=”Revenue”, main=”Revenue”)

> par(mfrow=c(1, 1))

Your results should look similar to Figure 19-1.

Figure 19-1: A model of expected sales and revenue.

9781119963134-fg1901.tif

You have a working model of sales and revenue. From the figure, you can see immediately that there is a point of maximum revenue. Next, use the R function optimize() to find the value of that maximum. To use optimize(), you need to tell it which function to use (in this case, revenue()), as well as the interval (in this case, prices between 50 and 150). By default, optimize() searches for a minimum value, so in this case you have to tell it to search for maximum value:

> optimize(revenue, interval=c(50, 150), maximum=TRUE)

$maximum

[1] 100

$objective

[1] 5000

And there you go. Charge a price of $100, and expect to get $5,000 in revenue.

technicalstuff.eps The Excel Solver uses the Generalized Reduced Gradient Algorithm for optimizing nonlinear problems (http://support.microsoft.com/kb/214115). The R function optimize() uses a combination of golden section search and successive parabolic interpolation, which clearly is not the same thing as the Excel Solver. Fortunately, a large number of packages provide various different algorithms for solving optimization problems. In fact, there is a special task view on CRAN for optimization and mathematical programming. Go to http://cran.r-project.org/web/views/Optimization.html to find out more than you ever thought you wanted to know!

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset