Chapter 1
Introducing R: The Big Picture
In This Chapter
Discovering the benefits of R
Identifying some programming concepts that make R special
With an estimated worldwide user base of more than 2 million people, the R language has rapidly grown and extended since its origin as an academic demonstration language in the 1990s.
Some people would argue — and we think they’re right — that R is much more than a statistical programming language. It’s also:
A very powerful tool for all kinds of data processing and manipulation
A community of programmers, users, academics, and practitioners
A tool that makes all kinds of publication-quality graphics and data visualizations
A collection of freely distributed add-on packages
A toolbox with tremendous versatility
In this chapter, we fill you in on the benefits of R, as well as its unique features and quirks.
Recognizing the Benefits of Using R
Of the many attractive benefits of R, a few stand out: It’s actively maintained, it has good connectivity to various types of data and other systems, and it’s versatile enough to solve problems in many domains. Possibly best of all, it’s available for free, in more than one sense of the word.
It comes as free, open-source code
R is available under an open-source license, which means that anyone can download and modify the code. This freedom is often referred to as “free as in speech.” R is also free of charge, a second kind of freedom that is sometimes referred to as “free as in beer.” In practical terms, this means that you can download and use R without paying for it.
Another benefit, albeit slightly more indirect, is that anybody can access the source code, modify it, and improve it. As a result, many excellent programmers have contributed improvements and fixes to the R code. For this reason, R is very stable and reliable.
It runs anywhere
The R Development Core Team has put a lot of effort into making R available for different types of hardware and software. This means that R is available for Windows, Unix systems (such as Linux), and the Mac.
It supports extensions
R itself is a powerful language that performs a wide variety of functions, such as data manipulation, statistical modeling, and graphics. One really big advantage of R, however, is its extensibility. Developers can easily write their own software and distribute it in the form of add-on packages. Because of the relative ease of creating these packages, literally thousands of them exist. In fact, many new (and not-so-new) statistical methods are published with an R package attached.
It provides an engaged community
The R user base keeps growing. Many people who use R eventually start helping new users and advocating the use of R in their workplaces and professional circles. Sometimes they also become active on the R mailing lists (www.r-project.org/mail.html) or question-and-answer (Q&A) websites such as Stack Overflow, a programming Q&A website (www.stackoverflow.com/questions/tagged/r) and CrossValidated, a statistics Q&A website (http://stats.stackexchange.com/questions/tagged/r). In addition to these mailing lists and Q&A websites, R users participate in social networks such as Twitter (www.twitter.com/search/rstats) and regional R conferences. (See Chapter 11 for more information on R communities.)
It connects with other languages
As more and more people moved to R for their analyses, they started trying to combine R with their previous workflows, which led to a whole set of packages for linking R to file systems, databases, and other applications. Many of these packages have since been incorporated into the base installation of R.
For example, the R package foreign (http://cran.r-project.org/web/packages/foreign/index.html) is part of the standard R distribution and enables you to read data from the statistical packages SPSS, SAS, Stata, and others (see Chapter 12).
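As a minimal sketch of what working with foreign can look like, the following round trip writes a small data frame to a Stata file and reads it back. The data frame scores and the temporary file are made up for this illustration, and the sketch assumes that the foreign package is present in your installation, as it is in a standard R distribution.

```r
library(foreign)  # assumed to be available, as in a standard R distribution

# A small made-up data frame, used only for this illustration
scores <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# Write the data to a temporary Stata file, then read it back into R
stata_file <- tempfile(fileext = ".dta")
write.dta(scores, stata_file)
restored <- read.dta(stata_file)

# Show the data as R sees it after the round trip
print(restored)
```

In real use, you would typically skip the writing step and point read.dta() (or a sibling function such as read.spss()) at a file that someone exported from the other statistical package.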
Several add-on packages exist to connect R to database systems, such as the RODBC package, to read from databases using the Open Database Connectivity protocol (ODBC) (http://cran.r-project.org/web/packages/RODBC/index.html), and the ROracle package, to read Oracle databases (http://cran.r-project.org/web/packages/ROracle/index.html).
Because many statisticians also worked with commercial programs, the R Development Core Team (and others) wrote tools to read data from those programs, including SAS Institute’s SAS and IBM’s SPSS. By now, many of the big commercial packages have add-ons to connect with R. Notably, SPSS has incorporated a link to R for its users, and SAS has numerous protocols that show you how to move data and graphics between the two packages.
Looking At Some of the Unique Features of R
R is more than just a domain-specific programming language aimed at statisticians. It has some unique features that make it very powerful, including the notion of vectors, which means that you can make calculations on many values at the same time.
Performing multiple calculations with vectors
R is a vector-based language. You can think of a vector as a row or column of numbers or text. The list of numbers {1,2,3,4,5}, for example, could be a vector. Unlike most other programming languages, R allows you to apply functions to the whole vector in a single operation without the need for an explicit loop.
We’ll illustrate with some real R code. First, we’ll assign the values 1:5 to a vector that we’ll call x:
> x <- 1:5
> x
[1] 1 2 3 4 5
Next, we’ll add the value 2 to each element in the vector x and print the result:
> x + 2
[1] 3 4 5 6 7
You can also add one vector to another. To add the values 6:10 element-wise to x, you do the following:
> x + 6:10
[1]  7  9 11 13 15
In most other programming languages, doing this would require an explicit loop to run through each value of x.
This feature of R is extremely powerful because it lets you perform many operations in a single step.
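To make the contrast concrete, here is a small sketch that performs the same element-wise addition of x and 6:10 twice: once with an explicit loop written in R itself, and once with the vectorized expression. The names y, loop_sum, and vec_sum are introduced only for this illustration.

```r
# The two vectors from the example above
x <- 1:5
y <- 6:10

# Explicit loop: visit each position and add one pair of elements at a time
loop_sum <- integer(length(x))
for (i in seq_along(x)) {
  loop_sum[i] <- x[i] + y[i]
}

# Vectorized form: a single expression handles every element
vec_sum <- x + y

# Check that both approaches produce exactly the same vector
print(identical(loop_sum, vec_sum))
```

The loop needs a preallocated result vector and an index variable, whereas the vectorized form states only what you want to compute.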
We introduce the concept of vectors in Chapter 2 and expand on vectors and vectorization in much more depth in Chapter 4.
Processing more than just statistics
R was developed by statisticians to make statistical processing easier. This heritage continues, making R a very powerful tool for performing virtually any statistical computation.
As R expanded beyond its origins in statistics, many people who would describe themselves as programmers rather than statisticians became involved with it. The result is that R is now eminently suitable for a wide variety of nonstatistical tasks, including data processing, graphic visualization, and analysis of all sorts. R is being used in the fields of finance, natural language processing, genetics, biology, and market research, to name just a few.
In this book, we assume that you want to find out about R programming, not statistics, although we provide an introduction to statistics in R in Part IV.
Running code without a compiler
R is an interpreted language, which means that, unlike compiled languages such as C and Java, you don’t need a compiler to create a program from your code before you can use it. R interprets the code you provide directly and converts it into lower-level calls to precompiled functions.
In practice, this means that you simply write your code and send it to R, and the code runs, which makes the development cycle easy. This ease of development comes at a cost, however: Interpreted code usually runs slower than compiled code.
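The following sketch shows this write-and-run cycle on a small scale. The function add_two is a hypothetical name used only for this illustration. The last line checks that sum() is a primitive, which is one way to see that some built-in functions are implemented in precompiled code rather than in R.

```r
# Define a function; no separate compile step is needed
# (add_two is a hypothetical name used only for this illustration)
add_two <- function(v) {
  v + 2
}

# Call it straight away; R interprets the call as soon as it receives it
result <- add_two(1:5)
print(result)

# Built-in functions such as sum() are primitives backed by compiled code
print(is.primitive(sum))
```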