Hour 1. The R Community


What You’ll Learn in This Hour:

A brief history of S and R

An overview of the R community

The development and release of R versions


In this hour we start by looking at how R evolved from the S language to become the all-purpose data science programming language that it is today. It is important when learning any programming language to understand a little about where it came from and why it functions as it does. This is particularly relevant for R because many of the quirkier aspects of the language have roots in S.

As a free and open-source programming language, R relies strongly on community input. The R community offers a plethora of help and support options for users. We look at some of the better-known options during this hour. Toward the end of the hour we look a little closer at the development and release of R versions.

A Concise History of R

When I first started teaching introductory R courses, I would ask how many people in the room had any experience with S. This was an important question for an R training course because the languages are syntactically similar. If you know S, then what are you doing in an Introduction to R course?! A couple of years ago, the number of raised hands had dropped significantly, and I revised this question to ask, “How many people here have actually heard of S?” Today, very few people have. But to begin to understand R, it helps to know just where it came from, and that means knowing what S is and how it came to be.

The Birth of S

S was initially developed at AT&T Bell Laboratories by John Chambers in the mid-to-late 1970s, a time that predates Google and the ability to search online for help with your programming language! John Chambers’ original idea is beautifully portrayed in the now-famous sketch from 1976, shown in Figure 1.1. The essence of Chambers’ idea was that his then-unnamed language would provide an accessible interface to lower-level Fortran subroutines, thereby reducing the time a statistician would have to spend coding. Today, languages such as R, SAS, Matlab, and Python all take a similar approach, but at the time this idea was fairly ground-breaking.

FIGURE 1.1 John Chambers’ sketch of the idea that became S

The name “S” stands for “Statistics.” It was chosen over other names primarily for consistency (the C language was also born out of Bell Laboratories a few years earlier) and because pretty much every name proposed began with the letter S. One name in particular, SAS (Statistical Analysis System), had already been taken.

The S language continued to grow and evolve with several key changes that shaped both the S language and eventually R today. These included a gradual transition toward C for internal routines, a switch from macros to functions, and the introduction of the “S3” and then the “S4” class systems, which are described in Hour 21, “Writing R Classes,” and Hour 22, “Formal Class Systems.”

A particularly important milestone in the life of S was the development and release of the first version of S-PLUS by Statistical Sciences, Inc., in 1988. In the next few years, Statistical Sciences built a new graphical user interface for S and added interactive graphical capabilities by integrating the GUI with their Axum product. They also added connectors to a number of Microsoft products, such as Excel and PowerPoint. However, perhaps most significant of all was that in 1993 Statistical Sciences acquired the exclusive license to market and distribute the S language, closing off the development of S to outsiders. TIBCO acquired the then-owners of S-PLUS, Insightful, in 2008. However, to date, no new versions of S-PLUS have been released since the acquisition, with TIBCO turning their attentions toward R and becoming a founding member of the R Consortium in 2015.

The Birth of R

Earlier in this hour we said that S and R were “syntactically similar.” The main R Project website, www.r-project.org, does not shy away from the relationship with S, describing R as “similar to the S language and environment” and claiming that “much code written for S runs unaltered under R.” It does not go as far as saying that R is a copy or reimplementation of S, but R is widely considered to have evolved from S. The near-identical syntax is no coincidence! The first version of the R language was developed by Robert Gentleman and Ross Ihaka of the University of Auckland in the early-to-mid 1990s. The name “R” is a play on the names Robert and Ross, though the significance of the position of the letter R next to S in the alphabet should not be downplayed.

Robert and Ross were soon joined by a core group of contributors known as the “R Development Core Team,” which is today responsible for the development and release of new R versions. Following the release of R-1.5.0, the core members created “The R Foundation,” which, among other things, is responsible for copyright and documentation of R. The R Foundation now contains many of the original S development team, including John Chambers.

R has undergone many iterations of its own since the early days, with new releases approximately every 3 months. However, much of the functionality, particularly the core statistical routines, resembles the S language of old.

The R Community

Before we install R and begin programming, we would like to highlight some of the available online resources for R. Indeed, there are many online resources, almost all of which can be accessed via the main R project website (see Figure 1.2). From here you can download the latest copy of R, download R packages, find help on R, join several R mailing lists, search for R books such as this, and find events.

FIGURE 1.2 The main page of the R Project website, www.r-project.org

A big difference between the open-source R language and commercially supported software such as SAS and SPSS is the large and active online community that has built up around R. Like many open-source communities, the R community is a weird yet wonderful beast that takes some getting used to! However, one of the goals of a group formed in 2015, known as the R Consortium, is to try to make R more accessible for newcomers to the language.

Mailing Lists

Several mailing lists are dedicated to R, each listed on the R Project website. The first port of call for most new users is the R-help mailing list. My advice to any newcomer is to use the searchable archives on the R Project website (and read the posting guide) before posting any help requests to the community because chances are someone else has had the same issue before. If you do use R-help, the first thing you will notice is the speed at which users rush to help you out; night and day the community is waiting to embrace your R challenge. On the flip side, do beware of making critical remarks about the behavior of a function or quality of the documentation. The chances are the author is reading your post with no sales or marketing team sitting next to them telling them to be kind!

R Manuals

A typical response to an R-help request used to be “read the manuals.” Like the language itself, the R manuals, of which there are several, have their roots in S. If you do decide to consult them for help, we can promise you that the information you’re looking for will be there. In particular, the “Writing R Extensions” manual is a very handy reference for those wanting to develop and deploy R packages for mass consumption. However, unless you are already very familiar with general programming constructs such as object orientation, and are therefore ready to jump in at the deep end, you may find the manuals hard going. The R Core Team recognizes this, and the “An Introduction to R” manual contains a subsection within the preface titled, “Suggestions to the reader” where the advice for R novices is essentially to skip the first 80 pages and “start with the introductory session in Appendix A”!

Online Resources

Plenty of online resources are available, although they are not always easy to find for the R newcomer. I’ve been using R for nearly 15 years, yet when I type R and a space into Google, it still thinks I’m looking for R. Kelly! Generally, though, there is enough of a divide between the worlds of R&B and of statistical programming to make Googling for R help fairly straightforward. Besides Google, there are a number of other options for searching for R-based material, some of which are listed on the R Project website. In particular, Sasha Goodman of Stanford University has created Rseek (http://rseek.org/), which searches several known R-related sites.

If you wish to search the manuals for help, you can do so directly using a tool called R Documentation, http://www.rdocumentation.org, developed by DataCamp. R Documentation is a website that pulls together documentation from the main R repositories into a single location. The website also offers the ability to search the Comprehensive R Archive Network’s (CRAN’s) Task Views for packages of code. We will discuss CRAN and R packages in greater detail during Hour 2, “The R Environment.”

The R Consortium

On June 30, 2015, the Linux Foundation launched the R Consortium. The R Consortium consists primarily of data scientists from both industry and academia with the joint goal of trying to advance the R language and support the growth of the R community. The home page for the R Consortium is shown in Figure 1.3. Existing members of the R Foundation were joined by founding members Microsoft and RStudio (Platinum); TIBCO Software, Inc. (Gold); and Alteryx, Google, HP, Mango Solutions, Ketchum Trading, and Oracle (Silver).

FIGURE 1.3 The home page of the R Consortium, www.r-consortium.org

The R Consortium is still very much in its infancy, but it is anticipated that its formation will both improve the accessibility of the R language and oversee its next phase of growth. The R Consortium home page may soon replace the R Project home page as the go-to starting point for the R community.

User Events

Another great plus of the open-source community is the number of user events available to attend globally. New user groups are popping up all the time, and attendance numbers can vary from 5 to 500. Events are typically held in the evening, with participants giving up their own time to attend. Since the very early days of R, these user meetings have been a primary arena for R enthusiasts to meet and share ideas. Many of the more established meetings receive commercial backing.

In addition to the localized R meetings, the main “useR!” conference has been held regularly since 2004, with the number of attendees steadily increasing year over year. The conference is generally focused on developments in the R language and R packages. It is packed with presentations from academia and industry and is now backed by the R Consortium. In 2014, useR! was joined by the Effective Applications of the R Language (EARL) conference. The primary focus of the EARL conference is the commercial usage of R across a range of industry sectors with the aim of sharing knowledge and applications of the language.

In addition to the cross-sector R conferences, there are also industry-specific R conferences for those working in either the finance or insurance industry. These are, respectively, R/Finance, which has been held annually in Chicago since 2009, and R in Insurance, which has been running annually since 2013.

R Development

Today, the R Development Core Team still controls write access to the R source (though as an open-source GNU project, this source code is freely available to download for anyone who wants to see it). However, much of the popularity of the R language today can also be attributed to the many contributors outside of that group who have written one or more of several thousand R “packages,” freely available for download from the CRAN repository. CRAN is a network of FTP and web servers mirrored around the world, each of which contains versions of R and the contributed R packages.

The scope and quality of the R packages can vary greatly, but finding and using new R packages is an important part of the life of the modern R user. A proactive statistician or data scientist may have several hundred packages installed on his or her local machine for any particular version of R. R packages are explained in more detail in Hour 2.
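To get a feel for how many packages a given R installation already holds, something like the following can be run at the console. It is only a sketch that uses the base installed.packages() function; the variable names are our own, and the counts will differ from machine to machine.

```r
# List every package visible to this R installation.
ip <- installed.packages()          # character matrix, one row per package

# How many packages are installed in total?
cat("Installed packages:", nrow(ip), "\n")

# Peek at the first few package names and versions.
print(head(ip[, c("Package", "Version"), drop = FALSE], 5))

# Packages that ship with R itself are flagged with Priority "base".
base_pkgs <- rownames(ip)[ip[, "Priority"] %in% "base"]
cat("Base packages:", paste(base_pkgs, collapse = ", "), "\n")
```

On a fresh installation the total will be small; it grows as you install contributed packages from CRAN.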

Versions of R

The R Development Core Team decides when new versions of R are ready for general public release. Each release comes with a comprehensive description of additional features and fixes since the previous version. R versions follow the Major-Minor-Patch structure (for example, R-3.2.0). The first version of R, R-1.0.0, was released in February 2000, and a steady pattern of patch, minor, and very occasionally major releases has followed since then. In recent years the rate of release has slowed a little, with minor versions of R released approximately annually. Historically, each new minor release has had two to three associated patch versions.


Note: Nicknames

R version 2.15.1 was given the “nickname” Roasted Marshmallows by the R Development Core Team. Every subsequent R version has been given an interesting but apparently random nickname. This nickname is printed on start-up but can also be accessed by running the line R.Version()$nickname.
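The Major-Minor-Patch structure and the nickname can both be inspected from within R. The following is a minimal sketch using base functions only; the variable names (v, parts, newer, nick) are our own, and the values printed depend on the version of R you are running.

```r
# Break the running R version into its Major-Minor-Patch components.
v <- getRversion()                 # a version object, for example 3.2.0
parts <- unclass(v)[[1]]           # integer vector of length three
names(parts) <- c("major", "minor", "patch")
print(parts)

# R.version stores the same information as strings; note that its
# "minor" element holds both the minor and patch numbers (e.g. "2.0").
cat("major:", R.version$major, " minor.patch:", R.version$minor, "\n")

# Version objects compare numerically, component by component,
# so 3.10.0 is correctly treated as newer than 3.2.0.
newer <- package_version("3.10.0") > package_version("3.2.0")
print(newer)

# The release nickname mentioned in the note.
nick <- R.Version()$nickname
print(nick)
```

Comparing version objects rather than plain strings avoids the trap where "3.10.0" sorts before "3.2.0" alphabetically.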


If you have a background in software such as SAS or Microsoft Excel, you may wonder why R versions are released so frequently. There is often a concern that the high frequency of releases is a sign of instability and that R is very buggy. Actually the opposite is true; however, commercial organizations do tend to be cautious about both the R versions that they adopt and the frequency with which they adopt them. Often companies wait until the second or third patched version of a minor release, such as R-3.1.2, before upgrading their R environment.

If you do ever identify a bug in an R package, it is easy to report it by emailing the package maintainer. Unlike most commercially backed closed-source models, the open model allows a direct dialogue with the person developing the code. Once it has been established as a genuine bug, you can work with the maintainer on a solution and in some cases gain recognition as a package author for your efforts. Once a resolution to the issue has been established, your bug-fix is usually implemented in the next patched or minor release. This means you typically never have to wait more than a couple of months for a bug to be fixed.
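Before writing to a maintainer, it helps to gather the details they are likely to ask for. The sketch below uses the maintainer(), packageVersion(), and sessionInfo() functions from the utils package; the stats package is used purely as a stand-in because it ships with every R installation, and the variable names are our own.

```r
# Look up who maintains a package before reporting a bug.
# "stats" ships with R, so it is always available for a demonstration.
pkg <- "stats"
who <- utils::maintainer(pkg)        # character string: "Name <email>"
ver <- utils::packageVersion(pkg)    # installed version, worth quoting in a report

cat("Package:   ", pkg, "\n")
cat("Version:   ", as.character(ver), "\n")
cat("Maintainer:", who, "\n")

# sessionInfo() summarises the R version, platform, and loaded packages.
info <- utils::sessionInfo()
print(info$R.version$version.string)
```

Including the package version and the session information in your message saves the maintainer a round of questions.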

Summary

During this hour you were presented with a brief history of the evolution of S and then R. Along the way you heard terms such as “S3” and “S4,” deriving from S, which will be mentioned at various points throughout the remaining hours and covered specifically during Hours 21 and 22.

You were introduced to the R community and the various groups that support the R language: the R Development Core Team, the R Foundation, and the R Consortium. We looked at a selection of the available online resources and touched on the difficulties of searching for R help. Finally, we discussed the development cycle of R and what it means for bugs in the code.

In the “Activities” section, we install R and the RStudio integrated development environment (IDE). In the next hour, we will begin to use and explore R through the RStudio IDE.

Q&A

Q. With so many versions of R, should I be worried about backward compatibility?

A. If we consider the base R language and ignore the many thousands of additional packages available to download from CRAN, it is fair to say that R is pretty backward compatible. Indeed there are many features of R today that exist due to decisions made when the S language was developed. However, the same cannot be said for the thousands of contributed packages residing in the main CRAN repository. Even some of the best known and respected R package authors change their minds from time to time, and package version numbers can make a big difference. Ensuring quality and consistency across R packages is one of the biggest challenges facing the R Foundation today.

Q. A colleague of mine has sent me a bunch of S scripts. Will they run in R?

A. The official line is, “There are some important differences, but much code written for S runs unaltered under R.” This is very much the case for day-to-day code, with a few notable exceptions. The function for calculating the standard deviation in S is stdev compared with sd in R, for example. For slightly more advanced users, scoping can become an issue because R uses lexical scoping (one of the “important differences”), but in essence the official line is spot on. To the naked eye, S and R code look very similar indeed.
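As a rough illustration of both points, the sketch below defines a stand-in for the S name stdev on top of R's sd and then shows R's lexical scoping at work. The wrapper is a simplified, hypothetical shim (it does not attempt to reproduce the full argument list of the S function), and make_counter is an invented example name.

```r
# Hypothetical shim: let scripts that call stdev() run by delegating to sd().
# Only the data and na.rm arguments are handled here.
if (!exists("stdev")) {
  stdev <- function(x, na.rm = FALSE) stats::sd(x, na.rm = na.rm)
}

x <- c(2, 4, 4, 4, 5, 5, 7, 9)
print(stdev(x))

# Lexical scoping: the inner function remembers the variable n from the
# environment in which it was created, so each counter keeps its own state.
make_counter <- function() {
  n <- 0
  function() {
    n <<- n + 1   # updates n in the enclosing function's environment
    n
  }
}

counter <- make_counter()
counter()
second <- counter()
print(second)
```

Code that relies on this kind of enclosed state is where ported S scripts are most likely to need attention.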

Workshop

The workshop contains quiz questions and exercises to help you solidify your understanding of the material covered. Try to answer all questions before looking at the “Answers” section that follows.

Quiz

1. Which “similar” programming language predated R?

2. What does the acronym CRAN stand for?

3. Which group of R enthusiasts controls the write-access to the R source and is responsible for the distribution of the R language?

A. The R Development Core Team

B. The R Foundation

C. The R Consortium

Answers

1. The S language.

2. Comprehensive R Archive Network.

3. The R Development Core Team is directly responsible, though the resources and support surrounding each release could also be considered the responsibility of the R Foundation or R Consortium.

Activities

1. Refer to the “Installing R” section of this book’s Appendix. Download and install the appropriate version of R for your operating system.

2. Refer to the “Installing RStudio” section of this book’s Appendix. Download and install the latest version of RStudio Desktop from the RStudio website.
