Managing the library of packages

In R, packages play an indispensable role in data analysis and visualization. In fact, R itself is only a tiny core and is built on several basic packages. A package is a container of predefined functions, which are often designed to be general enough to solve a certain range of problems. Using a well-designed package, we don't have to reinvent the wheel again and again, which allows us to focus more on the problem we are trying to solve.

R is powerful not only because of its rich collection of packages, but also because of its well-maintained package archive system, the Comprehensive R Archive Network, or CRAN (http://cran.r-project.org/). The source code of R and of thousands of packages is archived in this system. At the time of writing, there are 7,750 active packages on CRAN maintained by more than 4,500 package maintainers around the world. Every week, more than 100 packages are updated and more than 2 million package downloads take place. You can check out the table of packages at https://cran.rstudio.com/web/packages/, which lists all the packages currently available.

Don't panic when you hear the number of packages on CRAN! The number is large and the coverage is wide, but you only have to learn a small fraction of them. If you focus on the work of a specific field, it is very likely that no more than 10 packages are heavily related to your work and field. Therefore, there's absolutely no need for you to know all the packages (nobody can, or even needs to), but only the most useful and field-related ones.

Instead of finding packages in the table, which is not that informative, I recommend that you visit CRAN Task Views (https://cran.rstudio.com/web/views/) and METACRAN (http://www.r-pkg.org/), and get started by learning about the packages that are most commonly used or closely related to your working field. Before learning how to use a specific package, we need to have a general idea about installing packages from different sources and understand how a package basically works.

Getting to know a package

A package is a collection of functions to solve a certain range of problems. It can be an implementation of a family of statistical estimators, data-mining methods, database interfaces, or optimization tools. To learn more about a package, for example, ggplot2, a very powerful graphics package, several information sources are useful:

  • Package description page (https://cran.rstudio.com/web/packages/ggplot2/): The page contains the basic information about the package, including the name, description, version, publishing date, authors, related websites, reference manuals, vignettes, relationship to other packages, and so on. The description page of a package is provided not only by CRAN, but also by some third-party package information websites. METACRAN also provides a description of ggplot2 at http://www.r-pkg.org/pkg/ggplot2.
  • Package website (http://ggplot2.org/): The webpage contains a description and related resources for the package, such as blogs, tutorials, and books. Not every package has a website, but if one does, the website is the official starting point for learning about the package.
  • Package source code (https://github.com/hadley/ggplot2): The authors host the source code of the package on GitHub (https://github.com), and this page is the source code repository of the package. If you are interested in the implementation of the package functions, you can check out the source code and take a look. If you find some unexpected behavior that looks like a bug, you can report it at https://github.com/hadley/ggplot2/issues. You can also file an issue at the same place to request a new feature.

After reading the package description, you can try it by installing the package to the R library.
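
For a package that is already installed, much of the metadata shown on its description page can also be read from within R. The following is a minimal sketch; it uses the stats package only because it ships with every R installation, and packageDescription() is a function from the utils package that is not mentioned in the text above.

```r
# Read the DESCRIPTION metadata of an installed package.
# "stats" is used because it ships with R; substitute any installed package.
desc <- utils::packageDescription("stats")

# Individual fields are accessed like list elements
pkg_name <- desc$Package
pkg_version <- desc$Version
pkg_title <- desc$Title

cat(pkg_name, pkg_version, "-", pkg_title, "\n")
```

The same call on a package such as ggplot2 should work once that package has been installed.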

Installing packages from CRAN

CRAN archives R packages and distributes them to more than 120 mirrors around the world. You can visit CRAN Mirrors (https://cran.r-project.org/mirrors.html) and check out a nearby mirror. If you find one, you can go to Tools | Global Options in RStudio and open the dialog where the CRAN mirror is set.

You can change the CRAN mirror to a nearby one or simply use the default mirror. In general, you will experience very fast downloading if you use a nearby mirror. In recent months, some mirrors have started using HTTPS to secure data transfers. If Use secure download method for HTTP is checked, then you can only view HTTPS mirrors.
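
The mirror can also be set from code rather than through the dialog, which may be handy in scripts. This sketch relies on the repos option that install.packages() consults by default; the URL https://cloud.r-project.org is not taken from the text above and is used purely as an illustrative mirror.

```r
# Remember the current setting so that it can be restored later
old_repos <- getOption("repos")

# Point the CRAN entry at a specific mirror (illustrative URL, an assumption)
options(repos = c(CRAN = "https://cloud.r-project.org"))

# install.packages() will now use this mirror by default
current_mirror <- getOption("repos")[["CRAN"]]
print(current_mirror)

# To undo the change: options(repos = old_repos)
```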

Once a mirror is chosen, downloading and installing a package in R becomes extremely easy. Just call install.packages("ggplot2"), and R will automatically download the package, install it, and sometimes compile it.

RStudio also provides an easy way to install packages. Just go to the Packages pane and click on Install. A dialog appears in which you can enter the names of the packages to install.

As the package description shows, a package may depend on other packages. In other words, when you call a function in the package, the function also calls some functions in other packages, which requires that you install those packages as well. Fortunately, install.packages() is smart enough to know the dependency structure of the package to install and will install those packages first.

On the main page of METACRAN (http://www.r-pkg.org/), featured packages are those with the most stars on GitHub. That is, these packages are starred by many GitHub users. You may want to install multiple featured packages in one call, which is naturally allowed if you write the package names as a character vector:

install.packages(c("ggplot2", "shiny", "knitr", "dplyr", "data.table"))

Then the install.packages() function automatically resolves the joint dependency structure of all these packages and installs them.
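
When a script is run repeatedly, it may be preferable to install only the packages that are not yet present. The following is a minimal sketch under that assumption; missing_packages() is a hypothetical helper, it uses installed.packages(), which is described later in this section, and the actual installation call is left commented out so that nothing is downloaded by accident.

```r
# Hypothetical helper: which of the requested packages are not yet installed?
missing_packages <- function(pkgs) {
  installed <- rownames(utils::installed.packages())
  setdiff(pkgs, installed)
}

wanted <- c("ggplot2", "shiny", "knitr", "dplyr", "data.table")
to_install <- missing_packages(wanted)

if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to download and install them
} else {
  message("All requested packages are already installed")
}
```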

Updating packages from CRAN

By default, the install.packages() function installs the latest version of the specified packages. Once they are installed, the package versions stay fixed. However, packages may be updated to fix bugs or add new features. Sometimes, an updated version of a package deprecates functions from older versions and issues warnings when they are used. In these cases, we may either keep the installed version or update it after reading the package's NEWS file, which is linked from the package description page (for example, https://cran.r-project.org/web/packages/ggplot2/news.html) and describes the changes in each new version, including any breaking changes.

RStudio provides an Update button next to Install in the Packages pane. We can also use the following function and choose which packages are going to be updated:

update.packages()

Both RStudio and the preceding function scan for newer versions of packages and install them, along with their dependencies if necessary.
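
To see what would be updated before changing anything, old.packages() from the utils package can be used. The following is only a sketch: list_outdated() is a hypothetical helper, and it is defined but not called here because it needs network access to the configured repositories.

```r
# Hypothetical helper: names of installed packages for which the configured
# repositories offer a newer version. Calling it requires network access.
list_outdated <- function(repos = getOption("repos")) {
  outdated <- utils::old.packages(repos = repos)
  if (is.null(outdated)) {
    return(character(0))  # nothing to update
  }
  unname(outdated[, "Package"])
}

# Typical interactive use (left commented out because it contacts CRAN):
# list_outdated()
# update.packages(ask = FALSE)  # update everything without prompting
```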

Installing packages from online repositories

Nowadays, many package authors host their work on GitHub because version control and community development are very easy there, thanks to its well-designed issue-tracking and pull request systems. Some authors do not release their work to CRAN, and others only release the stable versions to CRAN and keep new versions under development on GitHub.

If you want to try the latest development version, which often has new features or has fixed some bugs, you can directly install the package from the online repository using the devtools package.

First, install the devtools package if it does not appear in your library:

install.packages("devtools")

Then, use install_github() in the devtools package to install the latest development version of ggplot2:

library(devtools)
install_github("hadley/ggplot2")

The devtools package downloads the source code from GitHub and builds it into a package in your library. If the package is already in your library, the installation replaces it without asking. If you want to revert from the development version to the latest CRAN version, you can run the CRAN installation code again:

install.packages("ggplot2")

Then, the local version (GitHub version) is replaced by the CRAN version.

Using package functions

There are two ways to use the functions in a package. First, we can call library() to attach the package so that the functions in it can be directly called. Second, we can call package::function() to only use the function without attaching the whole package to the environment.

For example, some statistical estimators are not implemented as built-in functions in base R but in other packages. One instance is skewness; the statistical function is provided by the moments package.

To calculate the skewness of numeric vector x, we can attach the package first and directly call the function:

library(moments)
skewness(x)

Alternatively, we can call package functions without attaching the package, using the :: operator:

moments::skewness(x)

The two methods return the same result, but they work in different ways and have a different impact on the environment. More specifically, the first method (using library()) modifies the search path of symbols, whereas the second method (using ::) does not. When you call library(moments), the package is attached and added to the search path so that the package functions are directly available in subsequent code.

Sometimes, it is useful to see what packages we are using by calling sessionInfo():

sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
## 
## attached base packages:
## [1] stats graphics grDevices utils datasets
## [6] methods base
## 
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.2.1 tools_3.2.3
## [4] htmltools_0.3 yaml_2.1.13 stringi_1.0-1
## [7] rmarkdown_0.9.2 knitr_1.12 stringr_1.0.0
## [10] digest_0.6.8 evaluate_0.8

The session info shows the R version and lists the attached packages and loaded packages. When we use :: to access a function in a package, the package is not attached but loaded in memory. In this case, other functions in the package are still not directly available:

moments::skewness(c(1, 2, 3, 2, 1))
## [1] 0.3436216
sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
## 
## attached base packages:
## [1] stats graphics grDevices utils datasets
## [6] methods base
## 
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.2.1 tools_3.2.3
## [4] htmltools_0.3 yaml_2.1.13 stringi_1.0-1
## [7] rmarkdown_0.9.2 knitr_1.12 stringr_1.0.0
## [10] digest_0.6.8 moments_0.14 evaluate_0.8

This shows that the moments package is loaded but not attached. When we call library(moments), the package will be attached:

library(moments)
sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 10586)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
## 
## attached base packages:
## [1] stats graphics grDevices utils datasets
## [6] methods base
## 
## other attached packages:
## [1] moments_0.14
## 
## loaded via a namespace (and not attached):
## [1] magrittr_1.5 formatR_1.2.1 tools_3.2.3
## [4] htmltools_0.3 yaml_2.1.13 stringi_1.0-1
## [7] rmarkdown_0.9.2 knitr_1.12 stringr_1.0.0
## [10] digest_0.6.8 evaluate_0.8
skewness(c(1, 2, 3, 2, 1))
## [1] 0.3436216

Then, skewness() as well as other package functions in moments are directly available.
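
The difference between a loaded and an attached package can also be checked programmatically instead of reading the sessionInfo() output. The sketch below assumes the splines package as a stand-in for moments, only because splines ships with every R installation; is_attached() and is_loaded() are hypothetical helpers built on search() and loadedNamespaces().

```r
# Hypothetical helpers for inspecting the state of a package
is_attached <- function(pkg) paste0("package:", pkg) %in% search()
is_loaded <- function(pkg) pkg %in% loadedNamespaces()

attached_before <- is_attached("splines")

# Using :: loads the namespace but leaves the search path unchanged
basis <- splines::bs(1:10, df = 4)
loaded_after_colon <- is_loaded("splines")
attached_after_colon <- is_attached("splines")

# library() additionally attaches the package to the search path
library(splines)
attached_after_library <- is_attached("splines")

print(c(loaded = loaded_after_colon, attached = attached_after_library))
```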

An easier way to see attached packages is search():

search()
## [1] ".GlobalEnv" "package:moments" 
## [3] "package:stats" "package:graphics" 
## [5] "package:grDevices" "package:utils" 
## [7] "package:datasets" "package:methods" 
## [9] "Autoloads" "package:base"

The function returns the current search path of symbols. When you evaluate a function call that uses skewness, R looks for the skewness symbol in the current environment first. It then moves along the search path to package:moments, where the symbol is found. If the package is not attached, the symbol will not be found, so an error will occur. We will cover this symbol-finding mechanism in later chapters.
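
To see where on the search path a given symbol is found, find() from the utils package can be used. This is a small sketch that queries functions from packages attached by default; find() is not mentioned in the text above, and the last symbol name is made up so that the lookup fails.

```r
# utils::find() reports where on the search path a symbol is defined
where_median <- utils::find("median")  # expected to include package:stats
where_mean <- utils::find("mean")      # expected to include package:base

# A symbol that is defined nowhere on the search path gives an empty result
where_missing <- utils::find("no_such_function_xyz")

print(where_median)
print(where_mean)
print(length(where_missing))
```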

To attach a package, require() is similar to library(), but it returns a logical value to indicate whether the package is successfully attached:

loaded <- require(moments)
## Loading required package: moments
loaded
## [1] TRUE

This feature allows the following code to attach a package if it is installed or install it if it is not yet installed:

if (!require(moments)) {
  install.packages("moments")
  library(moments)
}

However, most uses of the require() function in user code are not like this. The following is typical:

require(moments)

This looks equivalent to using library() but has a silent drawback:

require(testPkg)
## Loading required package: testPkg
## Warning in library(package, lib.loc = lib.loc,
## character.only = TRUE, logical.return = TRUE, : there is no
## package called 'testPkg'

If the package to attach is not installed, or does not exist at all (perhaps because of a typo), require() only produces a warning, whereas library() produces an error:

library(testPkg)
## Error in library(testPkg): there is no package called 'testPkg'

Imagine you are running a long and time-consuming R script that depends on several packages. If you use require() and the computer running your script does not have the required packages installed, the script will only fail later, when a package function is called and cannot be found. However, if you use library() instead, the script will stop immediately if the packages are not installed on that computer. Yihui Xie wrote a blog post (http://yihui.name/en/2014/07/library-vs-require/) on this issue and proposed the fail fast principle: if the task has to fail, it is better to fail fast.
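
The difference in behavior can be observed in a single run by capturing the error from library(). The following sketch assumes, as in the text above, that no package called testPkg is installed.

```r
# "testPkg" is assumed not to be installed, as in the text above

# require() signals only a warning and returns FALSE
attached <- suppressWarnings(
  require("testPkg", character.only = TRUE, quietly = TRUE)
)

# library() signals an error, captured here so that the example can continue
outcome <- tryCatch(
  {
    library("testPkg", character.only = TRUE)
    "attached"
  },
  error = function(e) conditionMessage(e)
)

print(attached)
print(outcome)
```

In a real script, the library() call would simply be left unguarded at the top so that a missing package stops the run immediately.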

Masking and name conflicts

A fresh R session starts with basic packages automatically attached. The basic packages refer to base, stats, graphics, and so on. With these packages attached, you can calculate the average of a numeric vector using mean() and the median of it using median(), without using base::mean() and stats::median() or having to manually attach base and stats packages.

In fact, thousands of functions are immediately available from automatically attached packages, and each package defines a number of functions for a particular purpose. Therefore, it is likely that the functions in two packages conflict with each other. For example, suppose two packages A and B both have a function named X. If you attach A and then attach B, the function A::X will be masked by the function B::X. In other words, when you attach A and you call X(), then A's X is called. Then, you attach B and call X(); it is now B's X that is called. This mechanism is known as masking. The following example shows what happens when masking occurs.

The powerful data manipulation package dplyr defines a family of functions that make it easier to manipulate tabular data. When we attach the package, some messages are printed to show you that some existing functions are masked by the package functions with the same names:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
## filter, lag
## The following objects are masked from 'package:base':
## 
## intersect, setdiff, setequal, union

Fortunately, the set functions in dplyr (intersect, setdiff, setequal, and union) do not change the meaning and usage of the base versions, but generalize them, so they remain compatible with the masked versions. Note, however, that filter and lag in dplyr serve a different purpose from their namesakes in stats: dplyr's filter() selects rows of a table, whereas stats::filter() applies linear filtering to a time series. If you need a masked function, you can still call it explicitly, for example, stats::filter().

Package functions that mask basic functions often generalize rather than replace them. However, if you have to use two packages in which some functions share the same names, you had better not attach either package; instead, extract the functions you need from both packages, as shown here:

fun1 <- package1::some_function
fun2 <- package2::some_function
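
The lookup rule behind masking can be tried without installing anything. This sketch does not mask one package with another; instead, it defines a function in the working environment, which comes before every attached package on the search path, and then shows that the :: operator still reaches the original.

```r
# Shadow stats::median with a definition in the working environment
median <- function(x, ...) "masked version"

masked_result <- median(c(1, 2, 3))           # the new definition is found first
explicit_result <- stats::median(c(1, 2, 3))  # :: bypasses the search path

# Removing the shadowing definition makes the stats version visible again
rm(median)
restored_result <- median(c(1, 2, 3))

print(masked_result)
print(explicit_result)
print(restored_result)
```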

If you have attached a package and want to detach it, you can call unloadNamespace(), which detaches the package and unloads its namespace. For example, we have attached moments and we can detach it:

unloadNamespace("moments")

As soon as the package is detached, the package functions are no longer directly available:

skewness(c(1, 2, 3, 2, 1))
## Error in eval(expr, envir, enclos): could not find function "skewness"

However, you can still use :: to call the function:

moments::skewness(c(1, 2, 3, 2, 1))
## [1] 0.3436216
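
If the goal is only to remove a package from the search path while keeping its namespace loaded, detach() is an alternative that the text above does not cover. The following sketch contrasts the two calls, using the splines package as a stand-in because it ships with R.

```r
library(splines)  # attach a package that ships with R

# detach() removes the package from the search path only
detach("package:splines", character.only = TRUE)
attached_after_detach <- "package:splines" %in% search()
loaded_after_detach <- "splines" %in% loadedNamespaces()

# unloadNamespace() additionally unloads the namespace
unloadNamespace("splines")
loaded_after_unload <- "splines" %in% loadedNamespaces()

print(c(attached_after_detach, loaded_after_detach, loaded_after_unload))
```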

Checking whether a package is installed

It is useful to know that install.packages() performs the installation, while installed.packages() shows information about the installed packages, which is a matrix of 16 columns that covers a wide range of information:

pkgs <- installed.packages()
colnames(pkgs)
## [1] "Package" "LibPath" 
## [3] "Version" "Priority" 
## [5] "Depends" "Imports" 
## [7] "LinkingTo" "Suggests" 
## [9] "Enhances" "License" 
## [11] "License_is_FOSS" "License_restricts_use"
## [13] "OS_type" "MD5sum" 
## [15] "NeedsCompilation" "Built"

This can be useful when you need to check whether a package is installed:

c("moments", "testPkg") %in% installed.packages()[, "Package"]
## [1] TRUE FALSE
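
Because installed.packages() scans every library directory, it can be slow when many packages are installed. A lighter check, sketched here with a hypothetical is_installed() helper, asks system.file() for the installation directory of a single package, which is an empty string when the package is not found. As above, testPkg is assumed not to be installed.

```r
# Hypothetical helper: TRUE if the package has an installation directory
is_installed <- function(pkg) {
  nzchar(system.file(package = pkg))
}

# stats ships with R; testPkg is assumed not to exist
status <- vapply(c("stats", "testPkg"), is_installed, logical(1))
print(status)
```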

Sometimes, you need to check the version of a package:

installed.packages()["moments", "Version"]
## [1] "0.14"

A simpler way to get the package version is using the following command:

packageVersion("moments")
## [1] '0.14'

We can compare two package versions so that we can check whether a package is newer than a given version:

packageVersion("moments") >= package_version("0.14")
## [1] TRUE

In fact, we can directly use a string version to perform the comparison:

packageVersion("moments") >= "0.14"
## [1] TRUE

Checking package versions can be useful if your scripts depend on packages that must be equal to, or newer than, a specific version. This can be the case if your scripts rely on new features introduced in that version. In addition, packageVersion() produces an error if a package is not installed, so the same call also serves as a check of the package installation status.
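
Both checks can be combined at the top of a script so that it stops early with a clear message. The following is a minimal sketch; check_min_version() is a hypothetical helper, and the stats package and the version 3.0.0 are used only because they make the example pass on any recent R installation.

```r
# Hypothetical helper: stop early unless pkg is installed at min_version or newer
check_min_version <- function(pkg, min_version) {
  installed <- tryCatch(
    utils::packageVersion(pkg),
    error = function(e) NULL  # packageVersion() errors if pkg is missing
  )
  if (is.null(installed)) {
    stop("Package '", pkg, "' is not installed", call. = FALSE)
  }
  if (installed < min_version) {
    stop("Package '", pkg, "' ", as.character(installed),
         " is older than the required ", min_version, call. = FALSE)
  }
  invisible(installed)
}

# stats ships with R, so this check passes on any recent installation
stats_version <- check_min_version("stats", "3.0.0")
print(stats_version)
```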
