15

Applications

So far we have been introducing the usage of knitr with short examples for the sake of simplicity. In this chapter we use some concrete and complete examples to show how knitr works with real applications; we do not explain every single detail of these applications, and we only point out the critical parts in them.

15.1    Homework

For homework, R Markdown is often the preferred document format because of its simplicity, and because homework is usually not intended for publication. As mentioned before, RPubs (http://rpubs.com) is a platform for sharing (HTML) reports generated from RStudio by knitr, and many homework submissions can be found there, too.

Since a homework report is relatively simple, we may not need too many knitr features; some common features used in homework are: set the size of plots (fig.width and fig.height), hide the source code because the grader may not wish to read it (echo = FALSE), and enable cache for time-consuming computing jobs (cache = TRUE), etc. Other features that come by default such as tidy = TRUE and highlight = TRUE can help users who do not care about coding styles produce more readable code in the output document.

Now we show an example of Gibbs sampling. For the bivariate Normal distribution

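$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \begin{bmatrix} \sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2 \end{bmatrix} \right) \tag{15.1}$$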

we know the conditional distributions

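$$\begin{aligned} Y \mid X = x &\sim N\left(\mu_Y + \frac{\sigma_Y}{\sigma_X}\rho(x - \mu_X),\ (1 - \rho^2)\sigma_Y^2\right) \\ X \mid Y = y &\sim N\left(\mu_X + \frac{\sigma_X}{\sigma_Y}\rho(y - \mu_Y),\ (1 - \rho^2)\sigma_X^2\right) \end{aligned} \tag{15.2}$$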

so we can use Gibbs sampling to generate random numbers from the joint Normal distribution. First we initialize $x^{(0)}$ and $y^{(0)}$, then repeatedly generate $x^{(k)} \sim f(x \mid y^{(k-1)})$ and $y^{(k)} \sim f(y \mid x^{(k)})$. The R code below is a translation of Equation (15.2):

Image
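A minimal sketch of such a sampler is given here. The function name gibbs_binorm(), its argument names, and the starting values x0 and y0 are illustrative assumptions; the default parameter values are the ones used for Figure 15.1.

```r
# Gibbs sampler for a bivariate Normal distribution, following Equation (15.2).
# Function and argument names are illustrative assumptions.
gibbs_binorm <- function(n, mu_x = 0, sd_x = 2, mu_y = 1, sd_y = 3, rho = 0.7,
                         x0 = 0, y0 = 0) {
  stopifnot(n >= 1, abs(rho) < 1)
  out <- matrix(NA_real_, nrow = n, ncol = 2,
                dimnames = list(NULL, c("x", "y")))
  x <- x0
  y <- y0
  # the conditional standard deviations do not depend on the current state
  cond_sd_x <- sqrt(1 - rho^2) * sd_x
  cond_sd_y <- sqrt(1 - rho^2) * sd_y
  for (k in seq_len(n)) {
    # x^(k) ~ f(x | y^(k-1))
    x <- rnorm(1, mu_x + sd_x / sd_y * rho * (y - mu_y), cond_sd_x)
    # y^(k) ~ f(y | x^(k))
    y <- rnorm(1, mu_y + sd_y / sd_x * rho * (x - mu_x), cond_sd_y)
    out[k, ] <- c(x, y)
  }
  out
}

# the first few steps of a chain
set.seed(1)
head(gibbs_binorm(20))
```

Each iteration updates x given the previous y and then y given the new x, so every row of the returned matrix is one complete Gibbs step.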

Figure 15.1 shows the first 20 steps of Gibbs sampling for the bivariate Normal distribution with $\mu_X = 0$, $\sigma_X = 2$, $\mu_Y = 1$, $\sigma_Y = 3$, $\rho = 0.7$.

Image

And we can draw some samples as well:

Image

Figure 15.2 shows 5,000 samples from this distribution, and we can calculate the sample means, standard deviations, and the correlation, which should be close to the corresponding theoretical values:

Image

FIGURE 15.1: Trace of Gibbs sampling for a bivariate Normal distribution: the arrows show the first 20 steps of Gibbs sampling.

Image

FIGURE 15.2: 5000 points from Gibbs sampling: the smoothed scatterplot shows the density of the 2D distribution.

Image

In this small application, we used cache (although this particular example is not too slow) and TikZ graphics. We adjusted the plot sizes (5 × 3 for Figure 15.1 and 5 × 4 for Figure 15.2). Note that the narrative and code chunks are interwoven, so the reader can learn the theory, see the computing, and verify the results in the same report. Everything is transparent, and errors are easy to spot. Sometimes the code we write does not really reflect what we said in theory, and such errors are hard to find if we separate computing from reporting.

In terms of data, code and software sharing, we cannot yet rely on goodwill and self discipline when it comes to sharing publication material and making studies fully reproducible.

Huang and Gottardo (2013)

Comparability and reproducibility of biomedical data

People have been proposing sharing data, code, and software in data analysis for the sake of reproducible research, e.g., Huang and Gottardo (2013). We believe that more efforts in education should be an important step, and we can start with reproducible homework.

15.2    Serve Dynamic Documents

The servr package (Xie, 2015c) provides some simple HTTP server functions to serve files under a given directory based on the httpuv package. To some degree, this package is like python -m SimpleHTTPServer or python -m http.server if you are familiar with Python. Originally it was designed to serve static files under a directory, and the main function was httd():

Image

If you run the above function in the R console, R will launch your Web browser to show a list of files under the current working directory (./), or show index.html if this file exists. You can click the links on the files to view their content.

Later servr was extended based on knitr and rmarkdown, so it can also serve dynamic R Markdown documents. There are functions jekyll(), rmdv1(), and rmdv2() in this package to serve HTML files generated from R Markdown documents (via knitr or rmarkdown). R Markdown documents can be automatically recompiled when their HTML output files are older than the corresponding source files, and HTML pages in the Web browser can be automatically refreshed accordingly, so you can focus on writing R Markdown documents, and results will be updated on the fly in the Web browser. This saves you two steps: click the Knit HTML button, and refresh the Web browser. Both steps can be distracting when you write a report. With servr, all you need to do is write the R Markdown document after you launch a server.

This is even more useful when you write R Markdown documents in the RStudio IDE, because servr has set the Web browser to be the RStudio Viewer by default when it detects the RStudio IDE, and you can put the source document and its output side by side like the layout in Figure 15.3. It is completely fine if you do not use RStudio — the automatic compilation and refreshing also work if you use other editors and Web browsers.

The functions rmdv1() and rmdv2() correspond to R Markdown v1 and v2, respectively. After you call servr::rmdv1() or servr::rmdv2() in the R console, you can click the HTML file foo.html if it has its source document foo.Rmd, and view the HTML output. Then whenever you edit foo.Rmd and save it, servr will automatically recompile it and refresh the HTML output page.

The function jekyll() is like rmdv1() and rmdv2(), but is tailored for Jekyll websites. We have briefly introduced Jekyll in Section 13.4. It is tedious to compile R Markdown posts or pages to Markdown again and again, and that is why jekyll() can be useful. Once you call the function servr::jekyll() in the root directory of a Jekyll website, you will get a preview of the website in your Web browser. Besides, as you edit and save your blog post, the Web browser will refresh the page to show the updated output. The knitr-jekyll repository (https://github.com/yihui/knitr-jekyll) is an example of serving Jekyll websites using servr.

Image

FIGURE 15.3: The layout of an R Markdown document (top-left panel) and its output in the RStudio Viewer (right panel): we typed a servr function in the R console (bottom-left), and the output of the R Markdown document is shown in the RStudio Viewer. This figure is only for illustration purposes; see https://github.com/yihui/servr for the original image if you want to read the text in it.

Later we will introduce package vignettes in Section 15.4, and the function vign() in servr can be used to serve HTML vignettes while we develop an R package. Its advantage is that it does not preserve the HTML output file in the source package when serving the vignette, which makes the source package clean.

For those who are curious about the technical details, the implementation is based on WebSockets. When servr shows an HTML page, it also injects a piece of JavaScript code into it to set up a WebSocket connection that talks to R periodically (e.g., once per second). Every time R receives a request from the WebSocket, it compares the timestamps of Rmd files with those of their output HTML files. If an Rmd file is newer than its HTML output, servr calls knitr or rmarkdown to recompile the Rmd file to HTML, then sends a message back to the WebSocket.

Image

FIGURE 15.4: A Makefile example for the function make() in servr: the HTML file to be generated is specified in the target all, and a rule is specified on how to generate an HTML file from an Rmd file via rmarkdown.

When the WebSocket receives this message, it calls location.reload() in JavaScript to refresh the page.
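The timestamp comparison at the heart of this process can be sketched in a few lines of base R. This is only an illustration of the idea, not the actual code in servr; the helper names needs_rebuild() and stale_rmd() are hypothetical.

```r
# TRUE if the HTML output is missing or older than its Rmd source.
# An illustration of the timestamp comparison, not servr's actual code.
needs_rebuild <- function(rmd, html = sub("\\.Rmd$", ".html", rmd)) {
  !file.exists(html) || file.mtime(rmd) > file.mtime(html)
}

# all Rmd files in a directory whose HTML output is out of date
stale_rmd <- function(dir = ".") {
  rmd <- list.files(dir, pattern = "\\.Rmd$", full.names = TRUE)
  rmd[vapply(rmd, needs_rebuild, logical(1))]
}

stale_rmd(".")
```

A server would call something like stale_rmd() on each request, recompile the files it returns, and then notify the browser.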

A critical step in this process is to check if we need to recompile any Rmd files. This is a task that GNU Make (http://www.gnu.org/software/make/) is good at, so servr also provides a function make() so that you can supply your own Makefile to rebuild Rmd files when necessary. Figure 15.4 is an example Makefile for the make() function.

By default, a server function will block the current R session, which can be a problem if you want to continue working in the same R session. To solve this problem, you can use the argument daemon = TRUE for the server function, e.g., httd(daemon = TRUE), or rmdv2(daemon = TRUE). This tells servr to launch a daemonized server that will not block the current R session.
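As a small sketch of the daemonized mode, the wrapper below starts a static file server without blocking the console. The wrapper name serve_in_background() is hypothetical; httd() and its daemon argument are the servr features described above, and the server is only launched in an interactive session.

```r
# Hypothetical wrapper: serve a directory in the background with servr.
serve_in_background <- function(dir = ".") {
  if (!requireNamespace("servr", quietly = TRUE)) {
    stop("the servr package is required")
  }
  # daemon = TRUE returns control to the R console immediately
  servr::httd(dir, daemon = TRUE)
}

if (interactive()) {
  serve_in_background(".")
  # ... continue working in the same R session ...
  servr::daemon_stop()  # stop the daemonized server(s) when done
}
```

The same daemon argument can be passed to the other server functions, e.g., rmdv2(daemon = TRUE).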

15.3    Website and Blogging

We introduce a few websites and blogs built upon knitr in this section, and the Web pages are created from either R Markdown or R HTML.

15.3.1  Vistat and Rcpp Gallery

Vistat (http://vis.supstat.com) is a website based on R Markdown and Jekyll (Section 13.4). It aims to provide a gallery of reproducible statistical graphics. The repository for the website is publicly available on Github: https://github.com/supstat/vistat.

The core of this repository is the R script ./_bin/knit, which sets some global chunk options and compiles Rmd documents to Markdown output. Math equations are rendered by MathJax, animations are supported through the SciAnimator library (Section 7.3.1), and we can also create Web graphics via the D3 library.

After knitr has compiled Rmd source files to Markdown files, Jekyll can compile Markdown to HTML, which gives us a website.

The Rcpp Gallery (http://gallery.rcpp.org) is a website for Rcpp (Eddelbuettel et al., 2015) articles and examples, and it is also built on R Markdown; in particular, it uses knitr’s Rcpp engine (Section 11.2.1).

15.3.2  UCLA R Tutorial

The UCLA Statistical Consulting Group has maintained software tutorials for several statistical packages for many years, and one of them is dedicated to R: http://www.ats.ucla.edu/stat/r/. Before 2012, this website was built by cut-and-paste: the results were generated in R and copied into the HTML pages. After knitr was released in 2012, one of the Web administrators, Joshua Wiley, decided to rewrite the R tutorial pages with knitr using the R HTML format. Now it is much easier to maintain the Web pages, and the R output is also much more reproducible. After R is updated or any dataset is changed, the whole website can be rebuilt automatically by compiling all source documents again.

15.3.3  The cda and RHadoop Wiki

Github has an integrated Wiki system for each repository. We can write wiki pages in a variety of formats, such as Markdown and reStructuredText. Each page is essentially a file, and the wiki is essentially a Git repository; therefore we can write Rmd files, compile them to Markdown files, and push them to Github through Git.

The cda package (Auguie, 2013) used the above approach to build its wiki site on Github: https://github.com/baptiste/cda/wiki. We can find the Rmd source files under the wiki directory of the package.

The RHadoop project has a similar wiki at https://github.com/RevolutionAnalytics/RHadoop/wiki.

15.3.4  The ggbio Package

The ggbio package (Yin et al., 2012) is an R implementation for extending the Grammar of Graphics for genomic data based on the ggplot2 package. It has a website, http://tengfei.github.com/ggbio/, on which we can find its documentation. The function knit_rd() (Section 12.4.8) was used to compile its R documentation pages to HTML, so we can directly see the output of the examples. Once this package has been installed, it only needs one line of code to get the HTML pages:

Image

Then we can publish the HTML files to Github, and we do not need to do anything with the images because they are base64 encoded in the files.

By the way, the ggbio package also has a PDF vignette written with knitr, which can be found on the website or with the command:

Image

15.3.5  Geospatial Data in R and Beyond

Barry Rowlingson gave a tutorial workshop on geospatial data analysis in R at the useR! 2012 conference, and here is the corresponding website: http://www.maths.lancs.ac.uk/~rowlings/Teaching/UseR2012/. The website was created from R HTML files and has a nice style from Twitter Bootstrap (a popular CSS framework). The advantage of using R HTML over R Markdown is that we have full control of the style; this website is a good example of arranging R code chunks and output in div elements with custom CSS styles.

15.4    Package Vignettes

As discussed by Gentleman and Temple Lang (2004), R packages have the great potential of building and disseminating reproducible reports, besides their obvious functionality of providing computing routines. Specifically, R package vignettes can be an ideal format for writing reproducible reports, with other components of the package providing the infrastructure such as functions, unit tests, and datasets. An R package vignette is just like a paper, and the output is dynamically compiled from its source document during the package building process, i.e., R CMD build.

Versions of R before 3.0.0 use Sweave to build package vignettes. Due to the limitations of Sweave (Section 16.1) and the barrier of LaTeX, R package vignettes were not widely used before R 3.0.0. Bioconductor is an exception, though, because vignettes are mandatory for packages on Bioconductor.

It has become much more natural and easy to compile package vignettes since R 3.0.0, thanks to Henrik Bengtsson, Duncan Murdoch, and R core. Now there are more than 500 package vignettes compiled from knitr in about 300 packages on CRAN (https://gist.github.com/yihui/7698648). In the next section, we introduce knitr vignette engines, and then we show a few examples. Sections 15.4.3 and 15.4.4 are only for those who are interested in older versions of R, and we no longer recommend that you use the tricks mentioned in these two sections.

15.4.1  Vignette Metadata and Engines

To use knitr to build vignettes, we only need to follow these simple steps:

•  specify a vignette engine, such as %VignetteEngine{knitr::knitr}, in the vignette source document (e.g., an Rnw or Rmd file)

•  add a field VignetteBuilder: knitr in the package DESCRIPTION file

•  add knitr to the Suggests field in DESCRIPTION

Then we can write vignettes using the knitr syntax (e.g., <<>>= or ```{r} for code chunks). Remember that vignettes are put under the vignettes/ directory of the package root directory.

According to the R manual “Writing R Extensions,” we also have to write the title of the vignette in VignetteIndexEntry{}. There are a few other optional metadata specifications such as VignetteKeyword{}. See Figure 15.5 for an example of the vignette metadata (title and vignette engine) for an R Markdown v2 vignette in knitr. After we build the package, the vignettes will be listed in an HTML index page.

The knitr package has several PDF and HTML vignettes compiled in this way, and we can view them by running:

Image

The vignette engine knitr::knitr is only one of the possible engines in knitr. To see all of them, you can use the function vignetteEngine() in the tools package:

Image
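A sketch of such a call is given below. It assumes that knitr is installed; loading its namespace is what registers the engines, and the names are printed in the form package::engine.

```r
# List the vignette engines registered by knitr (requires knitr).
if (requireNamespace("knitr", quietly = TRUE)) {
  # loading the knitr namespace registers its engines with the tools package
  engines <- names(tools::vignetteEngine(package = "knitr"))
  print(engines)
}
```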

FIGURE 15.5: The metadata of a knitr vignette: this is extracted from the knitr vignette, and you can find it from system.file('doc', 'knitr-intro.Rmd', package = 'knitr').

Image

The engines with the suffix _notangle have the same weave functions as those without the suffix, but have disabled the tangle function, meaning that there will not be R scripts generated from vignettes during R CMD build or R CMD check. Sometimes we may not want to tangle R scripts from vignettes, because it is redundant for R CMD check to run the same code again after the code has been executed in weave, and currently the inline R code expressions are not included in the tangle output, which can also cause problems.

Please note the :: operator has no special meaning in a vignette engine. It can be misleading because :: is an operator in base R that fetches an exported object from a package, e.g., stats::lm. However, in the vignette engine notation, :: is nothing but a delimiter that separates the package name from the engine name, so knitr::rmarkdown does not mean rmarkdown is a function in knitr, but only one of the vignette engines in knitr.

When you use the rmarkdown vignette engine, you are free to choose the output format, as long as the filename extension is .html or .pdf, because R only recognizes these two types of vignette output at the moment. When the output format is HTML, it can be an HTML document, or any of the HTML5 presentations (e.g., ioslides or Slidy). When it is PDF, it can be either a PDF document or Beamer slides.

15.4.2  Vignette Examples

We have put together a list of vignettes from current CRAN packages using the knitr vignette engines at https://gist.github.com/yihui/7698648, and you can learn from these examples.

The ggplot2 transition guide by Murphy (2012) is a great example of an R package vignette, although it is not shipped with the ggplot2 package. This guide was intended to announce new features and explain changes in ggplot2 0.9.0, which may affect users of older versions.

One nice feature of this guide is that we can compile the Rnw document to either a color or a black/white version, which is controlled by a global variable bw_version; if it is TRUE, a black and white version will be produced. This is achieved by setting the chunk options eval = bw_version and echo = bw_version for the chunks that produce black/white plots, and in ggplot2 this means theme_bw() and gray scales such as scale_fill_grey(). When bw_version is FALSE, these chunks will be hidden from the output (the source code is neither evaluated nor echoed). Similarly, there are some other chunks that have the options eval = !bw_version and echo = !bw_version, and these chunks produce color plots. In all, we can control whether the PDF output is color or black/white by a single variable, which is very convenient (recall Section 5.1.1). Figure 15.6 is a sample page of the transition guide from the color version.
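The switch can be illustrated with a toy document. The sketch below uses R Markdown chunks instead of Rnw for brevity, and base R assignments instead of ggplot2 plots; the chunk contents and the helper render_version() are assumptions made for illustration, and only the pattern of chunk options follows the guide.

```r
library(knitr)

# Two complementary chunks: exactly one of them is evaluated and echoed,
# depending on the value of bw_version.
src <- c(
  "```{r bw, eval=bw_version, echo=bw_version}",
  "palette_used <- 'black and white'",
  "```",
  "",
  "```{r color, eval=!bw_version, echo=!bw_version}",
  "palette_used <- 'color'",
  "```"
)

# Hypothetical helper: knit the toy document with a given switch value.
render_version <- function(bw) {
  env <- new.env()
  assign("bw_version", bw, envir = env)  # the single switch
  knit(text = src, envir = env, quiet = TRUE)
}

cat(render_version(TRUE))
```

Calling render_version(FALSE) instead keeps only the second chunk in the output.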

The corrplot package (Wei, 2013) has an example of HTML vignettes. You can find the source document of its vignette on Github at https://github.com/taiyun/corrplot/tree/master/vignettes. Obviously, it is an Rmd document (Section 5.2.1). Note it uses R Markdown v1. Open it with a text editor (e.g., RStudio) and we will see R code chunks in it. We can view the HTML vignette compiled from it in the Web browser by running:

Image

This shows the HTML index page of the corrplot documentation, and we can see the link to the vignette “Overview of user guides and package vignettes.” Since corrplot is a package for visualizing correlation matrices, it has many graphical examples, which are shown in its HTML vignette.

Image

FIGURE 15.6: A sample page of the ggplot2 transition guide: introducing the new geom added to ggplot2 0.9.0 — geom_violin().

Image

FIGURE 15.7: The Makefile to compile PDF vignettes using knitr: use knit2pdf() to compile Rnw documents to PDF.

The source package of knitr contains a mixture of PDF and HTML vignettes, all of which are listed in the HTML help page of this package.

The sampSurf package (Gove, 2013) also has a nice HTML vignette at http://sampsurf.r-forge.r-project.org, which was created from an R HTML source document and even contains some 3D plots produced by the rgl package.

15.4.3  PDF Vignette

If we want to build vignettes with knitr for R < 3.0.0, we have to use some tricks. One way to do this is through a Makefile (http://www.gnu.org/software/make/), which will be used by R CMD build when building vignettes. In this Makefile, we can set our rules to create the PDF file using a custom tool like knitr.

The Makefile is under the vignettes/ directory in the source package. When R compiles vignettes, it calls Sweave() first; if there is a Makefile, the make command will be run on it. In the Makefile, we also have access to R, so it is possible to call knitr via command line to compile vignettes. Figure 15.7 shows a sample of the Makefile to be used to compile vignettes with knitr. The key is to run knitr::knit2pdf() on the Rnw files; we put all PDF files to be generated in the variable PDFS.

Obviously, the disadvantage of this approach is that all Rnw documents have to be compiled by Sweave before any further processing. Besides, the new approach in R >= 3.0.0 does not require the make utility to be installed.

Image

FIGURE 15.8: The Makefile to compile HTML vignettes: use knit2html() to compile Rmd documents to HTML.

15.4.4  HTML Vignette

Similarly, we can create package vignettes in the HTML format from R Markdown documents. Again, the HTML vignettes had to be compiled by a Makefile before R 3.0.0. Figure 15.8 shows the source of a sample Makefile for building HTML vignettes, where the function knit2html() was called. Note make clean will remove the figure/ directory, which is due to the fact that images generated by knitr will be base64 encoded in the HTML output, so the image files are no longer needed.

15.5    Books

We can also write books with knitr. At the time of writing this book, at least one book has been published (Lebanon, 2012), and the book Regression Modeling Strategies (Harrell, 2001) is under revision for a new edition, which is based on knitr.

15.5.1  This Book

In the spirit of “eating one’s own dog food” (see Wikipedia if this is unclear), this book was written with knitr in LyX (see Section 4.2). The whole book is in one LyX file, although it is entirely possible to split chapters into individual files.

A few chunk options were set globally at the very beginning of the document, such as cache = TRUE (for speed), dev = 'tikz' (for style of graphics), and fig.align = 'center' (for alignment of plots). We also set options(formatR.arrow = TRUE) (see the formatR package), because the author prefers = to <- as the assignment operator, but <- is more commonly used by R users; this option replaces the equal signs with left arrows automatically wherever applicable, although only equal signs were actually typed.

We have a few chunk hooks (Chapter 10) in this book for various purposes. For example, there is a par hook that sets the graphical parameters to this:

Image

So when we want to use this set of parameters, we just add a chunk option par = TRUE instead of having to type it again and again.
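A chunk hook of this kind can be sketched as follows. The graphical parameter values are illustrative assumptions, not necessarily those used for this book; the general form, a function of before, options, and envir registered through knit_hooks$set(), is the one introduced in Chapter 10.

```r
library(knitr)

# A chunk hook named "par": it runs for every chunk with par = TRUE.
# The parameter values below are illustrative assumptions.
knit_hooks$set(par = function(before, options, envir) {
  # set the graphical parameters only before the chunk's code is run
  if (before) {
    par(mar = c(4, 4, 0.1, 0.1), mgp = c(2, 0.7, 0), tcl = -0.3)
  }
})
```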

Although we see the code chunks and the plots are separate in this book, that is not true in the source document: the code chunks are actually inside the figure environments, but we used the document hook hook_movecode() to move code chunks out of the figure environments eventually.

Because we have to show chunk headers occasionally for pedagogical purposes, we have a chunk hook named append to add <<>>= and @ to the chunk output:

Image

Basically this hook enables us to write additional character strings before and/or after a chunk; e.g., we can use the chunk option append = list('<<A>>=', '@') to add the syntax information to the chunk output. We need this hook because we cannot write the chunk headers directly in the source document; otherwise they would be parsed and disappear from the final output.

There is an output hook that modifies the default plot hook function by adding a frame box to a plot, and it was used in Figure 10.3 and Figure 10.4.

The bibliography database of all R packages is dynamically written by the write_bib() function as introduced in Section 12.4.1, so it is guaranteed that the version information is up to date (at least before the manuscript was submitted to the publisher).
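As a reminder of how this works, the sketch below writes the citation entries of two packages to a temporary BibTeX file; the choice of packages and the use of a temporary file are assumptions made for illustration.

```r
library(knitr)

# Write BibTeX entries for the installed versions of some packages.
bib <- tempfile(fileext = ".bib")
write_bib(c("base", "knitr"), file = bib)

# Each entry is keyed by the package name with the prefix "R-" by default.
head(readLines(bib))
```

Because the entries are generated from the installed packages at compile time, rerunning the document after an upgrade refreshes the version numbers.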

15.5.2  The Analysis of Data

Another notable example is the book The Analysis of Data by Lebanon (2012); the most notable feature of this book is that it has both PDF and HTML versions. The HTML version is freely available at http://theanalysisofdata.com. Both versions are produced from essentially the same set of source documents. The HTML version needs some additional settings; for example, math equations are typeset by the MathJax library, so it has to be included in the head section of the HTML source.

15.5.3  The Statistical Sleuth in R

The Statistical Sleuth (Ramsey and Schafer, 2002) is an excellent text in statistics, and one feature of this book is that it has a large number of datasets. The book itself was not written with knitr, but some other authors (Horton et al., 2012) have created a website (http://www.math.smith.edu/~nhorton/sleuth/) in which they re-did a lot of the data analysis examples in the book in R. You can check out both the PDF documents and the Rnw source files on the website.

15.5.4  Text Analysis with R for Students of Literature

The book Text Analysis with R for Students of Literature by Jockers (2014) was written using LaTeX and knitr. The most amazing fact about this book is perhaps that its author taught himself LaTeX before he started putting together this book in LaTeX, and finished the book draft in just a couple of months. The book is an introduction to computational text analysis, and has a lot of short examples. It would be extremely tedious if the author had to run each example and copy the output to the LaTeX manuscript by hand.

15.6    Literate Programming for R Packages

Although we have introduced Literate Programming (LP) in the beginning of this book, we do not actually use the knitr package for programming purposes. Most of the time we use knitr for data analysis and reporting purposes instead. The original LP paradigm is about both weaving and tangling: we may weave a source document to software documentation, or tangle the program code to execute it. Apparently, we do not really have to tangle the program code for execution purposes when using knitr, because code execution occurs right in the process of weaving.

Interestingly, the most common application of Knuth’s original LP paradigm seems to be documenting software (using a special form of comments) for users instead of “programming” for package authors. In other words, we use LP to document the usage of software, instead of documenting the source code. See Doxygen (van Heesch, 2008), Javadoc (http://en.wikipedia.org/wiki/Javadoc), and roxygen2 (Wickham et al., 2015) for examples. There exists one exception, though, in the LaTeX world. Some LaTeX package authors write both LaTeX code and documentation in a single document, and weave it into a PDF document that contains both the source code and documentation. This is not entirely surprising, considering Knuth’s original implementation of LP using TeX and Pascal. There is a small number of R packages using LP as well, such as Terry Therneau’s survival and coxme packages.

LP does not seem to be a popular approach to programming, but it is still an interesting idea, and can be useful especially when it is applied to your own favorite language. It may be boring for some people to read LaTeX source code, but reading R source code can be more pleasant. Subjective opinions aside, we believe LP has at least two advantages:

1.  You can write much more extensive and richer documentation than you normally could do with comments. In general, comments in code are (or should be) brief and limited to plain text. Normally you will not write five paragraphs of comments to explain a few lines of code, and you cannot write readable math expressions or embed a video in comments.

2.  You can label code chunks and reference/reuse them using the labels, which allows you to compose your program flexibly using different pieces of code chunks. For example, you can define and explain a code chunk later in the document, but insert it in a previous code chunk using its label. This feature has been emphasized by Knuth, but it is not widely adopted for some reason. Perhaps most people are more comfortable with designing a big program by smaller units like functions instead of code chunks, which is actually a good idea.

In fact, we can apply LP to developing R packages. There are multiple ways to achieve the goal, and we only introduce one here, using the following tools:

1.  The purl() function in knitr, which makes it possible to extract program code from a source document;

2.  Package vignettes, which can contain both program code and documentation;

3.  GNU Make, which allows us to define when and how to generate an output file from a source file.

The rlp package (https://github.com/yihui/rlp) is an example of writing an R package using LP techniques. You can find details in this repository, and the basic idea of the implementation is:

1.  Instead of writing R source code under the R/ directory of the package, we can write the code in package vignettes (R Markdown) under the vignettes/ directory;

2.  Use a Makefile to define how to generate R scripts R/*.R from vignettes vignettes/*.Rmd;

3.  Run make to generate R scripts to R/ and R CMD build to build the package.
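The extraction step can be tried on a toy example. The sketch below writes a throwaway R Markdown "vignette" to a temporary file and tangles it with purl(); the file contents and the function add() are made up for illustration, and this is not necessarily how rlp organizes its build.

```r
library(knitr)

# A throwaway "vignette" holding both documentation and source code.
rmd <- tempfile(fileext = ".Rmd")
writeLines(c(
  "# Adding numbers",
  "",
  "The function `add()` returns the sum of its two arguments.",
  "",
  "```{r add}",
  "add <- function(x, y) x + y",
  "```"
), rmd)

# Tangle: extract only the R code into a script.
r_script <- tempfile(fileext = ".R")
purl(rmd, output = r_script, documentation = 0, quiet = TRUE)
readLines(r_script)
```

In a package, the input would be a file under vignettes/ and the output a file under R/, with a Makefile rule deciding when the script has to be regenerated.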

These steps can be made easy by using the RStudio IDE, and we can actually just click a button to do these steps. The implementation details are too technical and specific for this book, and we will leave it to the readers to go through the documentation of this package.
