Chapter 23. Reproducibility, Reports and Slide Shows with knitr

Successfully delivering the results of an analysis can be just as important as the analysis itself, so it is vital to communicate them in an effective way. This can be a written report, a Web site of results, a slide show or a dashboard. In this chapter we focus on the first three, which are made remarkably easy using knitr, a package written by Yihui Xie.

knitr was initially created as a replacement for Sweave for the creation of PDF documents using LATEX interweaved with R code and the generated results. It has since added the capability to work with Markdown for generating HTML documents. While interweaving R code in LATEX and Markdown requires using a somewhat different syntax for each, the programs are similar enough to make them both easy to work with. First we discuss working with LATEX documents, and then Markdown.

The combination of knitr and RStudio is so powerful that it was possible to write this entire book inside the RStudio IDE using knitr to insert and run R code and graphics.

23.1. Installing a LATEX Program

LATEX (pronounced “lay-tech”) is a markup language based on the TEX typesetting system created by Donald Knuth. It is regularly used for writing scientific papers and books, including this one. Like any other program, LATEX must be installed before it can be used.

Each of the operating systems uses a different LATEX distribution. Table 23.1 lists OS-specific distributions and download locations.

Image

Table 23.1 LATEX Distributions and their Locations

23.2. LATEX Primer

This is not intended to be anywhere near a comprehensive lesson in LATEX, but it should be enough to get started with making documents. LATEX documents should be saved with a .tex extension to identify them as such. While RStudio is intended for working with R, it is a suitable text editor for LATEX and is the environment we will be using.

The very first line in a LATEX file declares the type of document, the most common being “article” and “book.” This is done with \documentclass{...}, replacing ... with the desired document class.

Immediately following the declaration of the documentclass is the preamble. This is where commands that affect the document go, such as what packages to load (LATEX packages) using \usepackage{...} and making an index with \makeindex.

In order to include images, it is advisable to use the graphicx package. This allows us to specify the type of image file that will be used by entering \DeclareGraphicsExtensions{.png,.jpg}, which means LATEX will first search for files ending in .png and then search for files ending in .jpg. This will be explained more when dealing with images later.

This is also where the title, author and date are declared with \title, \author and \date, respectively. New shortcuts can be created here such as \newcommand{\dataframe}{\texttt{data.frame}}, so that every time \dataframe{} is typed it will be printed as data.frame, which appears in a typewriter font because of the \texttt{...}.

The actual document starts with \begin{document} and ends with \end{document}. That is where all the content goes. So far our LATEX document looks like the following example.


\documentclass{article}
% this is a comment
% all content following a % on a line will be commented out as if it
% never existed to latex

\usepackage{graphicx} % use graphics
\DeclareGraphicsExtensions{.png,.jpg} % search for png then jpg

% define shortcut for dataframe
\newcommand{\dataframe}{\texttt{data.frame}}

\title{A Simple Article}
\author{Jared P. Lander\\ Lander Analytics}
% the \\ puts what follows on the next line
\date{April 14th, 2013}

\begin{document}

\maketitle

Some Content

\end{document}


Content can be split into sections using \section{Section Name}. All text following this command will be part of that section until another \section{...} is reached. Sections (and subsections and chapters) are automatically numbered by LATEX. If given a label using \label{...} they can be referred to using \ref{...}. The table of contents is automatically numbered and is created using \tableofcontents. We can now further build out our document with some sections and a table of contents. Normally, LATEX must be run twice for cross references and the table of contents but RStudio, and most other LATEX editors, will do that automatically.


\documentclass{article}
% this is a comment
% all content following a % on a line will be commented out as if it
% never existed to latex

\usepackage{graphicx} % use graphics
\DeclareGraphicsExtensions{.png,.jpg} % search for png then jpg

% define shortcut for dataframe
\newcommand{\dataframe}{\texttt{data.frame}}

\makeindex % enable the index; this command is only allowed in the preamble

\title{A Simple Article}
\author{Jared P. Lander\\ Lander Analytics}
% the \\ puts what follows on the next line
\date{April 14th, 2013}


\begin{document}
\maketitle % create the title page
\tableofcontents % build table of contents

\section{Getting Started}
\label{sec:GettingStarted}
This is the first section of our article. The only thing it will talk
about is building \dataframe{}s and not much else.

A new paragraph is started simply by leaving a blank line. That is all
that is required. Indenting will happen automatically.

\section{More Information}
\label{sec:MoreInfo}
Here is another section. In Section~\ref{sec:GettingStarted} we learned
some basics and now we will see just a little more. Suppose this section is
getting too long so it should be broken up into subsections.

\subsection{First Subsection}
\label{FirstSub}
Content for a subsection.

\subsection{Second Subsection}
\label{SecondSub}
More content that is nested in Section~\ref{sec:MoreInfo}

\section{Last Section}
\label{sec:LastBit}
This section was just created to show how to stop a preceding
subsection, section or chapter. Note that chapters are only available in
books, not articles.

\end{document}


While there is certainly a lot more to be learned about LATEX, this should provide enough of a start for using it with knitr. A great reference is the “Not So Short Introduction to LATEX,” which can be found at http://tobi.oetiker.ch/lshort/lshort.pdf.

23.3. Using knitr with LATEX

Writing a LATEX document with R code is fairly straightforward. Regular text is written using normal LATEX conventions and the R code is delineated by special commands. A block of R code, called a “chunk,” is preceded by <<label-name,option1='value1',option2='value2'>>= and is followed by @. While editing, RStudio colors the background of the editor according to what is being written, LATEX or R code, as seen in Figure 23.1.

Image

Figure 23.1 Screenshot of LATEX and R code in RStudio text editor. Notice that the code section is gray.

These documents are saved as .Rnw files. During the knitting process an .Rnw file is converted to a .tex file, which is then compiled to a PDF. If using the console, this is accomplished by calling the knit function, passing the .Rnw file as the first argument. In RStudio this is done by clicking the Compile PDF button in the toolbar or pressing Ctrl+Shift+I on the keyboard.
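The console route can be sketched as follows. This is a minimal illustration, assuming knitr is installed; the file name example.Rnw and its contents are hypothetical and are written to a temporary directory only so the example is self-contained. Note that knit itself produces the .tex file, while compiling that file to a PDF needs a working LATEX installation, so that step (here via knitr's knit2pdf) is left commented out.

```r
library(knitr)

# hypothetical .Rnw file, written to a temporary directory for illustration
rnw_file <- file.path(tempdir(), "example.Rnw")
writeLines(c(
    "\\documentclass{article}",
    "\\begin{document}",
    "<<simple-sum>>=",
    "1 + 1",
    "@",
    "\\end{document}"
), rnw_file)

# knit the .Rnw file into a .tex file; knit returns the path of the output
tex_file <- knit(rnw_file,
                 output = file.path(tempdir(), "example.tex"),
                 quiet = TRUE)

# compiling to PDF requires a LaTeX distribution to be installed
# knit2pdf(rnw_file)
```

Passing an explicit output path keeps the generated file next to the input rather than in the current working directory.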

Chunks are the workhorse of knitr and are essential to understand. A typical use is to show both the code and results. It is possible to do one or the other, or neither as well, but for now we will focus on getting code printed and evaluated. Suppose we want to illustrate loading ggplot2, viewing the head of the diamonds data, and then fitting a regression. The first step is to build a chunk.

<<diamonds-model>>=
# load ggplot
require(ggplot2)

# load and view the diamonds data
data(diamonds)
head(diamonds)

# fit the model
mod1 <- lm(price ~ carat + cut, data=diamonds)
# view a summary
summary(mod1)
@

This will then print both the code and the result in the final document as shown next.

> # load ggplot
> require(ggplot2)
>
> # load and view the diamonds data
> data(diamonds)
> head(diamonds)

  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

>
> # fit the model
> mod1 <- lm(price ~ carat + cut, data = diamonds)
> # view a summary
> summary(mod1)



Call:
lm(formula = price ~ carat + cut, data = diamonds)

Residuals:
     Min       1Q   Median       3Q      Max
-17540.7   -791.6    -37.6    522.1  12721.4

Coefficients:
            Estimate Std. Error  t value Pr(>|t|)
(Intercept) -2701.38      15.43 -175.061  < 2e-16 ***
carat        7871.08      13.98  563.040  < 2e-16 ***
cut.L        1239.80      26.10   47.502  < 2e-16 ***
cut.Q        -528.60      23.13  -22.851  < 2e-16 ***
cut.C         367.91      20.21   18.201  < 2e-16 ***
cut^4          74.59      16.24    4.593 4.37e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1511 on 53934 degrees of freedom
Multiple R-squared: 0.8565, Adjusted R-squared: 0.8565
F-statistic: 6.437e+04 on 5 and 53934 DF, p-value: < 2.2e-16

So far, the only thing supplied to the chunk was the label, in this case “diamonds-model.” It is best to avoid periods and spaces in chunk labels. Options can be passed to the chunk to control display and evaluation and are entered after the label, separated by commas. Some common knitr chunk options are listed in Table 23.2. These options can be strings, numbers, TRUE/FALSE or any R object that evaluates to one of these.

Image

Table 23.2 Common knitr Chunk Options
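Options that should apply to every chunk do not have to be repeated in each chunk header. As a minimal sketch, assuming knitr is installed, defaults can be set once through knitr's opts_chunk object, typically in a chunk near the top of the document. The particular values below are arbitrary choices for illustration, not recommendations.

```r
library(knitr)

# set defaults that apply to every chunk that follows
opts_chunk$set(
    echo = TRUE,          # print the code
    eval = TRUE,          # run the code
    fig.align = "center", # center any figures
    cache = FALSE         # do not cache results
)

# inspect the current default for a single option
opts_chunk$get("fig.align")
```

Any option given in an individual chunk header still takes precedence over these defaults for that chunk.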

Displaying images is made incredibly easy with knitr. Simply running a command that generates a plot inserts the image immediately following that line of code, with further code and results printed after that.

The following chunk will print 1 + 1 followed by the result, plot(1:10) followed by an image, and 2 + 2 followed by the result.

<<inline-plot>>=
1 + 1
plot(1:10)
2 + 2
@


> 1 + 1

[1] 2

> plot(1:10)

> 2 + 2

[1] 4

Image

Adding the fig.cap option will put the image in a figure environment, which gets placed in a convenient spot with a caption. Running the same chunk with fig.cap set to "Simple plot of the numbers 1 through 10." will display 1 + 1 followed by the result, plot(1:10), and then 2 + 2 followed by the result. The image, along with the caption, will be placed where there is room, which very well could be in between lines of code. Setting out.width to '.75\\linewidth' (including the quote marks) will make the image’s width 75% of the width of the line. While \linewidth is a LATEX command, because it is in an R string the backslash (\) needs to be escaped with another backslash. The resulting plot is shown in Figure 23.2.

<<figure-plot,fig.cap="Simple plot of the numbers 1 through 10.",fig.scap="Simple plot of the numbers 1 through 10",out.width='.75\\linewidth'>>=
1 + 1
plot(1:10)
2 + 2
@


> 1 + 1

[1] 2

> plot(1:10)


> 2 + 2

[1] 4

Image

Figure 23.2 Simple plot of the numbers 1 through 10.

This just scratches the surface of what is possible with LATEX and knitr. More information can be found on Yihui’s site at http://yihui.name/knitr/. When using knitr it is considered good form to use a formal citation of the form Yihui Xie (2013). knitr: A general-purpose package for dynamic report generation in R. R package version 1.2. Proper citations can be found, for some packages, using the citation function.

> citation(package = "knitr")

To cite the 'knitr' package in publications use:

  Yihui Xie (2013). knitr: A general-purpose package for
  dynamic report generation in R. R package version 1.4.1.

  Yihui Xie (2013) Dynamic Documents with R and knitr. Chapman
  and Hall/CRC. ISBN 978-1482203530

  Yihui Xie (2013) knitr: A Comprehensive Tool for
  Reproducible Research in R. In Victoria Stodden, Friedrich
  Leisch and Roger D. Peng, editors, Implementing Reproducible
  Computational Research. Chapman and Hall/CRC. ISBN
  978-1466561595

23.4. Markdown Tips

While LATEX is a great tool for composing a book or an article, an easier tool is Markdown, which is ideal for Web sites and presentations.[1] It is a lightweight markup language that converts to HTML and does away with the tedium typically involved in writing a Web page. There is also much less structure in Markdown than in LATEX, meaning less control but easier writing.

[1] LATEX can produce presentations using Beamer but Markdown slide shows, as seen in Section 23.6, are quicker to build and allow for more interactivity.

New paragraphs are created by leaving a blank line between blocks of text. Italics can be generated by putting an underscore (_) on both sides of a word, and bold is generated by putting two underscores on each side. Lists are created by putting each element on its own line starting with an asterisk (*). Text is made a header by starting a line with a pound symbol (#), the number of pounds indicating the header level.

Links are created by putting the text to be displayed in square brackets ([ ]) and the linked URL in parentheses. Inserting images is also done with square brackets and parentheses and preceded by an exclamation mark (!). A sample Markdown document is shown next.


# Title - Also a Header 1

_this will be italicized_

__this will be bolded__

## Header 2

Build a list

* Item 1
* Item 2
* Item 3

This is a link

[My Website](http://www.jaredlander.com)

## Another Header 2

This inserts an image

![Alt text goes in here](location-of-image.png)

#### Header 4


RStudio provides a handy quick reference guide to Markdown, accessed from a button in the editor toolbar.

23.5. Using knitr and Markdown

The workflow for writing Markdown documents is similar to that for LATEX documents: Normal text (flavored with Markdown) is written and R code is put in chunks. The style of the chunks is different but the idea is the same. A file that contains both Markdown and R code is saved as an .Rmd file, and then knitted to a Markdown file (.md), which is compiled to an HTML file. In the console this is done with the knit function, and in RStudio with the Knit button or Ctrl+Shift+H.

Chunks for Markdown documents start with ```{r label-name,option1='value1',option2='value2'} and end with ```. Everything else is the same, except that HTML conventions apply, such as out.width='75%' as opposed to out.width='.75\\linewidth'. Following is the same chunk from earlier, but modified to meet the conventions needed for a Markdown document.

```{r figure-plot,fig.cap="Simple plot of the numbers 1 through 10.",fig.scap="Simple plot of the numbers 1 through 10",out.width='75%'}
1 + 1
plot(1:10)
2 + 2
```
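Knitting an .Rmd file from the console follows the same pattern as for .Rnw files. The following is a minimal sketch, assuming knitr is installed; the file example.Rmd and its contents are hypothetical and are written to a temporary directory for illustration. The three backticks that open and close the chunk are built with strrep only to keep the example easy to read.

```r
library(knitr)

# three backticks, used to open and close a Markdown chunk
fence <- strrep("`", 3)

# hypothetical .Rmd file, written to a temporary directory for illustration
rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(c(
    "# A Small Report",
    "",
    paste0(fence, "{r simple-sum}"),
    "1 + 1",
    fence
), rmd_file)

# knit the .Rmd file into a plain Markdown (.md) file
md_file <- knit(rmd_file,
                output = file.path(tempdir(), "example.md"),
                quiet = TRUE)

# view the generated Markdown, which holds both the code and its result
readLines(md_file)
```

The resulting .md file is ordinary Markdown, so it can be handed to any Markdown converter to produce HTML.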

23.6. pandoc

Creating reproducible presentations without leaving the friendly confines of the R environment has long been possible using LATEX’s Beamer mode, which creates a PDF where each page is a slide. However, writing all that LATEX code can be unnecessarily time consuming. A simpler option is to write a Markdown document and compile it into an HTML5 slide show using pandoc, a great conversion utility written by John MacFarlane that is used from the command line.

Before it can be used, pandoc must be downloaded and installed from http://johnmacfarlane.net/pandoc/installing.html.

Pandoc can be used to convert files from one type to another. In our example we convert from Markdown to HTML5, in particular the slidy slide show format. (Other slide formats, such as s5, dzslides and slideous, are available.)

Slides are indicated by the header command (#), which also provides the slide title. While there are varying levels of headers, the highest level header in the deck that is immediately followed by content is used for slide titles. This can be overridden by setting the --slide-level option when calling pandoc, which will be seen later. An example scenario would be using header 1 (#) to create sections, header 2 (##) to create subsections and header 3 (###) to create slides.

The first three lines of the Markdown file should each start with a percent symbol (%). The first is the title of the talk, the second is the author’s name and the third is the date. These are used to create the title slide.

Aside from these caveats, and a few others, regular Markdown should be used. An example slide show code follows.


% Example Slideshow
% Jared P. Lander
% April 14th, 2013

# First Section

### First Slide in First Section
A list of things to cover

* First Item
* Second Item
* Third Item

### Some R Code
The code below will generate some results and a plot.

```{r figure-plot,fig.cap="Simple plot of the numbers 1 through 10.",fig.scap="Simple plot of the numbers 1 through 10",out.width='50%',fig.show='hold'}
1 + 1
plot(1:10)
2 + 2
```

# Second Section

## First Subsection

### Another Slide
Some more information goes here

## Second Subsection

### Some Links
[My Website](http://www.jaredlander.com)

[R Bloggers](http://www.r-bloggers.com)


Running knit on this file, or pressing the Knit button or Ctrl+Shift+H, creates both an .md file and an .html file. Pandoc should be used on the .md file, which we will call example.md, with the following line of code from the command line.

pandoc -s -S --toc -t slidy --self-contained \
    --slide-level 3 example.md -o output.html

This calls pandoc on example.md and creates output.html with a number of options. -s builds a stand-alone file, -S runs it in smart mode, --toc creates a table of contents, -t slidy makes the final product a slidy slide show, --self-contained puts all of the content into a single HTML file with no other files needed (even images are encoded directly into the file), --slide-level 3 means header 3 creates new slides, example.md specifies the input file and -o output.html provides the name for the output file.

This two-step process of generating the knitted Markdown file using knit (or the button or keyboard shortcut) and then going to the command line to run the preceding pandoc command can be tedious and error prone. Fortunately, at least for RStudio users, an option can be set to make this a one-step process. The following change to the R options makes the Knit button use pandoc for the conversion from Markdown to HTML.

> options(rstudio.markdownToHTML = function(inputFile, outputFile)
+ {
+ system(paste(
+ "pandoc -s -S --webtex --toc -t slidy --self-contained --slide-level 3",
+ shQuote(inputFile), "-o", shQuote(outputFile))
+ )
+ }
+ )

Now using the Knit button goes straight to the slide show format, which will even show up in the RStudio preview window.
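Outside of RStudio, the same conversion can be driven directly from R. The following is a minimal sketch that mirrors the pandoc command shown earlier, using the same example.md and output.html file names. It leaves out -S because pandoc 2.0 and later removed that flag in favor of the smart extension, and it only calls pandoc when both the program and the input file are found.

```r
# arguments mirroring the earlier pandoc command, without -S
pandoc_args <- c("-s", "--toc", "-t", "slidy", "--self-contained",
                 "--slide-level", "3",
                 "example.md", "-o", "output.html")

# only call pandoc if it is installed and the input file exists
if (nzchar(Sys.which("pandoc")) && file.exists("example.md")) {
    system2("pandoc", args = pandoc_args)
}
```

Passing the arguments as a character vector to system2 avoids pasting together one long command string.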

Another alternative to using pandoc is the slidify package, written by Ramnath Vaidyanathan from McGill University. It uses a somewhat different syntax than pandoc but has a lot more power, and it even automatically changes the functionality of the Knit button in RStudio. Chunks of R code are still written as usual.

23.7. Conclusion

Writing reproducible and maintainable documents and slide shows from within R has never been easier, thanks to Yihui’s knitr package. It allows seamless integration of R code and its results, including images, with either LATEX or Markdown text.

On top of that, the RStudio IDE is a fantastic text editor. This entire book was written using knitr from within RStudio, without ever having to use Microsoft Word or a LATEX editor.
