Preface

We import a dataset into a statistical software package, run a procedure to get all results, then copy and paste selected pieces into a typesetting program, add a few descriptions, and finish a report. This is a common practice in writing statistical reports. There are obvious dangers and disadvantages in this process.

1.  It is error-prone due to too much manual work.

2.  It requires lots of human effort to do tedious jobs such as copying results across documents.

3.  The workflow is hard to record, especially when it involves GUI (graphical user interface) operations, so it is difficult to reproduce.

4.  A tiny change of the data source in the future will require the author(s) to go through the same procedure again, which can take nearly the same amount of time and effort.

5.  The analysis and writing are separate, so close attention has to be paid to the synchronization of the two parts.

In fact, a report can be generated dynamically from program code. Just like a software package has its source code, a dynamic document is the source code of a report. It is a combination of computer code and the corresponding narratives. When we compile the dynamic document, the program code in it is executed and replaced with the output; we get a final report by mixing the code output with the narratives. Because we only manage the source code, we are free of all the possible problems above. For example, we can change a single parameter in the source code, and get a different report on the fly.
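As a rough illustration of this idea, the sketch below builds a tiny source document in memory and compiles it with the knit() function from the knitr package. The variable n and the text of the document are hypothetical and chosen only for this example; it assumes knitr is installed.

```r
library(knitr)

# A tiny dynamic document: one line of narrative with an inline R expression,
# followed by one code chunk. `n` is a hypothetical parameter for illustration.
src <- c(
  "The sample size is `r n`.",
  "",
  "```{r}",
  "mean(seq_len(n))",
  "```"
)

# The single parameter that drives the report.
n <- 10

# Compile the source: the code is executed and replaced with its output,
# and the result is returned as a character string.
report <- knit(text = src, quiet = TRUE)
cat(report)
```

Changing n and compiling again gives a different report without any copying and pasting.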

In this book, dynamic documents refer to the kind of source documents containing both program code and narratives. Sometimes we may just call them source documents since “dynamic” may sound confusing and ambiguous to some people (it does not mean interactivity or animations). We also use the term report frequently throughout the book, which really means the output document that was compiled from a dynamic document.

Who Should Read This Book

This book is written for both beginners and advanced users. The main goal is to make writing reports easier: the “report” here can range from student homework or project reports, exams, books, blogs, and Web pages to virtually any documents related to statistical graphics, computing, and data analysis.

For beginners, Chapters 1 to 8 should be enough for basic applications (they already cover many features); for power users, Chapters 9 to 11 can be helpful for understanding the extensibility of the knitr package.

Familiarity with LaTeX and HTML can be helpful, but is not required at all. Once you get the basic idea, you can write reports in simple languages such as Markdown, which should be fairly easy for beginners to learn. Unless otherwise noted, all features apply to all document formats, although we primarily use LaTeX for examples.

We recommend that readers take a look at the website RPubs (http://rpubs.com), which contains a large number of user-contributed documents. Hopefully they are convincing enough to show that it is quick and easy to write dynamic documents.

Software Information and Conventions

The main tools we introduce in this book are the R language (R Core Team, 2015) and the knitr package (Xie, 2015b), with which this book was written, but the language in the documents is not restricted to R; for example, we can also integrate Python, awk, and shell scripts into the reports. For document formats, we mainly use LaTeX, HTML, and Markdown.

Both R and knitr are available on CRAN (Comprehensive R Archive Network) as free and open-source software. You may download them from any CRAN mirrors, such as http://cran.rstudio.com. You can find their version information for this book in the R session information below:

[R session information: the printed output was an image and is not recoverable in this copy.]

The knitr package is thoroughly documented on the website http://yihui.name/knitr/, and the most important page is perhaps http://yihui.name/knitr/options, where you can find the complete reference for chunk options (Section 5.1.1). The development version is hosted on GitHub: https://github.com/yihui/knitr; you can always check out the latest development version, file issues or feature requests, or even participate in the development by forking the repository and making changes yourself. There are plenty of examples in the repository https://github.com/yihui/knitr-examples, including both minimal and advanced examples. Karl Broman prepared a very nice minimal tutorial for knitr at http://kbroman.org/knitr_knutshell, which can be useful for beginners to learn knitr quickly. There is also a wiki page maintained by Frank Harrell et al. from the Department of Biostatistics, Vanderbilt University, which shares several tricks and useful experience in using knitr: http://biostat.mc.vanderbilt.edu.

Unlike many other books on R, we do not add prompts to R source code in this book, and we comment out the text output by two hashes ## by default, as you can see from the R session information before. The reason for this convention is explained in Chapter 6. Package names are in bold text (e.g., rpart), function names in italic (e.g., paste()), inline code is formatted in a typewriter font (e.g., mean(1:10, trim = 0.1)), and filenames are in sans serif fonts (e.g., figure/foo.pdf).
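As a small illustration of this convention, here is the inline example above written the way R code and its output appear in this book: the code has no prompt, and the printed result is commented out with two hashes. The variable name x is only for illustration.

```r
# Trimmed mean of 1 to 10, dropping 10% of the values from each end
x <- mean(1:10, trim = 0.1)
x
## [1] 5.5
```

Because the output lines are comments, the whole block can be copied and run as is.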

Structure of the Book

Chapter 1 is an overview of dynamic documents, introducing the idea of literate programming; Chapter 2 explains why dynamic documents are important to scientific research from the viewpoint of reproducible research; Chapter 3 gives a first complete example that covers basic concepts and what we can do with knitr; Chapter 4 introduces a few common text editors that support knitr, so that it is easier to compile reports from source documents; and Chapter 5 describes the syntax for different document formats such as LaTeX, HTML, and Markdown.

Chapters 6 to 11 explain the core functionality of the package. Chapters 6 and 7 present how to control text and graphics output from knitr. Chapter 8 talks about the caching mechanism that may significantly reduce the computation time. Chapter 9 shows how to reuse source code by chunk references and organize child documents. Chapter 10 covers an advanced topic, chunk hooks, which make a literate programming document really programmable and extensible. Chapter 11 illustrates how to integrate other languages, such as Python and awk, into one report in the knitr framework.

Chapter 12 introduces some useful tricks that make it easier to write documents with knitr. Chapter 13 shows how to publish reports in a variety of formats including PDF, HTML, and HTML5 slides. Chapter 14 focuses on R Markdown v2, which can be converted to a large variety of document formats, including those in Chapter 13. Chapter 15 covers a few significant applications. Chapter 16 introduces other tools for dynamic report generation, such as Sweave, other R packages, and software in other languages. Appendix A is a guide to some internal structures of knitr, which may be helpful to other package developers.

The topics from Chapters 6 to 11 are parallel to each other. For example, if you want to know more about graphics output, you can skip Chapter 6 and jump to Chapter 7 directly.

In all, we will show how to improve our efficiency in writing reports, fine tune every aspect of a report, and go from program output to publication-quality reports.

What’s New in the Second Edition

The major new content in the second edition of this book is Chapter 14, which is an introduction to R Markdown v2. Then there are a few new sections: 6.3 (how to generate tables), 6.4 (how to define custom printing methods for objects in code chunks), 11.2.2 (the C/Fortran engines), 11.2.4 (the Stan engine), 11.3 (how to run engines in a persistent session), and 15.2 (how to start a local server to serve dynamic documents). There are many minor updates here and there in the book as well.

The second edition also introduces several changes according to the changes in the knitr package (the first edition was based on knitr 1.3).

•  The default value of the chunk option tidy was changed from TRUE to FALSE, i.e., code chunks will not be automatically reformatted by default (Section 6.2.2).

•  Inline R expressions are evaluated without try(), i.e., if an error occurs during the inline evaluation, R will stop immediately.

•  The global R option digits is no longer modified in knitr; its default value is 7, and you can set options(digits = 4) if you want the old behavior.

•  The plot hook function takes the plot filename as its first argument (Section 5.3), instead of a vector of length two (basename and extension).

•  The preferred way to stop knitr in case of errors is to set the chunk option error = FALSE instead of the package option stop_on_error, which has been deprecated (Section 6.2.4).

•  Syntax highlighting is also available for other languages (Chapter 11), such as shell scripts, awk, and Python, if the Highlight package is installed (Section 11.2.7).

•  For external code chunks (Section 9.2), the preferred chunk delimiter is now ## ---- instead of ## @knitr.
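For readers moving from the first edition, a few of the changes above can be sketched in a setup step like the following. This is only an illustration of the options named in the list, not a recommended configuration; it assumes knitr is installed.

```r
library(knitr)

# Reformat code chunks automatically, as was the default in the first edition
# (the tidy option relies on the formatR package when documents are knitted).
opts_chunk$set(tidy = TRUE)

# Stop knitting as soon as a code chunk signals an error.
opts_chunk$set(error = FALSE)

# knitr no longer changes the global digits option; set it yourself
# if you want the old behavior.
options(digits = 4)
pi
## [1] 3.142
```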

To keep track of the changes in knitr, you can see the release notes for each version at https://github.com/yihui/knitr/releases.

Acknowledgments

First, I want to thank my wireless router, which was broken when I started writing the core chapters of the first edition of this book (in the boring winter of Ames). Besides, I also thank my wife for not giving me the Ethernet cable during that period.

This book would certainly not have been possible without the powerful R language, for which I thank the R core team and its contributors. The seminal work of Sweave (by Friedrich Leisch and R-core) is the most important source of inspiration for knitr. Some additional features were inspired by other R packages including cacheSweave (Roger Peng), pgfSweave (Cameron Bracken and Charlie Sharpsteen), weaver (Seth Falcon), SweaveListingUtils (Peter Ruckdeschel), highlight (Romain Francois), and brew (Jeffrey Horner). The initial design was based on Hadley Wickham’s decumar package, and the evaluator is based on his evaluate package. Both LyX and RStudio quickly included support for knitr after it came out, which made it a lot easier to write source documents, and I’d like to thank their developers (especially Jean-Marc Lasgouttes, JJ Allaire, and Joe Cheng); similarly I thank the developers of other editors such as Emacs/ESS. I do not know how to describe John MacFarlane’s Pandoc. It is magic. “Yes, we do support Word! Welcome to the world of reproducible research!”

The R/knitr user community is truly amazing. There has been a lot of feedback since the beginning of its development in late 2011. I still remember that some users shouted it from the rooftops when I released the first beta version. I appreciate this kind of excitement. Thousands of questions and comments in the mailing list (https://groups.google.com/group/knitr) and on the website StackOverflow (http://stackoverflow.com/tags/knitr/) made this package far more powerful than I imagined. The development repository is on GitHub, where I have received nearly 800 issues and more than 160 pull requests from many contributors, including Ramnath Vaidyanathan, Taiyun Wei, Kirill Müller, and JJ Allaire (https://github.com/yihui/knitr/pulls).

I thank my PhD advisors at Iowa State University, Di Cook and Heike Hofmann, for their open-mindedness and consistent support for my research in this “non-classical” area of statistics. I also thank RStudio (http://www.rstudio.com) for providing me the freedom to work on the second edition of this book.

Lastly, I thank the reviewers Frank Harrell, Douglas Bates, Carl Boettiger, Joshua Wiley, Scott Kostyshak, and Jim Robison-Cox for their valuable advice on improving the quality of this book (which is the first book of my career), and I’m grateful to my editor John Kimmel, without whom I would not have been able to publish my first book quickly.

Yihui Xie
Ames, Iowa
