Chapter 3: The Fundamentals of R

In This Chapter

Using functions and arguments

Making code clear and legible

Extending R with user packages

Before you start discovering the different ways you can use R on your data, you need to know a few more fundamental things about R.

In Chapter 2, we show you how to use the command line and work with the workspace, so if you’ve read that chapter, you can write a simple script and use the print(), paste(), and readline() functions — at least in the most basic way. But functions in R are more complex than that, so in this chapter we tell you how to get the most out of your functions.

As you add more arguments to your functions and more functions to your scripts, those scripts can become pretty complex. To keep your code clear — and yourself sane — you can follow the basic organizational principles we cover in this chapter.

Finally, much of R allows you to use other people’s code very easily. You can extend R with packages that have been contributed to the R community by hundreds of developers. In this chapter, we tell you where you can find these packages and how you can use them in R.

Using the Full Power of Functions

For every action you want to take in R, you use a function. In this section, we show you how you can use them the smart way. We start by telling you how to use them in a vectorized way (basically, allowing your functions to work on a whole vector of values at the same time, instead of just a single value). Then we tell you how you can reach a whole set of functionalities in R functions with arguments. Finally, we tell you how you can save the history of all the commands you’ve used in a session with (you guessed it!) a function.

Vectorizing your functions

Vectorized functions are a very useful feature of R, but programmers who are used to other languages often have trouble with this concept at first. A vectorized function works not just on a single value, but on a whole vector of values at the same time. Your natural reflex as a programmer may be to loop over all values of the vector and apply the function, but vectorization makes that unnecessary. Trust us: When you start using vectorization in R, it’ll help simplify your code.

To try vectorized functions, you have to make a vector. You do this by using the c() function, which stands for concatenate. The actual values are separated by commas.

Here’s an example: Suppose that Granny plays basketball with her friend Geraldine, and you keep a score of Granny’s number of baskets in each game. After six games, you want to know how many baskets Granny has made so far this season. You can put these numbers in a vector, like this:

> baskets.of.Granny <- c(12,4,4,6,9,3)

> baskets.of.Granny

[1] 12  4  4  6  9  3

To find the total number of baskets Granny made, you just type the following:

> sum(baskets.of.Granny)

[1] 38

You could get the same result by going over the vector number by number, adding each new number to the sum of the previous numbers, but that method would require you to write more code and it would take longer to calculate. You won’t notice it on just six numbers, but the difference will be obvious when you have to sum a few thousand of them.

In this example of vectorization, a function uses the complete vector to give you one result. Granted, this example is trivial (you may have guessed that sum() would accomplish the same goal), but for other functions in R, the vectorization may be less obvious.
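
To see the difference for yourself, here’s a small sketch that computes Granny’s total both ways. The object names loop.total and vectorized.total are made up for this illustration; they don’t appear elsewhere in the chapter.

```r
# Granny's baskets, as defined in the text
baskets.of.Granny <- c(12, 4, 4, 6, 9, 3)

# The loop a programmer coming from another language might write
loop.total <- 0
for (baskets in baskets.of.Granny) {
  loop.total <- loop.total + baskets
}

# The vectorized call does the same work in one step
vectorized.total <- sum(baskets.of.Granny)

# Both approaches give the same number
identical(loop.total, vectorized.total)
```

Both objects hold 38, so identical() returns TRUE. The loop needs four lines where sum() needs one.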

A less obvious example of a vectorized function is the paste() function. If you make a vector with the first names of the members of your family, paste() can add the last name to all of them with one command, as in the following example:

> firstnames <- c("Joris", "Carolien", "Koen")

> lastname <- "Meys"

> paste(firstnames,lastname)

[1] "Joris Meys"    "Carolien Meys" "Koen Meys"

R takes the vector firstnames and then pastes the lastname into each value. How cool is that? Actually, R combines two vectors. The second vector — in this case, lastname — is only one value long. That value gets recycled by the paste() function as long as necessary (for more on recycling, turn to Chapter 4).

You also can give R two longer vectors, and R will combine them element by element, like this:

> authors <- c("Andrie","Joris")

> lastnames <- c("de Vries","Meys")

> paste(authors,lastnames)

[1] "Andrie de Vries" "Joris Meys"

No complicated code is needed. All you have to do is make the vectors and put them in the function. In Chapter 5, we give you more information on the power of paste().
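
If you want to watch the recycling happen, the following sketch may help. It assumes nothing beyond base R; the home and away labels are invented for the illustration and aren’t part of Granny’s actual record.

```r
# Granny's baskets, as defined earlier in the chapter
baskets.of.Granny <- c(12, 4, 4, 6, 9, 3)

# One value ("Game") is recycled across six numbers
game.labels <- paste("Game", 1:6)

# Two values are recycled three times each across the six labels
# (the home/away pattern is hypothetical)
venues <- paste(game.labels, c("home", "away"))

# Three vectors of lengths 6, 1, and 6 are combined element by element;
# the numbers are converted to text along the way
score.lines <- paste(game.labels, "-", baskets.of.Granny)

venues
score.lines
```

In each call, the shorter vectors are simply repeated until they’re as long as the longest one.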

Putting the argument in a function

Most functions in R have arguments that give them more information about exactly what you want them to do. If you use print("Hello world!"), you give the argument x of the print() function a value: "Hello world!". Indeed, the first argument of the print() function is called x. You can check this yourself by looking at the Help file of print().

In R, you have two general types of arguments:

Arguments with default values

Arguments without default values

If an argument has no default value, the value may be optional or required. In general, the first argument is almost always required. Try entering the following:

> print()

R tells you that it needs the argument x specified:

Error in .Internal(print.default(x, digits, quote, na.print, print.gap, : ‘x’ is missing

You can specify an argument like this:

> print(x = "Isn't this fun?")

Sure it is. But wait: when you entered the print("Hello world!") command in Chapter 2, you didn’t add the name of the argument, and the function worked. That’s because R knows the names of the arguments and just assumes that you give them in exactly the same order as they’re shown in the usage line of the Help page for that function. (For more information on reading the Help pages, turn to Chapter 11.)

If you type the values for the arguments in Help-page order, you don’t have to specify the argument names. You can list the arguments in any order you want, as long as you specify their names.
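
Here’s a small sketch of both ways of passing arguments. It uses substr(), which extracts part of a string and has the arguments x, start, and stop. We picked substr() only as an illustration, and the object names are made up; the same matching rules apply to other functions.

```r
# Arguments in Help-page order: x, start, stop
by.position <- substr("Geraldine", 1, 4)

# The same call with every argument named, in a different order
by.name <- substr(stop = 4, x = "Geraldine", start = 1)

# Mixing both: named arguments are matched first, and the
# remaining values fill the open slots in Help-page order
mixed <- substr("Geraldine", stop = 4, start = 1)

identical(by.position, by.name)
identical(by.position, mixed)
```

All three calls return the first four letters of the name, so both comparisons give TRUE.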

Try entering the following example:

> print(digits=4, x = 11/7)

[1] 1.571

You may wonder where the digits argument comes from, because it’s not explained in the Help page for print(). That’s because it isn’t an argument of the print() function itself, but of the function print.default(). Take a look again at the error you got if you typed print(). R mentions the print.default() function instead of the print() function.

In fact, print() is called a generic function. It determines the type of the object that’s given as an argument and then looks for a function that can deal with this type of object. That function is called the method for the specific object type. In case there is no specific function, R will call the default method. This is the function that works on all object types that have no specific method. In this case, that’s the print.default() function. Keep in mind that a default method doesn’t always exist. We explain this in more detail in Chapter 8. For now, just remember that arguments for a function can be shown on the Help pages of different methods.

If you forgot which arguments you can use, you can find that information in the Help files. Don’t forget to look at the arguments of specific methods as well. You often find a link to those specific methods at the bottom of the Help page.
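
To make the idea of generic functions and methods a bit more concrete, here’s a minimal sketch. The class name scorecard and the method print.scorecard() are hypothetical and exist only in this example; Chapter 8 explains the real mechanics.

```r
# An object with a made-up class called "scorecard"
season <- structure(
  list(player = "Granny", baskets = c(12, 4, 4, 6, 9, 3)),
  class = "scorecard"
)

# A method for print(): the name has the form generic.class
print.scorecard <- function(x, ...) {
  cat(x$player, "scored", sum(x$baskets), "baskets\n")
  invisible(x)
}

# print() finds the scorecard method and uses it
print(season)

# Without the class, no specific method applies,
# so print() falls back on print.default()
print(unclass(season))
```

The first call prints one tidy sentence, and the second shows the plain list. Nothing about the call itself changes; only the type of the object decides which method runs.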

Making history

By default, R keeps track of all the commands you use in a session. This tracking can come in handy if you need to reuse a command you used earlier or want to keep track of the work you did before. These previously used commands are kept in the history.

You can browse the history from the command line by pressing the up-arrow and down-arrow keys. When you press the up-arrow key, you get the commands you typed earlier at the command line. You can press Enter at any time to run the command that is currently displayed.

Saving the history is done using the savehistory() function. By default, R saves the history in a file called .Rhistory in your current working directory. This file is automatically loaded again the next time you start R, so you have the history of your previous session available.

If you want to use another filename, use the argument file like this:

> savehistory(file = "Chapter3.Rhistory")

Be sure to add the quotation marks around the filename.

You can open an Explorer window and take a look at the history by opening the file in a normal text editor, like Notepad.

You don’t need to use the file extension .Rhistory — R doesn’t care about extensions that much. But using .Rhistory as a file extension will make it easier to recognize as a history file.

If you want to load a history file you saved earlier, you can use the loadhistory() function. This will replace the history with the one saved in the .Rhistory file in the current working directory. If you want to load the history from a specific file, you use the file argument again, like this:

> loadhistory("Chapter3.Rhistory")

Keeping Your Code Readable

You may wonder why you should bother about reading code. You wrote the code yourself, so you should know what it does, right? You do now, but will you be able to remember what you did if you have to redo that analysis six months from now on new data? Besides, you may have to share your scripts with other people, and what seems obvious to you may be far less obvious for them.

Some of the rules you’re about to see aren’t that strict. In fact, you can get away with almost anything in R, but that doesn’t mean it’s a good idea. In this section, we explain why you should avoid some constructs even though they aren’t strictly wrong.

Following naming conventions

R is very liberal when it comes to names for objects and functions. This freedom is a great blessing and a great burden at the same time. Nobody is obliged to follow strict rules, so everybody who programs something in R can basically do as he or she pleases.

Choosing a correct name

Although almost anything is allowed when giving names to objects, there are still a few rules in R that you can’t ignore:

Names must start with a letter or a dot. If you start a name with a dot, the second character can’t be a digit.

Names should contain only letters, numbers, underscore characters (_), and dots (.). Although you can force R to accept other characters in names, you shouldn’t, because these characters often have a special meaning in R.

You can’t use the following special keywords as names:

• break

• else

• FALSE

• for

• function

• if

• Inf

• NA

• NaN

• next

• NULL

• repeat

• TRUE

• while

R is case sensitive, which means that, for R, lastname and Lastname are two different objects. If R tells you it can’t find an object or function and you’re sure it should be there, check to make sure you used the right case.
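
If you’re not sure whether a name follows the rules, you can let R check it. The following sketch uses the base function make.names(), which turns any string into a valid name; a candidate is valid as it stands if make.names() leaves it unchanged. The candidate names themselves are invented for this example.

```r
# Some candidate names, valid and invalid
candidates <- c("baskets.of.Granny", "total_baskets", ".hidden",
                "2nd.game", ".5x", "total baskets", "for")

# A name is valid if make.names() doesn't need to change it
is.valid <- make.names(candidates) == candidates
data.frame(candidates, is.valid)

# Case matters: these are two different names
lastname <- "Meys"
exists("lastname")
exists("Lastname")
```

The first three candidates pass. The others fail because they start with a digit, start with a dot followed by a digit, contain a space, or are a reserved keyword.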

Choosing a clear name

When Joris was young, his parents bought a cute little lamb that needed a name. After much contemplation, he decided to call it Blacky. Never mind that the lamb was actually white and its name made everybody else believe that it was a dog; Joris thought it was a perfect name.

Likewise, calling the result of a long script Blacky may be a bit confusing for the person who has to read your code later on, even if it makes all kinds of sense to you. Remember: You could be the one who, in three months, is trying to figure out exactly what you were trying to achieve. Using descriptive names will allow you to keep your code readable.

Although you can name an object whatever you want, some names will cause less trouble than others. You may have noticed that none of the functions we’ve used until now are mentioned as being off-limits (see the preceding section). That’s right: If you want to call an object paste, you’re free to do so:

> paste <- paste("This gets","confusing")

> paste

[1] "This gets confusing"

> paste("Don't","you","think?")

[1] "Don't you think?"

R will always know perfectly well when you want the vector paste and when you need the function paste(). That doesn’t mean it’s a good idea to use the same name for both items, though. If you can avoid giving the name of a function to an object, you should.

One situation in which you can really get into trouble is when you use capital F or T as an object name. You can do it, but you’re likely to break code at some point. Although it’s a very bad idea, T and F are too often used as abbreviations for TRUE and FALSE, respectively. But T and F are not reserved keywords; they’re ordinary objects in base R that happen to hold the values TRUE and FALSE. So, if you create your own object called T, R finds your object first and never gets to the one that holds TRUE. Any code that still expects T to mean TRUE will fail from this point on. Never use F or T, not as an object name and not as an abbreviation.
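
The following sketch shows the problem in action. The object names before, masked, restored, and failed are made up for the illustration.

```r
# T starts out as an ordinary object in base R that holds TRUE
before <- isTRUE(T)

# Nothing stops you from creating your own T
T <- 0
masked <- isTRUE(T)      # R now finds your T first

# Removing your copy makes the original visible again
rm(T)
restored <- isTRUE(T)

# TRUE itself is a reserved keyword, so this assignment fails
attempt <- try(eval(parse(text = "TRUE <- 0")), silent = TRUE)
failed <- inherits(attempt, "try-error")

c(before = before, masked = masked, restored = restored, failed = failed)
```

The result is TRUE, FALSE, TRUE, TRUE: T can be overwritten and restored, but R refuses to let you assign anything to TRUE.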

Choosing a naming style

If you have experience in programming, you’ve probably heard of camel case before. It’s a way of giving longer names to objects and functions. You capitalize the first letter of every word after the first one in the name to improve the readability. So, you can have a veryLongVariableName and still be able to read it.

Contrary to many other languages, R doesn’t use the dot (.) as an operator, so the dot can be used in names for objects as well. This style is called dotted style, where you write everything in lowercase and separate words or terms in a name with a dot. In fact, in R, many function names use dotted style. You’ve met a function like this earlier in the chapter: print.default(). Some package authors also use an underscore instead of a dot.

print.default() is the default method for the print() function. Information on the arguments is given on the Help page for print.default().

Consistent inconsistency

You would expect the function to be called save.history(), but it’s called savehistory() without the dot. Likewise, you would expect a function R.version(), but instead it’s R.Version(). R.Version() will give you all the information on the version of R you’re running, including the platform you’re running it on. Sometimes, the people writing R use camel case: If you want to get only the version number of R, you have to use the function getRversion(). Some package authors choose to use underscores (_) instead of dots for separation of the words; this style is used often within some packages we discuss later in this book (for example, the ggplot2 package in Chapter 18).

You’re not obligated to use dotted style; you can use whatever style you want. We use dotted style throughout this book for objects and camel case for functions. R uses dotted style for many base functions and objects, but because some parts of the internal mechanisms of R rely on that dot, you’re safer to use camel case for functions. Whenever you see a dot, though, you don’t have to wonder what it does — it’s just part of the name.

The whole naming issue reveals one of the downsides of using open-source software: It’s written by very intelligent and unselfish people with very strong opinions, so the naming of functions in R is far from standardized.

Structuring your code

Names aren’t the only things that can influence the readability of your code. When you start nesting functions or perform complex calculations, your code can turn into a big mess of text and symbols rather quickly. Luckily, you have some tricks to clear up your code so you can still decipher what you did three months down the road.

Nesting functions and doing complex calculations can lead to very long lines of code. If you want to make a vector with the names of your three most beloved song titles, for example, you’re already in for trouble. Luckily, R lets you break a line of code over multiple lines in your script, so you don’t have to scroll to the right the whole time.

You don’t even have to use a special notation or character. R will know that the line isn’t finished as long as you give it some hint. Generally speaking, you have to make sure the command is undoubtedly incomplete. There are several ways to do that:

You can use a quotation mark to start a string. R will take all the following input — including the line breaks — as part of the string, until it meets the matching second quotation mark.

You can end the incomplete line with an operator (like +, /, <-, and so on). R will know that something else must follow. This lets you create structure in longer calculations.

You can open a parenthesis for a function. R will read all the input it gets as one line until it meets the matching parenthesis. This allows you to line up arguments below a function, for example.

The following little script shows all these techniques:

baskets.of.Geraldine <-
    c(5,3,2,2,12,9)
Intro <- "It is amazing! The All Star Grannies scored
a total of"
Outro <- "baskets in the last six games!"
Total.baskets <- baskets.of.Granny +
    baskets.of.Geraldine
Text <- paste(Intro,
    sum(Total.baskets),
    Outro)
cat(Text)

You can copy this code into a script file and run it in the console. If you run this little snippet of code, you see the following output in the console:

It is amazing! The All Star Grannies scored

a total of 71 baskets in the last six games!

This immediately shows what the cat() function does. It prints whatever you give it as an argument directly to the console. It also interprets special characters like line breaks and tabs. If you look at the vector Text, you would see this:

> Text

[1] "It is amazing! The All Star Grannies scored\na total of 71 baskets in the last six games!"

The \n represents the line break. Even though it’s pasted to the a, R will recognize \n as a separate character. (You can find more information on special characters in Chapter 12.)
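
You can check this behavior with a shorter string. The following sketch uses a made-up object called msg and compares how print() and cat() treat the same line break.

```r
# A short string that contains one line break
msg <- "scored\na total of"

print(msg)                 # shows the string as stored, with \n visible
cat(msg, "\n", sep = "")   # interprets \n and prints two lines

# \n counts as one character, not two
nchar("\n")
nchar(msg)
```

Here nchar("\n") returns 1, and nchar(msg) returns 17: six letters, one line break, and ten more characters.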

All this also works at the command line. If you type an unfinished command, R will change the prompt to a + sign, indicating that you can continue to type your command:

> cat('If you doubt whether it works,

+ just try it out.')

If you doubt whether it works,

just try it out.

RStudio automatically adds a line break at the end of a cat() statement if there is none, but R doesn’t do that. So, if you don’t use RStudio, remember to add a line break (or the symbol \n) at the end of your string.

Adding comments

Often, you want to add a bit of extra information to a script file. You may want to tell who wrote it and when. You may want to explain what the code does and what all the variable names mean.

You can do this by typing that information after the hash (#). R ignores everything that appears after the hash. You can use the hash at the beginning of a line or somewhere in the middle. Run the following script, and see what happens:

# The All Star Grannies do it again!

baskets.of.Granny <- c(12,4,4,6,9,3) # Granny rules

sum(baskets.of.Granny) # total number of points

R has no specific construct to spread a comment over multiple lines. You’ll have to precede every line of the comment block with a hash symbol (#). In RStudio, you can easily comment or uncomment several lines together by selecting them and pressing Ctrl+Shift+C. Other editors have similar shortcuts.

Getting from Base R to More

Until now, you’ve used only functions that are available in the basic installation of R. But the real power of R lies in the fact that everyone can write his or her own functions and share them with other R users in an organized manner. Many knowledgeable people have written convenient functions with R, and often a new statistical method is published together with R code. Most of these authors distribute their code as R packages (collections of R code, Help files, datasets, and so on, that can be incorporated easily into R itself). In this section, we tell you how to find and add packages to your R installation.

Finding packages

Several websites, called repositories, offer a collection of R packages. The most important repository is the Comprehensive R Archive Network (CRAN; http://cran.r-project.org), which you can access easily from within R.

In addition to housing the installation files for R itself (see the appendix of this book) and a set of manuals for R, CRAN contains a collection of package files and the reference manuals for all packages. For some packages, a vignette (which gives you a short introduction to the use of the functions in the package) is also available. Finally, CRAN lets you check whether a package is still maintained and see an overview of the changes made in the package. CRAN is definitely worth checking out!

Installing packages

You install a package in R with the function — wait for it — install.packages(). Who could’ve guessed? So, to install the fortunes package, for example, you simply give the name of the package as a string to the install.packages() function.

The fortunes package contains a whole set of humorous and thought-provoking quotes from mailing lists and help sites. You install the package like this:

> install.packages('fortunes')

R may ask you to specify a CRAN mirror. Because everyone in the whole world has to access the same servers, CRAN is mirrored on more than 80 registered servers, often located at universities. Pick one that’s close to your location, and R will connect to that server to download the package files. In RStudio, you can set the mirror by choosing Tools⇒Options.

Next, R gives you some information on the installation of the package:

Installing package(s) into ‘D:/R/library’

(as ‘lib’ is unspecified)

....

opened URL

downloaded 165 Kb

package ‘fortunes’ successfully unpacked and MD5 sums checked

....

It tells you which directory (called a library) the package files are installed in, and it tells you whether the package was installed successfully. Granted, it does so in a rather technical way, but the word successfully tells you everything is okay.
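
If you use a script on more than one computer, you may not want to reinstall a package every time the script runs. Here’s one possible sketch of a helper that installs a package only when it’s missing. The function name installIfMissing() is hypothetical, and the default mirror address is just an assumption; pick whichever CRAN mirror suits you.

```r
# A hypothetical helper: install a package only if R can't find it
installIfMissing <- function(pkg, repos = "https://cloud.r-project.org") {
  if (requireNamespace(pkg, quietly = TRUE)) {
    return(invisible(FALSE))   # already installed, nothing to do
  }
  install.packages(pkg, repos = repos)
  invisible(TRUE)              # an installation was attempted
}

# The stats package ships with R, so nothing gets downloaded here
installIfMissing("stats")

# For a contributed package, you would call it like this:
# installIfMissing("fortunes")
```

The helper returns FALSE invisibly when the package is already there and TRUE when it had to call install.packages().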

Loading and unloading packages

After a while, you can end up with a collection of many packages. If R loaded all of them at the beginning of each session, that would take a lot of memory and time. So, before you can use a package, you have to load it into R by using the library() function.

You load the fortunes package like this:

> library(fortunes)

You don’t have to put single quotation marks around the package name when using library(), but it may be wise to do so.

Now you can use the functions from this package at the command line, like this:

> fortune("This is R")

The library is the directory where the packages are installed. Never, ever call a package a library. That’s a mortal sin in the R community. Take a look at the following, and never forget it again:

> fortune(161)

You can use the fortune() function without arguments to get a random selection of the fortunes available in the package. It’s a nice read.

If you want to unload a package, you’ll have to use some R magic. The detach() function will let you do this, but you have to specify that it’s a package you’re detaching, like this:

> detach(package:fortunes)
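
You can check whether a package is currently loaded by looking at the search path, which the search() function returns. The following sketch uses the splines package only because it ships with every R installation; the object names are made up for the example.

```r
# Loaded packages show up in the search path as "package:name"
pkg.entry <- "package:splines"

# Load the package and check the search path
library(splines)
attached.after.load <- pkg.entry %in% search()

# Unload it again and check once more
detach("package:splines")
attached.after.detach <- pkg.entry %in% search()

c(loaded = attached.after.load, still.loaded = attached.after.detach)
```

The first value is TRUE and the second is FALSE: after detach(), the functions of the package are no longer available at the command line until you call library() again.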

A package is as good as its author

Many people contribute in one way or another to R. As in any open-source community, there are people with very strong coding skills and people with especially heartwarming enthusiasm. R itself is tested thoroughly, and packages available on CRAN are checked for safety and functionality. This means that you can safely download and use those packages without fear of breaking your R installation or — even worse — your computer.

It doesn’t mean, however, that the packages always do what they claim to do. After all, even programmers are human, and they make mistakes. Before you use a new package, you may want to test it out using an example where you know what the outcome should be.

But the fact that many people use R is an advantage. Whereas reporting errors to a huge company can be a proverbial pain in the lower regions, you can reach the authors of packages by e-mail. Users report bugs and errors all the time, and package authors continue to update and improve their packages. Overall, the quality of the packages is at least as good as, if not better than, the quality of commercial applications. After all, the source code of every package is readily available for anyone to check and correct if he or she feels like it. Both R and the packages used in R are improved with contributions from users as well.

So, in fact, we could have titled this sidebar “A package is as good as the community that uses it.” The R community is simply fantastic — and you should know. By buying this book, you officially became a member.
