Writing R programs

Although much data analysis in R can be carried out interactively at the command prompt, more complex tasks often require writing R scripts. As mentioned in the introduction, R has features of both functional and object-oriented programming languages. This section describes some of the standard programming syntax in R.

Control structures

Control structures are meant for controlling the flow of execution of a program. The standard control structures are as follows:

  • if and else: To test a condition
  • for: To loop over a set of statements for a fixed number of times
  • while: To loop over a set of statements while a condition is true
  • repeat: To execute an infinite loop
  • break: To break the execution of a loop
  • next: To skip an iteration of a loop
  • return: To exit a function
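As a rough illustration of the control structures listed above, the following sketch uses each of them once. The variable names and the helper firstNegative() are made up for this example:

```r
# for + if + next: sum the odd numbers from 1 to 9
total <- 0
for (i in 1:9) {
  if (i %% 2 == 0) {
    next                     # skip even numbers
  }
  total <- total + i
}

# while: count how many halvings bring 20 below 1
value <- 20
steps <- 0
while (value >= 1) {
  value <- value / 2
  steps <- steps + 1
}

# repeat + break: find the smallest n whose square exceeds 50
n <- 0
repeat {
  n <- n + 1
  if (n^2 > 50) break        # leave the otherwise infinite loop
}

# return: exit a function as soon as a negative element is found
firstNegative <- function(v) {
  for (e in v) {
    if (e < 0) return(e)
  }
  NA                         # no negative element found
}

print(c(total, steps, n))              # 25 5 8
print(firstNegative(c(3, 1, -4, -2)))  # -4
```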

Functions

If one wants to use R for more serious programming, it is essential to know how to write functions. They make the language more powerful and elegant. R has many built-in functions, such as mean(), sort(), sin(), and plot(); many of these are themselves written in R.

A function is defined as follows:

>fname<-function(arg1,arg2,…){
        R Expressions
   }

Here, fname is the name of the function; arg1, arg2, and so on, are the arguments passed to the function. Note that, unlike in many other languages, a function in R need not end with an explicit return statement. By default, the value of the last expression evaluated inside the body of the function is returned.
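The following minimal sketch contrasts the two styles. The function names areaImplicit and areaExplicit are hypothetical and chosen only for this illustration:

```r
# Implicit return: the value of the last expression is returned
areaImplicit <- function(width, height) {
  width * height
}

# Explicit return: same result, written with return()
areaExplicit <- function(width, height) {
  return(width * height)
}

print(areaImplicit(3, 4))   # 12
print(areaExplicit(3, 4))   # 12
```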

Once a function is defined, it is executed simply by entering the function name with the values for the arguments:

>fname(arg1,arg2,…)

The important properties of functions in R are as follows:

  • Functions are first-class citizens
  • Functions can be passed as arguments to other functions
  • One can define a function inside another function (nesting)
  • The arguments of the functions can be matched by position or name
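These properties can be sketched with a few small functions. All the names used here (square, applyTwice, sumOfSquares, power) are invented for the example:

```r
# First-class: a function is an ordinary object stored in a variable
square <- function(v) v^2

# A function passed as an argument to another function
applyTwice <- function(f, v) f(f(v))

# Nesting: sq() is defined inside sumOfSquares() and is visible only there
sumOfSquares <- function(v) {
  sq <- function(e) e^2
  sum(sq(v))
}

# Argument matching by position or by name
power <- function(base, exponent) base^exponent

print(applyTwice(square, 3))          # 81
print(sumOfSquares(1:3))              # 14
print(power(2, 3))                    # 8, matched by position
print(power(exponent = 3, base = 2))  # 8, matched by name
```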

Let's consider a simple example of a function, which given an input vector x, calculates its mean. To write this function, open a new window in RStudio for R script from the menu bar through File | New File | R Script. In this R script, enter the following lines of code:

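myMean <-function(x){
    s <-sum(x)       # sum of all elements
    l <-length(x)    # number of elements
    avg <-s/l        # named avg so that the built-in mean() is not masked
    avg              # the last expression is the return value
}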

Select the entire code and use the keys Ctrl + Enter to execute the script. This completes the definition of the myMean function. To use this function on the command prompt, enter the following:

>x <-c(10,20,30,40,50)
>myMean(x)

This will generate the following result:

>myMean(x)
[1]  30

Scoping rules

In programming languages, it is very important to understand the scope of every variable to avoid errors during execution. There are two types of scoping for a variable in a function: lexical scoping and dynamic scoping. With lexical scoping, the value of a free variable (one that is neither an argument nor assigned inside the function) is looked up in the environment in which the function was defined. For a function defined at the top level, this is the global environment. With dynamic scoping, the value of such a variable is looked up in the environment from which the function was called (the calling environment).

R uses lexical scoping, which makes it possible to write functions inside other functions. This is illustrated with the following example:

>x <-0.1
>f <-function(y){
          x*y
    }
>g <-function(y){
          x<-5
          x-f(y)
    }
>g(10)
[1]  4

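The answer is 4 because, while evaluating function f, the value of x is taken from the global environment, where f was defined, so f(10) is 0.1 × 10 = 1. While evaluating function g, the value of x is taken from the local environment of g, which is 5, so the result is 5 − 1 = 4. Under dynamic scoping, f would instead have used x = 5 from its caller g, and g(10) would have returned 5 − 50 = −45.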

Lexical scoping has some disadvantages. Since the value of a variable is looked up from the environment in which the function is defined, all functions must carry a pointer to their respective defining environments. Also, all objects must be stored in memory during the execution of the program.
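The pointer to the defining environment is what allows a function to remember the variables that existed when it was created. A minimal sketch, assuming a hypothetical function factory makeMultiplier(), is shown here:

```r
# makeMultiplier() returns a new function; k is a free variable of the
# inner function and is looked up where that function was defined
makeMultiplier <- function(k) {
  function(y) k * y
}

timesTwo   <- makeMultiplier(2)
timesThree <- makeMultiplier(3)

k <- 100   # a global k; the functions above do not see this value

print(timesTwo(10))     # 20
print(timesThree(10))   # 30

# Each returned function carries its own defining environment
print(get("k", envir = environment(timesTwo)))   # 2
```

Because timesTwo and timesThree each keep their defining environment alive, the values 2 and 3 stay in memory for as long as the functions exist, which is the cost mentioned above.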

Loop functions

Often, we have a list containing some objects and we want to apply a function to every element of the list. For example, suppose we have the results of a survey containing the responses of n participants to m questions, and we would like to find the average response for each question (assuming that all responses are numeric). One could use a for loop over the set of questions and compute the average over the n participants using the mean() function in R. Loop functions come in handy in such situations because they express the same computation more compactly. They are loosely comparable to iterators in other languages such as Java.
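The survey case can be sketched as follows, with made-up responses for m = 3 questions and n = 4 participants. The explicit for loop and the loop function sapply() (described later in this section) give the same averages:

```r
# Hypothetical survey data: one numeric vector of responses per question
survey <- list(Q1 = c(4, 5, 3, 4),
               Q2 = c(2, 3, 3, 2),
               Q3 = c(5, 5, 4, 5))

# Version 1: explicit for loop over the questions
avgLoop <- numeric(length(survey))
names(avgLoop) <- names(survey)
for (question in names(survey)) {
  avgLoop[question] <- mean(survey[[question]])
}

# Version 2: a loop function does the same in one line
avgApply <- sapply(survey, mean)

print(avgLoop)
print(avgApply)
```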

The following are the standard loop functions in R:

  • lapply: To loop over a list and evaluate a function on each element
  • sapply: The same as lapply, but with the output in a simpler form
  • mapply: A multivariate version of sapply
  • apply: To apply functions over array margins
  • tapply: To apply a function to each cell of a ragged array

lapply

The lapply() function is used in the following manner:

>lapply(X,FUN,...)

Here, X is a list or vector containing data, and FUN is the function to be applied to each element of the list or vector. The last argument, ..., represents optional arguments that are passed on to FUN. The result of lapply() is always a list, regardless of the type of input.

As an example, consider the quarterly revenue of four companies in billions of dollars (not real data). We would like to compute, for each of the four companies, the average quarterly revenue over the year:

>X<-list(HP=c(12.5,14.3,16.1,15.4),IBM=c(22,24.5,23.7,26.2),Dell=c(8.9,9.7,10.8,11.5),Oracle=c(20.5,22.7,21.8,24.4)  )
>lapply(X,mean)
$HP
[1]  14.575

$IBM
[1]  24.1

$Dell
[1]  10.225

$Oracle
[1]  22.35

sapply

The sapply() function is similar to lapply() with the additional option of simplifying the output into a desired form. For example, sapply() can be used in the previous dataset as follows:

> sapply(X,mean,simplify="array")
     HP       IBM       Dell      Oracle
   14.575    24.100    10.225     22.350
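The difference between the two return types can be checked directly. The sketch below reuses the same (not real) revenue data; the variable names asList, asVector, and ranges are chosen only for this example. When the applied function returns more than one value per element, as range() does, sapply() simplifies the result to a matrix:

```r
X <- list(HP     = c(12.5, 14.3, 16.1, 15.4),
          IBM    = c(22, 24.5, 23.7, 26.2),
          Dell   = c(8.9, 9.7, 10.8, 11.5),
          Oracle = c(20.5, 22.7, 21.8, 24.4))

asList   <- lapply(X, mean)    # always a list
asVector <- sapply(X, mean)    # simplified to a named numeric vector
ranges   <- sapply(X, range)   # min and max per company: a 2 x 4 matrix

print(is.list(asList))     # TRUE
print(is.list(asVector))   # FALSE
print(dim(ranges))         # 2 4
print(ranges)
```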

mapply

The lapply() and sapply() functions iterate over the elements of a single list or vector. If you want to apply a function whose arguments vary over several lists or vectors in parallel, then mapply() comes in handy. Here is how it is used:

>mapply(FUN,L1,L2,...,Ln,SIMPLIFY=TRUE)

Here, L1, L2, ..., Ln are the lists or vectors to whose elements the function FUN needs to be applied. For example, consider the following command for generating a vector:

>rep(x=10,times=5)
[1] 10 10 10 10 10

Here, the rep function repeats the value of x five times. Suppose we want to create a list where the number 10 occurs 1 time, the number 20 occurs 2 times, and so on. We can use mapply as follows:

>mapply(rep,x=c(10,20,30,40,50),times=1:5)
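To make the result of this call concrete, the following sketch stores it in a variable (the name result is an assumption of this example) and inspects it. Because the five vectors have different lengths, mapply() cannot simplify them and returns a list. A second call with a hypothetical two-argument function shows the simplified case:

```r
# rep(10, 1), rep(20, 2), ..., rep(50, 5) evaluated in parallel
result <- mapply(rep, x = c(10, 20, 30, 40, 50), times = 1:5)

print(is.list(result))          # TRUE: lengths differ, so no simplification
print(sapply(result, length))   # 1 2 3 4 5
print(result[[3]])              # 30 30 30

# When every call returns a single value, the result is a plain vector
sums <- mapply(function(a, b) a + b, 1:3, 4:6)
print(sums)                     # 5 7 9
```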

apply

The apply() function is useful for applying a function to the margins of an array or matrix. The form of the function is as follows:

>apply(X,MARGIN,FUN,...)

Here, MARGIN is a vector giving the subscripts that the function will be applied over. For example, in the case of a matrix, 1 indicates rows and 2 indicates columns, and c(1,2) indicates rows and columns. Consider the following example as an illustration:

>Y <-matrix(1:9,nrow=3,ncol=3)
>Y
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
>apply(Y,1,sum) #sum along the row
[1] 12 15 18
>apply(Y,2,sum) #sum along the column
[1]  6 15 24
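The MARGIN value c(1,2) and the use of an anonymous function can be sketched on the same matrix. The variable names below are chosen only for this illustration; rowSums() is a base R shortcut that should agree with apply(Y,1,sum):

```r
Y <- matrix(1:9, nrow = 3, ncol = 3)

# MARGIN = 2 with a different function: mean of each column
colMeansY <- apply(Y, 2, mean)

# MARGIN = c(1, 2): the function is applied to every individual cell
scaled <- apply(Y, c(1, 2), function(e) e * 10)

print(colMeansY)    # 2 5 8
print(scaled)       # same shape as Y, every entry multiplied by 10
print(rowSums(Y))   # 12 15 18, same as apply(Y, 1, sum)
```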

tapply

The tapply() function is used to apply a function over the subsets of a vector. The function description is as follows:

>tapply(X,INDEX,FUN,simplify=TRUE)

Let us consider the earlier example of the quarterly revenue of four companies:

>X<-list(HP=c(12.5,14.3,16.1,15.4),IBM=c(22,24.5,23.7,26.2),Dell=c(8.9,9.7,10.8,11.5),Oracle=c(20.5,22.7,21.8,24.4)  )

Using lapply(), we found the average quarterly revenue of each company over the year. Suppose we now want to find the revenue per quarter averaged over all four companies. We can use tapply() for this; here we use the function c() instead of list() to create X, so that X is a single vector of 16 values:

>X<-c(HP=c(12.5,14.3,16.1,15.4),IBM=c(22,24.5,23.7,26.2),Dell=c(8.9,9.7,10.8,11.5),Oracle=c(20.5,22.7,21.8,24.4)  )

>f<-factor(rep(c("Q1","Q2","Q3","Q4"),times=4) ) 
>f
 [1] Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4
Levels: Q1 Q2 Q3 Q4

>tapply(X,f,mean,simplify=TRUE)
    Q1     Q2     Q3     Q4
15.975 17.800 18.100 19.375

By creating a factor whose levels are the quarters, we can apply the mean function to the values belonging to each quarter using tapply().
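As a cross-check, the same quarterly averages can be obtained by arranging the 16 values in a matrix with one row per quarter and one column per company. This is only a sketch; the names revenue, quarter, byQuarter, and m are assumptions of this example:

```r
revenue <- c(12.5, 14.3, 16.1, 15.4,    # HP
             22,   24.5, 23.7, 26.2,    # IBM
             8.9,  9.7,  10.8, 11.5,    # Dell
             20.5, 22.7, 21.8, 24.4)    # Oracle
quarter <- factor(rep(c("Q1", "Q2", "Q3", "Q4"), times = 4))

# Group the values by quarter and average each group
byQuarter <- tapply(revenue, quarter, mean)

# matrix() fills column by column, so each column is one company
# and each row is one quarter
m <- matrix(revenue, nrow = 4)

print(byQuarter)
print(rowMeans(m))
```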
