Chapter 4. Basic Expressions

Expressions are the building blocks of a function. R has a very clear syntax that suggests that an expression is either a symbol or a function call.

Although everything we do is in essence implemented by functions, R gives some functions a special syntax so that it is easier to write readable R code.
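To see that the special syntax is only a friendlier way of writing a function call, consider the following minimal sketch. It relies on the fact that R lets us call operators such as <- and if as ordinary functions by quoting their names in backticks; the variable names a, b, sign_label, and e are chosen only for illustration:

```r
# Assignment written with the usual special syntax
a <- 1

# The same assignment written as an ordinary call to the function `<-`
`<-`(b, 1)
identical(a, b)
## [1] TRUE

# A conditional expression is likewise a call to the function `if`
sign_label <- `if`(a > 0, "positive", "non-positive")
sign_label
## [1] "positive"

# A quoted assignment is a call object whose first element is the symbol `<-`
e <- quote(a <- 1)
is.call(e)
## [1] TRUE
identical(e[[1]], as.name("<-"))
## [1] TRUE
```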

In the next few sections, we will see the following fundamental expressions that are given a special syntax:

  • Assignment expressions
  • Conditional expressions
  • Loop expressions

Assignment expressions

Assignment may be one of the most fundamental expressions in all programming languages. What it does is assign or bind a value to a symbol so that we can refer to the value by that symbol later.

Although the concept is the same as in other languages, R uses the <- operator to perform assignment. Many other languages use = instead, which is also allowed in R:

x <- 1 
y <- c(1, 2, 3) 
z <- list(x, y) 

We don't have to declare the symbol and its type before assigning a value to it. If a symbol does not exist in the environment, the assignment will create that symbol. If a symbol already exists, the assignment will not end up in conflict, but will rebind the new value to that symbol.
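A short sketch of this behavior follows. The symbol name score is hypothetical and used only for illustration; the point is that the first assignment creates the symbol and the second one simply rebinds it, even to a value of a different type:

```r
# The first assignment creates the symbol; no declaration is needed
score <- 10L
class(score)
## [1] "integer"

# A second assignment rebinds the same symbol to a new value,
# which may even have a different type
score <- "ten"
class(score)
## [1] "character"
```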

Alternative assignment operators

There are some alternate yet equivalent operators we can use. Compared to x <- f(z), which binds the value of f(z) to symbol x, we can also use -> to perform assignment in the opposite direction:

2 -> x1 

We can even chain the assignment operators so that a set of symbols all take the same value:

x3 <- x2 <- x1 <- 0 

The expression 0 is evaluated only once, so the same value is assigned to the three symbols. To verify how it works, we can change 0 to a random number generator:

x3 <- x2 <- x1 <- rnorm(1)
c(x1, x2, x3)
## [1] 1.585697 1.585697 1.585697 

The rnorm(1) call generates a random number following the standard normal distribution. If each assignment re-invoked the random number generator, each symbol would have a different value. In fact, however, this does not happen. Later, I will explain what really happens and you will have a better understanding of it.
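Another way to convince ourselves that the right-hand side is evaluated only once is to count the evaluations. The following is a minimal sketch, assuming a hypothetical helper next_value() that increments a counter each time it is called. It also writes the chain with explicit parentheses, which is how R groups chained <- operators (from right to left):

```r
# Hypothetical helper: counts how many times it has been called
calls <- 0
next_value <- function() {
  calls <<- calls + 1
  calls * 10
}

# A chained assignment evaluates the right-hand side once
x3 <- x2 <- x1 <- next_value()
calls
## [1] 1
c(x1, x2, x3)
## [1] 10 10 10

# The chain groups from right to left, so this form is equivalent:
# each assignment yields its value, which the next assignment reuses
x3 <- (x2 <- (x1 <- next_value()))
calls
## [1] 2
c(x1, x2, x3)
## [1] 20 20 20
```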

As in other programming languages, = can also perform assignment:

x2 = c(1, 2, 3) 

If you are familiar with other popular programming languages such as Python, Java, and C#, you may regard = as almost an industry standard for assignment and may feel uncomfortable using <-, which requires more typing. However, Google's R Style Guide (https://google.github.io/styleguide/Rguide.xml#assignment) recommends <- over =, even though both are allowed and have exactly the same effect when they are used as assignment operators.

Here, I will provide a simple explanation of the subtle difference between <- and =. Let's first create an f() function that takes two arguments:

f <- function(input, data = NULL) { 
  cat("input:\n") 
  print(input) 
  cat("data:\n") 
  print(data) 
} 

The function basically prints the value of the two arguments. Then, let's use this function to demonstrate the difference between the two operators:

x <- c(1, 2, 3)
y <- c("some", "text")
f(input = x)
## input: 
## [1] 1 2 3 
## data: 
## NULL 

The preceding code uses both the <- and = operators, but they play different roles. The <- operator in the first two lines is used as an assignment operator, while = in the third line specifies a named argument input for the f() function.

More specifically, the <- operator evaluates the expression on its right-hand side c(1, 2, 3) and assigns the evaluated value to the symbol (variable) on the left-hand side x. The = operator is not used as an assignment operator but to match the function argument by name.

We know that the <- and = operators are interchangeable when they are used as assignment operators. Therefore, the preceding code is equivalent to the following code:

x = c(1, 2, 3)
y = c("some", "text")
f(input = x)
## input: 
## [1] 1 2 3 
## data: 
## NULL 

Here, we only use the = operator but for two different purposes: in the first two lines, = performs an assignment, while in the third line = specifies the named argument.

Now, let's see what happens if we change every = to <-:

x <- c(1, 2, 3)
y <- c("some", "text")
f(input <- x)
## input: 
## [1] 1 2 3 
## data: 
## NULL 

If you run this code, you will find that the output is the same. However, if you inspect the environment, you will observe a difference: a new input variable has been created in the environment with the value c(1, 2, 3):

input
## [1] 1 2 3 

So, what happened? Actually, two things happened in the third line. First, the assignment input <- x introduces a new input symbol to the environment, and the assignment expression itself evaluates to the value of x. Then, that value is passed as the first argument of the function f(). In other words, the first function argument is matched not by name but by position.
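The difference can be made visible with a small sketch. The helper supplied_names() is hypothetical and not part of f(); it only reports the argument names exactly as the caller supplied them, which shows that = names an argument while <- leaves it unnamed:

```r
# Hypothetical helper: returns the argument names as the caller supplied them
supplied_names <- function(...) names(list(...))

x <- c(1, 2, 3)

# With =, the argument carries the name input
supplied_names(input = x)
## [1] "input"

# With <-, an assignment is performed and the argument itself is unnamed
supplied_names(input <- x)
## NULL

# The assignment inside the call also created a variable as a side effect
input
## [1] 1 2 3
```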

To elaborate, we will conduct more experiments. The standard usage of the function is as follows:

f(input = x, data = y)
## input: 
## [1] 1 2 3 
## data: 
## [1] "some" "text" 

If we replace both = with <-, the result looks the same:

f(input <- x, data <- y)
## input: 
## [1] 1 2 3 
## data: 
## [1] "some" "text" 

For the code using =, we can exchange the two named arguments without changing the result:

f(data = y, input = x)
## input: 
## [1] 1 2 3 
## data: 
## [1] "some" "text" 

In this case, however, if we replace = with <-, the values of input and data are also exchanged:

f(data <- y, input <- x)
## input: 
## [1] "some" "text" 
## data: 
## [1] 1 2 3 

The following code has the same effect as that of the preceding code:

data <- y
input <- x
f(y, x)
## input: 
## [1] "some" "text" 
## data: 
## [1] 1 2 3 

This code not only results in f(y, x), but unnecessarily creates additional data and input variables in the current environment.

From the preceding examples and experiments, the bottom line is clear. R allows either <- or = as the assignment operator, but only = can specify a named argument in a function call. To reduce ambiguity and improve the readability of R code, follow the Google Style Guide: use <- only for assignment and = only to specify named arguments.

Using backticks with non-standard names

Assignment operators allow us to assign a value to a variable (or a symbol or name). However, direct assignment limits the format of the name. A name may contain only letters from a to z and A to Z (R is case-sensitive), digits, the underscore (_), and the dot (.). It must not contain spaces, and it must not start with an underscore (_) or a digit.

The following are some valid names:

students <- data.frame() 
us_population <- data.frame() 
sales.2015 <- data.frame() 

The following are invalid names due to violating naming rules:

some data <- data.frame() 
## Error: unexpected symbol in "some data" 
_data <- data.frame() 
## Error: unexpected input in "_" 
Population(Millions) <- data.frame() 
## Error in Population(Millions) <- data.frame() :  
##  object 'Millions' not found 

The preceding names violate the rules in different ways. The some data variable name contains a space, _data starts with _, and Population(Millions) is not a symbol name but a function call. In practice, it is quite likely that some invalid names might indeed be column names in a data table, such as the third name.
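One way to check a name without triggering a parse error is to compare it with the result of the base R function make.names(), which converts a string into a syntactically valid name. The following is a minimal sketch; the helper is_valid_name() is hypothetical and treats a name as valid only when make.names() leaves it unchanged:

```r
# Hypothetical helper: a name is valid if make.names() leaves it unchanged
is_valid_name <- function(name) make.names(name) == name

candidates <- c("students", "us_population", "sales.2015",
                "some data", "_data", "Population(Millions)")
is_valid_name(candidates)
## [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
```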

To work around these restrictions, we need to quote the invalid names with backticks to make them valid:

`some data` <- c(1, 2, 3) 
`_data` <- c(4, 5, 6) 
`Population(Millions)` <- c(city1 = 50, city2 = 60) 

To refer to these variables, also use backticks; otherwise, they will still be regarded as invalid:

`some data`
## [1] 1 2 3
`_data`
## [1] 4 5 6
`Population(Millions)`
## city1 city2 
##    50    60 

Backticks can be used wherever we create a symbol, irrespective of whether it is a function:

`Tom's secret function` <- function(a, d) { 
  (a ^ 2 - d ^ 2) / (a ^ 2 + d ^ 2)  
} 

It does not even matter if it is a list:

l1 <- list(`Group(A)` = rnorm(10), `Group(B)` = rnorm(10)) 

If the symbol name cannot be validly referred to directly, we also need to use backticks to refer to the symbol:

`Tom's secret function`(1,2)
## [1] -0.6
l1$`Group(A)`
##  [1] -0.8255922 -1.1508127 -0.7093875  0.5977409 -0.5503219 -1.0826915 
##  [7]  2.8866138  0.6323885 -1.5265957  0.9926590 

An exception is data.frame():

results <- data.frame(`Group(A)` = rnorm(10), `Group(B)` = rnorm(10))
results
##       Group.A.    Group.B. 
## 1  -1.14318956  1.66262403 
## 2  -0.54348588  0.08932864 
## 3   0.95958053 -0.45835235 
## 4   0.05661183 -1.01670316 
## 5  -0.03076004  0.11008584 
## 6  -0.05672594 -2.16722176 
## 7  -1.31293264  1.69768806 
## 8  -0.98761119 -0.71073080 
## 9   2.04856454 -1.41284611 
## 10  0.09207977 -1.16899586 

Unfortunately, even if we use backticks around a name with unusual symbols, data.frame() replaces those symbols with dots by applying make.names(). We can confirm this by looking at the column names of the resulting data frame:

colnames(results)
## [1] "Group.A." "Group.B." 

This often happens when you import a table such as the following CSV data resulting from an experiment:

ID,Category,Population(before),Population(after) 
0,A,10,12 
1,A,12,13 
2,A,13,16 
3,B,11,12 
4,C,13,12 

When you read the CSV data using read.csv(), the Population(before) and Population(after) columns will not keep their original names. Instead, read.csv() converts them to valid R names using the make.names() function. To find out what names we will get, we can run the following command:

make.names(c("Population(before)", "Population(after)"))
## [1] "Population.before." "Population.after."

Sometimes, this behavior is undesirable. To disable it, set check.names = FALSE when you call either read.csv() or data.frame():

results <- data.frame(
  ID = c(0, 1, 2, 3, 4),
  Category = c("A", "A", "A", "B", "C"),
  `Population(before)` = c(10, 12, 13, 11, 13),
  `Population(after)` = c(12, 13, 16, 12, 12),
  stringsAsFactors = FALSE,
  check.names = FALSE)
results
##   ID Category Population(before) Population(after)
## 1  0        A                 10                12
## 2  1        A                 12                13
## 3  2        A                 13                16
## 4  3        B                 11                12
## 5  4        C                 13                12
colnames(results)
## [1] "ID"    "Category"   "Population(before)" 
## [4] "Population(after)" 

In the preceding call, stringsAsFactors = FALSE avoids converting character vectors to factors, and check.names = FALSE avoids applying make.names() to the column names. With these two arguments, the resulting data frame preserves most aspects of the input data.
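The same argument works when importing the CSV data shown earlier. The following sketch passes the data as a string through the text argument of read.csv() so that it runs without a file on disk; the variable names csv_text, default_names, and kept_names are chosen only for illustration:

```r
# The experiment data from above, supplied as text instead of a file
csv_text <- "ID,Category,Population(before),Population(after)
0,A,10,12
1,A,12,13
2,A,13,16
3,B,11,12
4,C,13,12"

# Default behavior: column names are passed through make.names()
default_names <- read.csv(text = csv_text)
colnames(default_names)[3:4]
## [1] "Population.before." "Population.after." 

# With check.names = FALSE, the original column names are kept
kept_names <- read.csv(text = csv_text, check.names = FALSE)
colnames(kept_names)[3:4]
## [1] "Population(before)" "Population(after)" 
```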

As mentioned earlier, to access a column whose name contains special symbols, use backticks to quote the name:

results$`Population(before)`
## [1] 10 12 13 11 13
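Columns with such names can also be reached with [[ and a character string, which needs no backticks. The following is a small sketch that rebuilds a reduced version of the data frame so that it runs on its own; the growth column is hypothetical and added only to show both access styles in use:

```r
# A reduced version of the data frame, keeping the non-standard names
results <- data.frame(
  ID = c(0, 1, 2, 3, 4),
  `Population(before)` = c(10, 12, 13, 11, 13),
  `Population(after)` = c(12, 13, 16, 12, 12),
  check.names = FALSE)

# Access by character string with [[ ]]
results[["Population(before)"]]
## [1] 10 12 13 11 13

# Backtick access with $ works inside larger expressions as well
results$growth <- results$`Population(after)` - results$`Population(before)`
results$growth
## [1]  2  1  3  1 -1
```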

Backticks make it possible to create and access variables whose names contain symbols not allowed in direct assignment. This does not mean that using such names is recommended. On the contrary, it can make the code harder to read and more error-prone, and it makes it more difficult to work with external tools that impose strict naming rules.

In conclusion, using backticks to create special variable names should be avoided unless absolutely necessary.
