Understanding the copy-on-modify mechanism

In the previous section, we showed how lazy evaluation works and how it can save computing time and working memory by avoiding unnecessary evaluation of function arguments. In this section, we will look at an important feature of R that makes it safer to work with data. Suppose we create a simple numeric vector x1:

x1 <- c(1, 2, 3) 

Then, we assign the value of x1 to x2:

x2 <- x1 

Now, x1 and x2 have exactly the same value. What if we modify an element in one of the two vectors? Will both vectors change?

x1[1] <- 0
x1
## [1] 0 2 3
x2
## [1] 1 2 3 

The output shows that when x1 is changed, x2 remains unchanged. You may guess that the assignment automatically copies the value and makes the new variable point to the copy instead of the original data. Let's use tracemem() to track the footprint of the data in memory and check whether that guess is right.

Let's reset the vectors and conduct an experiment by tracing the memory addresses of x1 and x2:

x1 <- c(1, 2, 3)
x2 <- x1 

When we call tracemem() on each of the two vectors, it returns the current memory address of the data. From then on, whenever the traced data is copied, a message is printed with the original address and the new address:

tracemem(x1)
## [1] "<0000000013597028>"
tracemem(x2)
## [1] "<0000000013597028>" 

Now, both vectors have the same value, and x1 and x2 share the same address, which implies that they point to exactly the same piece of data in memory and that the assignment operation does not copy the data automatically. But when is the data copied?

Now, we will modify the first element of x1 to 0:

x1[1] <- 0
## tracemem[0x0000000013597028 -> 0x00000000170c7968] 

The memory tracing says that the address of x1 has changed to a new one. More specifically, the original vector that both x1 and x2 point to is copied to a new location, so there are now two copies of the same data in two different locations. Then, the first element of the copy is modified, and finally, x1 is made to point to the modified copy.

Now, x1 and x2 have different values: x1 points to the modified vector and x2 remains pointing to the original vector.

In other words, if multiple variables refer to the same object, modifying one variable will make a copy of the object. This mechanism is called copy-on-modify.
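
The whole experiment can be condensed into one short script. The following is only a sketch: it assumes that the R build supports memory profiling (checked with capabilities("profmem")), and the helper variables addr_before, addr_after, same_before, and moved are names introduced here for illustration. The addresses printed will differ on every machine:

```r
x1 <- c(1, 2, 3)
x2 <- x1                      # no copy yet: both names refer to one vector

if (capabilities("profmem")) {
  addr_before <- tracemem(x1)
  same_before <- identical(addr_before, tracemem(x2))  # shared address

  x1[1] <- 0                  # the modification triggers the copy

  addr_after <- tracemem(x1)  # address of the vector x1 refers to now
  moved <- !identical(addr_before, addr_after)

  untracemem(x1)              # stop tracing both vectors
  untracemem(x2)
} else {
  x1[1] <- 0                  # same modification, just without tracing
}

x1
x2
```

If tracing is available, same_before and moved should both be TRUE: the two variables share one address until x1 is modified, and x1 refers to a different address afterwards.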

Another scenario where copy-on-modify happens is when we modify a function argument. Suppose we create the following function:

modify_first <- function(x) {
  x[1] <- 0
  x
} 

When the function is executed, it attempts to modify the first element of argument x. Let's do some experiments with vectors and lists and see whether modify_first() can modify them.

For a numeric vector v1:

v1 <- c(1, 2, 3)
modify_first(v1)
## [1] 0 2 3
v1
## [1] 1 2 3 

For a list v2:

v2 <- list(x = 1, y = 2)
modify_first(v2)
## $x
## [1] 0
##
## $y
## [1] 2
v2
## $x
## [1] 1
##
## $y
## [1] 2 

In both experiments, the function only returned a modified version of the original object, but it did not modify the original object. However, directly modifying the vectors outside the function works:

v1[1] <- 0
v1
## [1] 0 2 3
v2[1] <- 0
v2
## $x
## [1] 0
##
## $y
## [1] 2 

To use the modified version, we need to assign it to the original variable:

v3 <- 1:5
v3 <- modify_first(v3)
v3
## [1] 0 2 3 4 5 

The preceding examples demonstrate that modifying a function argument also triggers a copy, which ensures that the modification does not affect anything outside the function.
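
It may help to see that the copy is caused by the modification, not by passing the argument. The sketch below contrasts modify_first() with a hypothetical read_first() function that only reads its argument. It assumes tracemem() is available in the R build; when it is, a tracemem message is expected only for the call that writes to x:

```r
# Hypothetical helper that only reads its argument
read_first <- function(x) x[1]

# Same function as in the text: writes to its argument
modify_first <- function(x) {
  x[1] <- 0
  x
}

v <- c(1, 2, 3)
tracing <- capabilities("profmem")
if (tracing) invisible(tracemem(v))

first <- read_first(v)       # reading x does not require a copy
changed <- modify_first(v)   # writing to x copies the vector first

if (tracing) untracemem(v)

v        # the original is untouched
changed  # the modified copy returned by the function
```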

The copy-on-modify mechanism also happens when the attributes are modified. The following function removes the row names of a data frame and replaces its column names with capital letters:

change_names <- function(x) {
  if (is.data.frame(x)) {
    rownames(x) <- NULL
    if (ncol(x) <= length(LETTERS)) {
      colnames(x) <- LETTERS[1:ncol(x)]
    } else {
      stop("Too many columns to rename")
    }
  } else {
    stop("x must be a data frame")
  }
  x
} 

To test the function, we will create a simple data frame with randomly generated data:

small_df <- data.frame(
  id = 1:3,
  width = runif(3, 5, 10),
  height = runif(3, 5, 10))
small_df
##   id    width   height
## 1  1 7.605076 9.991836
## 2  2 8.763025 7.360011
## 3  3 9.689882 8.550459

Now, we will call the function with the data frame and see the modified version:

change_names(small_df)
##   A        B        C
## 1 1 7.605076 9.991836
## 2 2 8.763025 7.360011
## 3 3 9.689882 8.550459

According to the copy-on-modify mechanism, small_df is copied the first time when its row names are removed, and then, all subsequent changes are made to the copied version instead of the original version. We can verify this by viewing small_df:

small_df
##   id    width   height
## 1  1 7.605076 9.991836
## 2  2 8.763025 7.360011
## 3  3 9.689882 8.550459

The original version has not changed at all.
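
The same behavior can be observed with any attribute, not only the names of a data frame. As a minimal sketch, the hypothetical function set_unit() below attaches a made-up "unit" attribute to its argument:

```r
# Hypothetical helper: attach a "unit" attribute to x and return the result
set_unit <- function(x, unit) {
  attr(x, "unit") <- unit   # modifies the local copy only
  x
}

w <- c(1.5, 2.5)
w_cm <- set_unit(w, "cm")

attributes(w)       # NULL: the original vector has no attributes
attr(w_cm, "unit")  # the attribute exists only on the returned copy
```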

Modifying objects outside a function

Despite the copy-on-modify mechanism, it is still possible to modify a variable defined outside a function. The <<- operator is designed to do the job: instead of creating a local variable, it assigns to an existing variable found in the enclosing environments. Suppose we have a variable x and create a function modify_x() that simply assigns a new value to x:

x <- 0
modify_x <- function(value) {
  x <<- value
} 

When we call the function, the value of x will be replaced:

modify_x(3)
x
## [1] 3 
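
To see what <<- adds, it can be compared with the ordinary <- operator. In the following sketch, assign_local() and assign_outer() are hypothetical names used only for this comparison:

```r
x <- 0

# Uses <-, which creates a new local x inside the function
assign_local <- function(value) {
  x <- value
  invisible(NULL)
}

# Uses <<-, which assigns to the x defined outside the function
assign_outer <- function(value) {
  x <<- value
  invisible(NULL)
}

assign_local(3)
x_after_local <- x   # still 0: only the local x was changed

assign_outer(3)
x_after_outer <- x   # now 3: the outer x was replaced
```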

This can be useful when you try to map a vector to a new list and do some counting at the same time. The following code creates a list of vectors with an increasing number of elements. In each iteration of lapply(), count is used to accumulate the total number of elements in the vectors generated so far:

count <- 0
lapply(1:3, function(x) {
  result <- 1:x
  count <<- count + length(result)
  result
})
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1 2
##
## [[3]]
## [1] 1 2 3
count
## [1] 6 
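
For comparison, when the counting does not have to happen during the iteration, the same total can be computed afterwards without any side effects. This is only an alternative sketch, not a replacement for the <<- technique, and the variable name vectors is chosen here for illustration:

```r
# Build the list first, with a function that has no side effects
vectors <- lapply(1:3, function(x) 1:x)

# lengths() returns the length of each list element
count <- sum(lengths(vectors))
count
```

The version with <<- remains the natural choice when the counter must be updated while the list is being built.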

Another example in which <<- is useful is to flatten a nested list. Suppose we have a nested list like the one shown here:

nested_list <- list(
  a = c(1, 2, 3),
  b = list(
    x = c("a", "b", "c"),
    y = list(
      z = c(TRUE, FALSE),
      w = c(2, 3, 4))
  )
)
str(nested_list)
## List of 2
## $ a: num [1:3] 1 2 3
## $ b:List of 2
## ..$ x: chr [1:3] "a" "b" "c"
## ..$ y:List of 2
## .. ..$ z: logi [1:2] TRUE FALSE
## .. ..$ w: num [1:3] 2 3 4

We want to flatten the list so that the nested levels are all brought to the first level. The following code solves the problem using rapply() and <<-.

First, we need to know that rapply() is a recursive version of lapply(). In each iteration, the supplied function is called with an atomic vector at a particular level in the list until all atomic vectors at all levels are exhausted. Calling rapply(nested_list, f) basically runs in the following manner:

f(c(1, 2, 3))
f(c("a", "b", "c"))
f(c(TRUE, FALSE))
f(c(2, 3, 4)) 
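
A quick way to observe this behavior is to pass a simple function such as length() to rapply(). The sketch below rebuilds nested_list so that it can run on its own; the variable name lens is introduced here for illustration:

```r
nested_list <- list(
  a = c(1, 2, 3),
  b = list(
    x = c("a", "b", "c"),
    y = list(
      z = c(TRUE, FALSE),
      w = c(2, 3, 4))
  )
)

# length() is called once for each atomic vector, at every level
lens <- rapply(nested_list, length)
lens
```

The result has one element per atomic vector, and its names (a, b.x, b.y.z, b.y.w) record the path to each vector in the nested list.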

Keep in mind that our goal is to flatten nested_list. The solution that we will discuss is inspired by a Stack Overflow answer (http://stackoverflow.com/a/8139959/2906900), which makes clever use of rapply(). First, we will create an empty list to receive the individual vectors in the nested list, and a counter:

flat_list <- list()
i <- 1 

Then, we will use rapply() to recursively apply a function to nested_list. In each iteration, the function receives an atomic vector in nested_list through x. The function sets the i-th element of flat_list to x and increments the counter i:

res <- rapply(nested_list, function(x) {
  flat_list[[i]] <<- x
  i <<- i + 1
})

With the iterations done, all atomic vectors are stored in flat_list at the first level. The value returned by rapply() is as follows:

res
##     a   b.x b.y.z b.y.w
##     2     3     4     5

Because i <<- i + 1 is the last expression in the function, the values in res are simply the successive values of the counter and are of little importance. However, the names of res are useful because they indicate the original level and name of each element in flat_list. So we give flat_list the names of res to indicate the origin of each element:

names(flat_list) <- names(res)
str(flat_list)
## List of 4
## $ a    : num [1:3] 1 2 3
## $ b.x  : chr [1:3] "a" "b" "c"
## $ b.y.z: logi [1:2] TRUE FALSE
## $ b.y.w: num [1:3] 2 3 4

Finally, all elements in nested_list are stored in a flat way in flat_list.
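
The steps above can also be wrapped in a function, so that the helper variables do not have to live in the global environment. The following is a minimal sketch under the same approach. The name flatten_list() is hypothetical, and the sketch assumes that every leaf of the input is a non-NULL atomic vector:

```r
# Hypothetical wrapper around the rapply() and <<- technique
flatten_list <- function(x) {
  flat <- list()   # receives the atomic vectors
  i <- 1           # position of the next vector in flat

  res <- rapply(x, function(v) {
    # <<- finds flat and i in the enclosing function, not in the
    # global environment
    flat[[i]] <<- v
    i <<- i + 1
  })

  names(flat) <- names(res)   # reuse the path names built by rapply()
  flat
}

nested_list <- list(
  a = c(1, 2, 3),
  b = list(
    x = c("a", "b", "c"),
    y = list(
      z = c(TRUE, FALSE),
      w = c(2, 3, 4))
  )
)

flat_result <- flatten_list(nested_list)
str(flat_result)
```

Here <<- modifies flat and i inside flatten_list() because those are the nearest enclosing definitions, so each call starts from a fresh list and counter.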
