In the previous sections, we saw how to perform spatial interpolation of point data in R using several methods, including IDW, OK, and UK. We also learned that, in the case of minimum temperature in 2002, UK outperformed the other two methods in terms of accuracy, with a LOOCV RMSE value of 1.46, compared to 1.88 and 1.68 for IDW and OK, respectively. In this section, we are going to see how we can automate spatial interpolation in order to produce a set of temperature maps for different years with a single code execution. For this purpose, we are going to construct two loops:

- The first loop will perform cross-validation for each combination of variable, year, and interpolation method, in order to choose the most accurate method
- The second loop will use the chosen method to spatially interpolate each variable/year combination and produce the final maps
We will begin with the first task: systematically evaluating RMSE among the different years, variables, and methods. A very useful function for such tasks, named expand.grid, will help us keep track of the examination. Given a set of vectors, expand.grid returns a data.frame object with all possible combinations of the elements from each vector. The following expression, for example, creates a data.frame object, named cv_results, with four columns (variable, year, method, and rmse):
> cv_results = expand.grid(
+   variable = c("mmnt", "mmxt"),
+   year = 2001:2010,
+   method = c("IDW", "OK", "UK"),
+   rmse = NA)
Since only three of the columns have more than one possible element, and there are 2 variables * 10 years * 3 methods = 60 unique ways to combine them, cv_results has 60 rows. The first few rows of this data.frame object look as follows:
> head(cv_results)
  variable year method rmse
1     mmnt 2001    IDW   NA
2     mmxt 2001    IDW   NA
3     mmnt 2002    IDW   NA
4     mmxt 2002    IDW   NA
5     mmnt 2003    IDW   NA
6     mmxt 2003    IDW   NA
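As an aside, note the ordering in the preceding output: expand.grid varies its first vector fastest, which is why variable alternates on consecutive rows while year advances every two rows. A minimal standalone illustration:

```r
# expand.grid cycles the first vector fastest, the last one slowest
g = expand.grid(x = 1:2, y = c("a", "b"))
nrow(g)            # 4 rows: every x/y combination
as.integer(g$x)    # 1 2 1 2 -- x completes a full cycle before y advances
as.character(g$y)  # "a" "a" "b" "b"
```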
This table is going to help us in two ways:

- The loop will go through the variable, year, and method columns. This way, we can be sure that all the possible combinations are covered.
- The loop will assign each result into the rmse column so that, by the time the loop execution is complete, we will conveniently have a table with all of the results.

The loop will be defined to go through all rows of cv_results, with i being assigned the current row index each time. The code within the loop will then perform the following operations:
1. Create a point layer, dat (using the create_pnt function we previously defined), with the variable and year set according to the current row of cv_results (rather than at fixed values such as 2002 and "mmnt").
2. Determine the formula (form) and variogram model (v_mod).
3. Create a gstat object based on the appropriate point layer (dat), formula (form), and variogram model (v_mod).
4. Perform cross-validation.
5. Calculate the RMSE and assign it into the rmse column of cv_results.

The code snippet, with the code sections for each of the preceding steps marked by a comment, is as follows:
> for(i in 1:nrow(cv_results)) {
+   # (1) Create point layer as required
+   dat = create_pnt(stations = spain_stations,
+     annual = spain_annual,
+     year = cv_results$year[i],
+     variable = cv_results$variable[i],
+     new_proj = proj4string(dem_spain))
+   # (2) Create *form* and *v_mod* objects
+   if(cv_results$method[i] == "IDW") {
+     form = value ~ 1
+     v_mod = NULL
+   } else {
+     if(cv_results$method[i] == "OK") {
+       form = value ~ 1
+     }
+     if(cv_results$method[i] == "UK") {
+       form = value ~ elevation
+     }
+     v_mod = autofitVariogram(
+       formula = form,
+       input_data = dat)$var_model
+   }
+   # (3) Create gstat object
+   g = gstat(formula = form, data = dat, model = v_mod)
+   # (4) Perform cross-validation
+   cv = gstat.cv(g)
+   # (5) Calculate RMSE and assign to cv_results
+   cv_results$rmse[i] = rmse(cv)
+ }
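The rmse function called in step 5 was defined earlier. For reference, a helper in the same spirit might look as follows (a sketch; the earlier definition may differ in details). gstat.cv returns an object whose residual column holds the leave-one-out prediction errors, so the RMSE is simply the root mean square of that column:

```r
# Root mean square error from a cross-validation result with a
# *residual* column (as returned by gstat.cv)
rmse = function(cv) {
  sqrt(mean(cv$residual^2, na.rm = TRUE))
}
rmse(data.frame(residual = c(3, -4)))  # sqrt((9 + 16) / 2) = 3.535534
```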
Note that, in step 2, nested conditional statements are used in order to determine form and v_mod according to the interpolation method that is being used. We can write down the procedure this particular code section performs in the form of a decision tree:

- If method is "IDW", then form is value~1 and v_mod is NULL
- Otherwise:
  - If method is "OK", then form is value~1
  - If method is "UK", then form is value~elevation
  - v_mod is calculated according to form
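The same decision tree can be wrapped in a small helper function. The following is only an illustrative sketch (choose_formula is not part of the chapter's code), returning the formula together with a flag indicating whether a variogram model needs to be fitted:

```r
# Hypothetical helper mirroring the decision tree: IDW needs no
# variogram model; OK and UK differ only in the formula
choose_formula = function(method) {
  if(method == "IDW") {
    list(form = value ~ 1, fit_vgm = FALSE)
  } else {
    form = if(method == "OK") value ~ 1 else value ~ elevation
    list(form = form, fit_vgm = TRUE)
  }
}
choose_formula("UK")$form  # value ~ elevation
```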
Once the loop execution is completed (this may take a few moments), the rmse column of cv_results will be filled with the resulting RMSE values instead of NA values:
> head(cv_results)
  variable year method     rmse
1     mmnt 2001    IDW 1.942872
2     mmxt 2001    IDW 1.594996
3     mmnt 2002    IDW 1.880490
4     mmxt 2002    IDW 1.570574
5     mmnt 2003    IDW 1.912887
6     mmxt 2003    IDW 1.605938
Note that, for instance, the RMSE value in the third row of the table (1.880490), which corresponds to the IDW interpolation of "mmnt" for 2002, is identical, as expected, to the result we previously obtained manually. Several interesting comparisons can be made using this RMSE table. As previously stated, we are mainly interested in seeing which interpolation method is the most accurate, on average, across the 10 years and the two variables. To find out, we can use tapply:
> tapply(cv_results$rmse, cv_results$method, mean)
     IDW       OK       UK 
1.773243 1.908957 1.458997 
The result shows that, on average, UK yields the lowest prediction error among the three methods. Interestingly, OK is less accurate, on average, than IDW although in the particular example we previously saw it was the other way around. Based on this result, UK is going to be our method of choice to produce annual temperature maps of Spain.
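Since tapply also accepts a list of grouping factors, the comparison could be broken down further, for example by method and variable at once. The following sketch uses mock RMSE values so that it runs standalone; with the real cv_results, it would yield one mean per method/variable combination:

```r
# Mock cv_results (real rmse values would come from the loop above)
cv_results = expand.grid(
  variable = c("mmnt", "mmxt"),
  year = 2001:2010,
  method = c("IDW", "OK", "UK"))
cv_results$rmse = runif(nrow(cv_results), 1, 2)
# Two grouping factors: methods in rows, variables in columns
tapply(cv_results$rmse,
       list(cv_results$method, cv_results$variable),
       mean)
```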
It should be noted that comparing cross-validation RMSE is only meaningful among different spatial prediction models of the same data. Otherwise, RMSE may be higher or lower not only because the prediction error is different, but also because the predicted values themselves are on different scales. For example, the RMSE of a precipitation amount ("tpcp") prediction would have been much higher than the RMSE of the temperature predictions in our case (you can try and see this), simply because annual precipitation amounts in Spain (in mm) are generally higher by an order of magnitude than temperatures (in degrees Celsius). However, this does not mean that the prediction models of precipitation are less accurate with respect to the scale of the measured variable itself.
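One common workaround when scales differ (not used in this chapter, but worth knowing) is a normalized RMSE, for instance, RMSE divided by the range of the observed values; a sketch:

```r
# Normalized RMSE: dividing by the range of the observations makes
# errors comparable across variables measured on different scales
nrmse = function(predicted, observed) {
  sqrt(mean((predicted - observed)^2, na.rm = TRUE)) /
    diff(range(observed, na.rm = TRUE))
}
nrmse(c(11, 19, 31), c(10, 20, 30))  # 1 / 20 = 0.05
```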
We are ready to move on and produce annual temperature maps using another loop, one that calculates the predicted values (rather than performing cross-validation, which we just did). Once again, a well-organized way of doing this is to construct a table describing the set of parameters we need for each spatial interpolation, then go through that table and perform the required operations in turn. We will once again use the expand.grid function, producing all combinations of just two variables this time: year (the 5 years 2006-2010) and variable ("mmnt" and "mmxt"). The method will be left unspecified as we have already decided it will invariably be UK.
> spainT_tab = expand.grid(
+   year = 2006:2010,
+   variable = c("mmnt", "mmxt"))
The resulting table, named spainT_tab, looks as follows:
> spainT_tab
   year variable
1  2006     mmnt
2  2007     mmnt
3  2008     mmnt
4  2009     mmnt
5  2010     mmnt
6  2006     mmxt
7  2007     mmxt
8  2008     mmxt
9  2009     mmxt
10 2010     mmxt
We are ready to create the second loop, which will go through the rows of spainT_tab and spatially interpolate the respective temperature data. Since our results are going to be rasters in this case (rather than numeric values, as was the case with the cross-validation loop), they cannot be appended to the table itself. Instead, we are going to stack the results in a multiband raster where each layer corresponds to a row in spainT_tab. In other words, spainT_tab is going to function as a supplementary table for the resulting multiband raster (much as dates served the same purpose for the raster r; see Chapter 4, Working with Rasters). To construct the raster, we will first create an empty RasterStack to which the interpolation results will be appended during loop execution:
> spainT = stack()
The loop will go through the rows of spainT_tab, each time performing the following set of operations:

1. Create a point layer, dat, with the variable and year set according to the current row of spainT_tab.
2. Automatically fit a variogram model (v).
3. Create a gstat object (g) according to dat and v.
4. Calculate a predicted surface (z) with interpolate, based on g and dem_spain.
5. Mask z according to dem_spain and attach the result to spainT.

The loop code appears as follows:
> for(i in 1:nrow(spainT_tab)) {
+   # (1) Create point layer as required
+   dat = create_pnt(stations = spain_stations,
+     annual = spain_annual,
+     year = spainT_tab$year[i],
+     variable = spainT_tab$variable[i],
+     new_proj = proj4string(dem_spain))
+   # (2) Automatically fit variogram model
+   v = autofitVariogram(formula = value ~ elevation,
+     input_data = dat)
+   # (3) Create gstat object
+   g = gstat(formula = value ~ elevation,
+     model = v$var_model,
+     data = dat)
+   # (4) Interpolate!
+   z = interpolate(dem_spain, g, xyOnly = FALSE)
+   # (5) Mask and add predicted surface to results stack
+   spainT = stack(spainT, mask(z, dem_spain))
+ }
Once loop execution is complete, the interpolation results are stored as individual layers of the spainT raster. The raster consists of 10 layers, corresponding to the 10 rows of spainT_tab. Right now, the layer names are not very informative:
> names(spainT)
 [1] "var1.pred.1.1.1" "var1.pred.2.1.1" "var1.pred.1.2.1"
 [4] "var1.pred.2.2.1" "var1.pred.1.1.2" "var1.pred.2.1.2"
 [7] "var1.pred.1.2.2" "var1.pred.2.2.2" "var1.pred.1"
[10] "var1.pred.2"
However, with spainT_tab at our disposal, we can easily change this as follows:

> names(spainT) = paste(spainT_tab$variable,
+   spainT_tab$year,
+   sep = "_")
> names(spainT)
 [1] "mmnt_2006" "mmnt_2007" "mmnt_2008" "mmnt_2009" "mmnt_2010"
 [6] "mmxt_2006" "mmxt_2007" "mmxt_2008" "mmxt_2009" "mmxt_2010"
Let's now plot spainT using the levelplot function. With layout=c(5,2), we make sure that the minimum and maximum temperature maps are plotted in distinct rows:
> library(rasterVis)
> levelplot(spainT, par.settings = "BuRdTheme", layout = c(5,2))
The graphical output is shown in the following screenshot:
The preceding screenshot shows predicted minimum (top row) and maximum (bottom row) temperatures in Spain for each of the years from 2006 to 2010. As can be expected, maximum temperatures are always higher than minimum temperatures. The predicted spatial pattern of temperature is fairly similar over the years, which also makes sense: colder areas are generally colder in all years, as are warmer areas. However, the latter poses a problem when interpreting this type of result, as the characteristic temperature pattern within each year overwhelms the differences among years, making it difficult to notice the year-to-year variation. This can be partially addressed by plotting the minimum and maximum temperature rasters separately (in order to use a wider color scale each time), although the interested reader can verify that this does little to solve the problem.
When we are interested in interannual differences only, the most reasonable course of action is to determine the characteristic annual temperature state, calculate the deviation of each individual year's state from it, and then compare the deviation images rather than the original ones. Calculating deviations from a general trend is a very common practice to highlight patterns of interest in data, and there are many ways to find both the general trend and the deviations of each observation from it. Here, we will employ the simplest possible approach: the characteristic temperature in each pixel will be defined as the 5-year average observed in that pixel, and the deviations will be the arithmetic differences of each observation from that average.
Since we need to treat the minimum and maximum temperature data separately, we will first create two vectors identifying the relevant spainT layers for each case:
> mmnt_layers = which(spainT_tab$variable == "mmnt")
> mmnt_layers
[1] 1 2 3 4 5
> mmxt_layers = which(spainT_tab$variable == "mmxt")
> mmxt_layers
[1]  6  7  8  9 10
We will also create a new two-band RasterStack, named means, to hold the mean minimum and maximum temperature patterns in its first and second layers, respectively. The two layers will also be named accordingly ("mmnt" and "mmxt"):
> means = stack(mean(spainT[[mmnt_layers]]),
+   mean(spainT[[mmxt_layers]]))
> names(means) = c("mmnt", "mmxt")
Now that we have the means, all we need to do is subtract the appropriate mean from each set of layers in spainT (the mean of the minimum temperatures from layers 1-5 and the mean of the maximum temperatures from layers 6-10). There are numerous ways to do this. For example, we could divide the spainT raster into two, subtract the respective mean from each substack, and then combine them once again. A more elegant way, extendable to any number of categories, is to create a new means raster with the same number of layers as spainT, by means of duplication, with the relevant means occupying its layers. In other words, we need a raster with means[[1]] duplicated five times (occupying layers 1-5) and means[[2]] duplicated five times (occupying layers 6-10). To do this, we can utilize the fact that raster layers can also be selected by their names. The following expression therefore takes the "mmnt" and "mmxt" layers and duplicates them in agreement with the spainT_tab$variable column:
> means = means[[spainT_tab$variable]]
This gives us the desired result: a 10-band raster with temperature means for either the minimum or maximum temperature, according to spainT_tab (and thus matching spainT). The layer names reveal that, indeed, the first five layers correspond to the minimum temperature and the last five to the maximum temperature:
> names(means)
 [1] "mmnt.1" "mmnt.2" "mmnt.3" "mmnt.4" "mmnt.5" "mmxt.1"
 [7] "mmxt.2" "mmxt.3" "mmxt.4" "mmxt.5"
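The duplication-by-name step above is the same indexing idiom that base R uses for named vectors and lists, which may make the trick easier to see in isolation; a minimal sketch:

```r
# Indexing by a vector of (repeated) names duplicates elements in
# the requested order -- the raster analog duplicates layers
m = c(mmnt = 10, mmxt = 20)
idx = rep(c("mmnt", "mmxt"), each = 3)
m[idx]  # 10 10 10 20 20 20 (named mmnt, mmnt, mmnt, mmxt, mmxt, mmxt)
```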
All that is left to do is subtract one 10-band raster from the other:
> spainT = spainT - means
The new spainT raster now holds the temperature deviations from the 5-year average pattern of either the minimum or maximum temperature. We can check and see that the deviations are within the range of -4 to +4 degrees Celsius around the respective 5-year mean of 2006-2010:
> range(spainT[], na.rm = TRUE)
[1] -4.020094  3.200148
Plotting these deviations will help us see more clearly where exactly the temperature was high or low, with respect to the average pattern, and what the magnitude of the departure was. This time, we will also indicate that contours should be drawn at 1°C intervals:
> levelplot(spainT,
+   par.settings = "BuRdTheme",
+   layout = c(5,2),
+   contour = TRUE,
+   at = c(-4:-1,1:4))
The graphical output is shown in the following screenshot:
This time, we can easily notice, for example, that the minimum temperatures in 2010 were higher than average in southern Spain while the maximum temperatures were lower than average in the north of the country.
This is the final product we were looking for, according to the goals specified at the beginning of this section. However, we are not done yet. In the next chapter, we will use spainT as one of the sample datasets while discussing how to produce plots with the ggplot2 package. Since ggplot2 is a general-purpose graphical package, the input data must come as a data.frame object. Moreover, the data table should be tidy (see the previous chapter) in order to conveniently exploit all of the possibilities this package offers. What does that mean in the present case? It means that we need to have a data.frame object with all the data currently held in spainT, with each column corresponding to a variable and each row corresponding to an observation. In this case, the variables are as follows:

- The spatial coordinates (x and y)
- The measured variable ("mmnt" or "mmxt")
- The year (2006 to 2010)
- The predicted temperature value
Each row in such a data.frame object will correspond to one pixel of an individual layer in spainT.
Luckily, the as.data.frame function (which we encountered in the previous chapter, when converting a matrix to a data.frame) also has a method to convert rasters to data.frame objects. By setting xy=TRUE, we specify that we would like to have a table not only with the raster values (from all of its layers) but also with the respective spatial coordinates:
> spainT = as.data.frame(spainT, xy = TRUE)
> head(spainT)
           x       y mmnt_2006 mmnt_2007 mmnt_2008 mmnt_2009
1 -11957.925 4857128        NA        NA        NA        NA
2  -8357.925 4857128        NA        NA        NA        NA
3  -4757.925 4857128        NA        NA        NA        NA
4  -1157.925 4857128        NA        NA        NA        NA
5   2442.075 4857128        NA        NA        NA        NA
6   6042.075 4857128        NA        NA        NA        NA
  mmnt_2010 mmxt_2006 mmxt_2007 mmxt_2008 mmxt_2009 mmxt_2010
1        NA        NA        NA        NA        NA        NA
2        NA        NA        NA        NA        NA        NA
3        NA        NA        NA        NA        NA        NA
4        NA        NA        NA        NA        NA        NA
5        NA        NA        NA        NA        NA        NA
6        NA        NA        NA        NA        NA        NA
Each row in this table represents a single pixel, with its spatial coordinates and its values in each layer specified in the respective columns. However, from the first few rows, we can see that NA cells were also included. Since they will not be plotted anyway, we can readily exclude them:
> spainT = spainT[complete.cases(spainT), ]
> head(spainT)
            x       y mmnt_2006    mmnt_2007   mmnt_2008 mmnt_2009
33  103242.08 4857128 0.7383187 -0.070727914 0.008270229 0.5218489
34  106842.08 4857128 0.7392765 -0.055515954 0.024604441 0.5227165
38  121242.08 4857128 0.7458803 -0.018719502 0.063198494 0.5252469
39  124842.08 4857128 0.7456079  0.002858330 0.084895939 0.5240072
40  128442.08 4857128 0.7454912  0.023069161 0.105275777 0.5226526
347  92442.08 4853528 0.7061661 -0.006305609 0.051584103 0.5112048
    mmnt_2010 mmxt_2006 mmxt_2007 mmxt_2008 mmxt_2009 mmxt_2010
33  -1.197710  2.012048 -2.361182  1.122896  1.941084 -2.714846
34  -1.231081  1.988164 -2.341944  1.118404  1.918249 -2.682873
38  -1.315606  1.932082 -2.291416  1.103507  1.857861 -2.602034
39  -1.357369  1.894492 -2.252492  1.094646  1.809158 -2.545804
40  -1.396489  1.859306 -2.215895  1.086683  1.763085 -2.493179
347 -1.262649  1.870778 -2.157029  1.059672  1.696265 -2.469685
Two things are left to be done to bring this data.frame object to the desired form:

1. Transform the table into a tidy form, with a single value column, keeping all other variables in separate individual columns. This can be achieved with the melt function (see Chapter 3, Working with Tables, for more details), specifying the "x" and "y" columns as the ID variables so that the other columns are transferred into a single variable column:

> library(reshape2)
> spainT = melt(spainT, id.vars = c("x", "y"))
> head(spainT)
          x       y  variable     value
1 103242.08 4857128 mmnt_2006 0.7383187
2 106842.08 4857128 mmnt_2006 0.7392765
3 121242.08 4857128 mmnt_2006 0.7458803
4 124842.08 4857128 mmnt_2006 0.7456079
5 128442.08 4857128 mmnt_2006 0.7454912
6  92442.08 4853528 mmnt_2006 0.7061661
2. Split the variable column into two: the variable itself ("mmnt" or "mmxt") and the year (2006, 2007, 2008, 2009, or 2010). Substring extraction with substr (see Chapter 2, Working with Vectors and Time Series, for more information) comes in handy for this purpose, since characters 1-4 in each element of spainT$variable consistently correspond to the variable, and characters 6-9 correspond to the year (to be sure, there are other ways to extract substrings in less convenient scenarios; for instance, the strsplit function could also be used in the present context). Using the following two expressions, we first create a new (numeric) column holding the year values, and then modify the variable column to retain just the variable names:

> spainT$year = as.numeric(substr(spainT$variable, 6, 9))
> spainT$variable = substr(spainT$variable, 1, 4)
> head(spainT)
          x       y variable     value year
1 103242.08 4857128     mmnt 0.7383187 2006
2 106842.08 4857128     mmnt 0.7392765 2006
3 121242.08 4857128     mmnt 0.7458803 2006
4 124842.08 4857128     mmnt 0.7456079 2006
5 128442.08 4857128     mmnt 0.7454912 2006
6  92442.08 4853528     mmnt 0.7061661 2006
The final data.frame object is now complete.