Spatial interpolation is an example of a geostatistical analysis technique with a wide range of applications. In this chapter, we are going to learn how spatial interpolation can be carried out in R through examples of interpolating meteorological point measurements to create annual temperature maps of Spain.
The purpose of this exercise is two-fold. First, we will see how several common interpolation methods are applied in practice in R. We will see, for instance, that specialized classes are used to represent the input data and/or the results of statistical analyses in R, and witness the advantages of such an approach. Second, we will see how, through the use of loops, we can automate complex tasks such as spatial interpolation, and perform them repeatedly in order to accomplish otherwise unfeasible tasks.
In this chapter, we are going to use objects previously created in Chapter 3, Working with Tables.
In this chapter, we'll cover the following topics:
Spatial interpolation is the procedure by which the behavior of a certain phenomenon of interest is predicted in locations where it has not been measured. For this purpose, we need a spatial prediction model—a set of procedures to obtain the predicted values given the calibration data. The two types of calibration data usually encountered are:
The spatial prediction model of our choice is calibrated using the calibration data. This model can then be used to calculate the predicted level of the phenomenon of interest in any location (usually points). The two main types of spatial interpolation methods recognized are:
Simply put, while both types of models use the calibration data to make predictions, deterministic models are predefined using fixed formulas to calculate predicted values given the calibration data; on the other hand, statistical models use the calibration data twice, first to fine-tune the model itself and then (as in the deterministic models) to calculate the predicted values. Examples of deterministic models are inverse distance weighted (IDW) interpolation and splines. Examples of statistical models are kriging and ordinary regression.
Our purpose in this chapter is to present spatial interpolation from a practical point of view. Comprehensively covering the statistical theory and equations behind spatial interpolation techniques is well beyond the scope of this book. Readers familiar with the subject of spatial interpolation will, hopefully, get a perspective from this chapter on the way statistical tools such as these are applied in R. Readers unfamiliar with the subject of spatial interpolation are referred to the not-too-technical overview in the first two chapters of Tomislav Hengl's book A Practical Guide to Geostatistical Mapping, published in 2009 (available online), for supplementary reference. Those uninterested in this particular subject may skip to the last image in this chapter, and the few paragraphs that follow, where a dataset that we are going to use in the next chapter is described.
Two different approaches, or point of views, in selecting the appropriate spatial prediction model are also worth mentioning:
Here, we strictly follow the first approach; we will test several commonly used models to see which one produces the least number of prediction errors, and then use it to produce continuous maps of temperature in Spain. However, we should remember that we are not investigating the mechanisms underlying the behavior of temperature in space, which would suggest the second approach. Diggle and Ribeiro's book Model-based Geostatistics (2007) is a good example that focuses on the latter approach utilizing another R package (geoR
) for geostatistical analysis, which we do not cover here.
In the examples in this chapter, we will be working with three data sources: the spain_stations.csv
and spain_annual.csv
tables that hold the annual climatic data from Spain (see Chapter 3, Working with Tables), and the spain_elev.tif
raster with Spain's DEM (see the previous chapter). The first thing we will do is assemble these objects into memory. We will also perform a four-fold aggregation of the DEM, from a 900 meter resolution to a 3,600 meter resolution, in order to make subsequent calculations faster:
> library(raster) > dem_spain = raster("C:\Data\spain_elev.tif") > dem_spain = aggregate(dem_spain, 4) > spain_stations = read.csv("C:\Data\spain_stations.csv", + stringsAsFactors = FALSE) > spain_annual = read.csv("C:\Data\spain_annual.csv", + stringsAsFactors = FALSE)
As we have already seen in previous chapters, spain_stations
holds spatial location information for the 96 meteorological stations of Spain, while spain_annual
holds the meteorological data itself. The tables can be joined by the common station
column. The meteorological data consists of average measurements of three variables (precipitation, minimum temperature, and maximum temperature) obtained in different years (1984-2013). We are going to interpolate measurements for different year/variable combinations.
Therefore, it will be useful to write a function that accepts the variable and year we are currently interested in, and returns a point layer with the required data. The code for such a function consists of procedures already familiar to the reader from the previous chapters, so we will review its contents briefly. The function accepts five arguments:
stations
: A data.frame
object with location data for stations (in our case, spain_stations
)annual
: A data.frame
object with the annual meteorological data (in our case, spain_annual
)year
: The year we would like to get meteorological data forvariable
: The meteorological variable we would like to get data fornew_proj
: The CRS for the output SpatialPointsDataFrame
object as a PROJ.4 stringUtilizing these arguments, the function performs the following steps:
stations
table and converts it to a SpatialPointsDataFrame
object.stations
(geographical CRS).-10
; see the map of Spain's mainland and the Canary Islands to understand why).variable
and the required year
.stations
.NA
for the respective year/variable combination.new_proj
and returns the resulting object.The function code is as follows:
> create_pnt = function(stations,annual,year,variable,new_proj) { + library(plyr) + # (1) Promoting stations to SpatialPointsDataFrame + coordinates(stations) = ~ longitude + latitude + # (2) Defining geographic CRS + proj4string(stations) = CRS("+proj=longlat +datum=WGS84") + # (3) Removing Canary Islands stations + stations = stations[coordinates(stations)[, 1] > -10, ] + # (4) Subsetting climatic data + annual = annual[ + annual$year == year & + annual$variable == variable, ] + # (5) Joining meteorological data with stations layer + stations@data = join(stations@data, annual, by = "station") + # (6) Removing incomplete records + stations = stations[complete.cases(stations@data), ] + # (7) transforming to the required CRS + spTransform(stations, CRS(new_proj)) + }
Note that the stations
and annual
arguments are local objects that exist only when the function code is executed; they are not related to the spain_annual
and spain_stations
objects of the global environment, which are left unaltered (see Chapter 2, Working with Vectors and Time Series).
After reading the function into memory, let's give it a try. The following function call creates, for example, a SpatialPointsDataFrame
object containing average annual minimum temperature for 2002 defined in the CRS of dem_spain
:
> dat = create_pnt(stations = spain_stations, + annual = spain_annual, + year = 2002, + variable = "mmnt", + new_proj = proj4string(dem_spain))
We can examine the first few rows of the attribute table of the resulting object dat
to see that it, indeed, contains minimum temperature values ("mmnt"
) from the year 2002
:
> head(dat@data) station elevation variable year value 1 GHCND:SP000003195 667 mmnt 2002 10.475000 2 GHCND:SP000004452 185 mmnt 2002 10.775000 3 GHCND:SP000006155 7 mmnt 2002 14.491667 4 GHCND:SP000008027 251 mmnt 2002 11.100000 5 GHCND:SP000008181 4 mmnt 2002 11.608333 6 GHCND:SP000008202 790 mmnt 2002 5.983333
Since the Canary Island stations were removed altogether (see step 3), and stations that do not have a minimum temperature record for 2002 were subsequently filtered out (see step 6), the number of features in the point layer is always smaller than 96 and differs between year/variable combinations. For example, dat
consists of 75 points:
> nrow(dat) [1] 75
In the following sections, you are going to learn how to interpolate the dat
temperature records using different methods. Afterwards, we are going to write code to automatically evaluate the prediction error for each model. Finally, we will interpolate data for several years and variables at once using a loop.
Before we begin our journey through the three main interpolation methods featured in this chapter (see the next three sections), it is worth pointing out the principle common to most spatial interpolation methods: that the states of the phenomenon of interest are more similar among locations nearer to each other than among locations further apart. In other words, the phenomenon is autocorrelated in space (otherwise predictions should simply be reduced to a global mean, or calculated based on a nonspatial prediction model, such as ordinary regression). With spatial autocorrelation present, it makes sense for predicted values to be more similar to measured values that are nearest to them, and less similar to measured values further apart. For example, if we have a temperature measurement of 20°C at location A, it would be reasonable to assume that the temperature in location B, which is, say, 100 meters away from A, will be fairly similar to 20°C since the air temperature is a spatially autocorrelated phenomenon. The way and the degree to which measured values affect predictions as a function of distance is the main feature that differentiates the various interpolation methods.
The simplest interpolation method based on spatial autocorrelation is nearest-neighbor interpolation. However, it is rarely used in practice in the context of mapping for reasons that will become apparent shortly. It is hereby presented only to demonstrate the concept of spatial interpolation in its simplest form.
In nearest-neighbor interpolation, each predicted location simply takes the value of the nearest measured location. For instance, returning to the previous temperature example, location B will receive a predicted value of exactly 20°C unless there is another measured point within fewer than 100 meters from B (and then it will determine B's predicted value instead of A). We will now perform nearest-neighbor interpolation of the temperature data in dat
to understand this point more clearly. We will hereby write code that performs nearest-neighbor interpolation. It is going to consist of the following steps:
grid
).dat
is the one closest to each point in grid
.dat
to the grid
points.As for the first step, prediction points are most commonly determined using a regular grid (that is, a raster) so that as a result of interpolation we would get a continuous predicted surface. Following this approach, we will use dem_spain
for our grid of prediction points. However, since we are going to calculate distances between pairs of points using the gDistance
function (which operates on vector layers; see Chapter 5, Working with Points, Lines, and Polygons), the raster cells should first be converted to points with rasterToPoints
(see the previous chapter) as follows:
> grid = rasterToPoints(dem_spain, spatial = TRUE)
Moving on to the second step, now we can calculate the distance matrix between each point in dat
and each point in grid
:
> library(rgeos) > dist = gDistance(dat, grid, byid = TRUE)
The resulting object, dist
, is a matrix
with 39,250 rows (corresponding to the grid
features, that is, the number of cells of dem_spain
that are not NA
) and 75 columns (corresponding to the dat
features):
> dim(dist) [1] 39250 75
The values of this matrix are the pairwise distances, in meters, between the dat
and grid
features. In order to assign the value of the nearest points in dat
to grid
, we need to find out which point is the nearest neighbor in each case. This can be achieved with apply
and which.min
. The following expression yields a vector of indices indicating the minimal element in each row of dist
:
> nearest_dat = apply(dist, 1, which.min)
Since nearest_dat
now holds the indices of the dat
features from which we need to obtain the temperature value for each predicted point in grid
, the following expression assigns those values to a new column, named nn
, in the attribute table of grid
:
> grid$nn = dat$value[nearest_dat]
We have now completed the third step. What is left is to convert grid
back to a raster, that is, to rasterize it (see the previous chapter) for easier visualization. The following expression rasterizes grid
, with the field
parameter set to the "nn"
attribute table column, thus transferring its values to the dem_spain
raster:
> grid = rasterize(grid, dem_spain, "nn")
The result is a nearest-neighbor predicted surface with the value of each pixel being the predicted temperature at the respective location. Let's see what it looks like using the following expressions:
> plot(grid) > plot(dat, add = TRUE)
The graphical output is shown in the following screenshot:
The preceding screenshot clearly shows that the nearest-neighbor interpolation method has, in practice, divided the area of interest into discrete subareas. Each subarea includes all the locations that are more proximate to a given measured point (that is, a given meteorological station) than to any other. In fact, we could have also carried out nearest-neighbor interpolation by creating polygons defining such subareas and rasterized them, to produce the same result. Within the context of GIS, such polygons are known as Thiessen or Voronoi polygons (interested readers will find at least one way to create them in R; for example, using the dirichlet
function of the spatstat
package).
As mentioned earlier, nearest-neighbor interpolation is rarely useful to predict the behavior of natural phenomena in space. For example, it is clearly unrealistic to assume that discrete polygonal areas surrounding each meteorological station have uniform temperatures, with a sharp increase or decrease in temperature, along the borders between them (as the previous screenshot suggests). In reality, the temperature changes more or less gradually from place to place. To describe such gradual transitions, we need to use more elaborate methods, such as those described in the upcoming sections.
The three interpolation methods we will employ next—IDW, Ordinary Kriging (OK), and Universal Kriging (UK)—all utilize the same principal procedure, where the predicted value of a given point is determined as a weighted average of the values from the measured points. Moreover, the weight of each measured value is always a function of its distance from the point we are trying to predict (a nearby point usually having a higher weight, or influence, on the predicted value than a point further away). Within this framework, nearest-neighbor interpolation is, in fact, an extreme private case, where the weight of the nearest point is 1 while the weights of all other points are 0.
To better understand the subsequent material in this chapter, we can already state that the differences between the three methods concern two properties: the trend and the weights definition. The trend is basically an additional function, independent of the measured points values, added to the weighted average of the latter. In IDW, the trend is 0 and in OK it is a constant value, thus having no effect on the predicted pattern in either case. However, in UK, the trend is a function of additional covariates. Regarding the weights, in IDW they are arbitrarily determined (thus the method is considered deterministic); in OK and UK, conversely, they are estimated from the data itself (thus the methods are considered statistical).
As noted earlier, the predicted values in all three methods are the weighted averages of measured points:
Here, is the predicted value at location 0, is the measured value i, and is the weight for the measured value i, while n is the total number of measured points.
In IDW, the weight is a function of the inverse distance, shown as follows:
Here, is the geographic distance between the measured point i and the predicted point at location 0 to the power , and n is the total number of measured points. The default value for is 2. This means that the importance of each measured point in determining a predicted value diminishes as a function of squared distance. When is smaller, the predicted surface will be smoother; when is larger, the predicted surface will be less smooth, giving more emphasis to the nearest neighbor.
We will now interpolate the temperature data in dat
using the IDW method. To apply this method, and subsequent ones later in this chapter, we will use functions in the gstat
package. This package provides extensive capability for univariate and multivariate geostatistical analysis, of which we will see a few examples.
For an overview of gstat
, see the official introductory document at http://cran.r-project.org/web/packages/gstat/vignettes/gstat.pdf.
The core of the gstat
package is a function named gstat
, which is used to produce objects (of a class also named gstat
) that hold all necessary information to perform spatial interpolation (as well as other operations such as cross-validation). Among the numerous parameters that the gstat
function accepts, there is, in fact, no parameter where we can specify the desired interpolation method. Instead, when interpolation is triggered on a gstat
object, the method is determined based on the input data the object contains. In this chapter, we are going to explore only a portion of the functionality of gstat
, utilizing three of its parameters:
formula
: The formula that defines the dependent and independent variables (an object of the formula
class)data
: The measured point data (a SpatialPointsDataFrame
object; it can also be a data.frame
object but we will not use this option)model
: The variogram model (an object of the variogramModel
class)These three parameters determine whether IDW, OK, or UK is used based on the following decision tree:
model=NULL
, the default), IDW is usedmodel
, and the formula
:Our dependent variable column is value
and an intercept-only model (that is, a model with no independent variables) formula in R is, by convention, specified by ~1
. Therefore, the following expression creates a gstat
object (named g
) that holds the necessary information to predict the minimum temperature based on the measured data in dat
using IDW:
> library(gstat) > g = gstat(formula = value ~ 1, data = dat)
Printing g
shows a summary of the information it contains as follows:
> print(g) data: var1 : formula = value`~`1 ; data dim = 75 x 5
To interpolate, what we now need is to provide the points for which we would like to make predictions. Conveniently, the raster
function includes a function named interpolate
that, given a raster (the object
parameter) and a spatial prediction model (the model
parameter), yields a new raster with predicted values. This way, we do not need to make the manual conversions to and from a raster, as we did in nearest-neighbor interpolation. For example, the following expression uses the model and calibration data held in g
to make temperature predictions for the grid defined by dem_spain
:
> z = interpolate(dem_spain, g)
There are two things to note here:
[inverse distance weighted interpolation]
appearing on the screen. This reassures the user that indeed the expected type of model has been chosen by gstat
.dem_spain
raster serves only as a grid pointing to the locations where we would like to calculate predicted values (the raster values play no role in the interpolation). This is opposed to UK interpolation, where the raster serves both to point at predicted locations and provide covariate values for each location.Plotting the predicted surface z
will show that predicted values were generated for all raster cells (including NA
cells). It is usually reasonable to make predictions for a more specific extent, such as the land area of Spain. We can use mask
to remove the unnecessary predictions:
> z = mask(z, dem_spain)
Now let's plot the predicted value raster, and the station locations on top of it, with the following pair of expressions:
> plot(z) > plot(dat, add = TRUE)
The graphical output is shown in the following screenshot:
In the preceding screenshot, we see the predicted average minimum temperature for 2002 in Spain. It seems the central-northern parts of the country were generally colder, while the south was warmer (this general pattern was also evident in the nearest-neighbor interpolation image). The effect of the station locations on the predicted surface is also notable. As you remember, the weight of a given station increases as the predicted point is nearer. Therefore, each point generates a sort of circular zone of influence around itself, especially if its value markedly differs from the values of its proximate neighbors.
To understand the role of the parameter (see equation (2)) more vividly, let's produce two more IDW-predicted surfaces: one with set to a very small value (say, 0.3
) and one with set to a very large value (say, 30
). Then, we will compare the results with the preceding graphical output (where was set to 2
, which is the default value). The following code generates the two additional predicted surfaces and assigns them to objects z1
and z2
, respectively:
> g1 = gstat(formula = value ~ 1, data = dat, set = list(idp=0.3)) > g2 = gstat(formula = value ~ 1, data = dat, set = list(idp=30)) > z1 = interpolate(dem_spain, g1) > z2 = interpolate(dem_spain, g2) > z1 = mask(z1, dem_spain) > z2 = mask(z2, dem_spain)
Note that the set=list(idp=0.3)
part determines the value of in gstat
. Using the following expression, we can visualize the three rasters: z1
, z
, and z2
(the pch
and cex
parameters are used to control the symbol type and size, respectively, when drawing the dat
points):
> plot(z1, main = c("beta = 0.3")) > plot(dat, add = TRUE, pch = 20, cex = 0.5) > plot(z, main = c("beta = 2")) > plot(dat, add = TRUE, pch = 20, cex = 0.5) > plot(z2, main = c("beta = 30")) > plot(dat, add = TRUE, pch = 20, cex = 0.5)
The graphical outputs shown here, side by side, are produced as a result. The respective values of are stated above each output:
These images demonstrate the previously stated expectations. Decreasing (the leftmost image) results in a smoother surface due to more homogeneous weight distribution among the measured points. On the other hand, increasing (the rightmost image), results in sharper boundaries between the zones of influence of each point, resembling nearest-neighbor interpolation results in appearance.
Now that we have produced several interpolation results, how can we assess the predictive ability of this (or any other) interpolation method? The obvious way is to compare predicted values to observed ones, preferably when the validation points come from an independent dataset—one that did not participate in model calibration. Since measurements are usually scarce and valuable (we do not have an infinite number of meteorological stations), however, we would like to include each and every measurement in the model calibration, leaving no measurements for validation. The compromise approach to this problem is the process of cross-validation, where we set aside some of the observed points for validation. In the process of cross-validation, we then try to predict their values using the remaining points and compare the predicted values to the observed ones. A special case of cross-validation is leave-one-out cross-validation (LOOCV), where every observed point is removed in turn, its value being predicted based on the remaining points to obtain a table with observed and predicted values for all observations. Assessing the differences between observed and predicted values based on such a table gives us a measure of the accuracy of our prediction model.
The gstat.cv
function can automatically perform cross-validation given an object of the gstat
class. The default method is LOOCV; therefore, in the following expression, all we need to do is perform LOOCV on the IDW interpolation of temperature data:
> cv = gstat.cv(g)
This expression, in practice, triggers the execution of 75 spatial interpolation operations, each time leaving out one meteorological station and trying to predict its temperature value using the 74 remaining ones. The result is a SpatialPointsDataFrame
object with the same number of features as dat
and with the same spatial locations. Only the attribute table is different, now containing the cross-validation results:
> head(cv@data) var1.pred var1.var observed residual zscore fold 1 8.971923 NA 10.475000 1.5030771 NA 1 2 10.244447 NA 10.775000 0.5305532 NA 2 3 10.914591 NA 14.491667 3.5770756 NA 3 4 9.992180 NA 11.100000 1.1078202 NA 4 5 12.024766 NA 11.608333 -0.4164331 NA 5 6 7.435642 NA 5.983333 -1.4523087 NA 6
The attribute table columns of the cross-validation result refer to the following information:
var1.pred
: Predicted valuevar1.var
: Variance (only for kriging)observed
: Observed valueresidual
: Residual, the difference between observed and predicted valueszscore
: Z-score (only for kriging)fold
: Cross-validation countThe important column, for our purpose here, is residual
—the difference between the observed and predicted values. Based on the residuals, we can assess how far predicted values are from the observed ones, or in other words, prediction accuracy. One of the simplest and most common metrics of agreement between observed and predicted values is the RMSE. Given a set of predicted-observed value pairs, the RMSE is calculated as follows:
Here, RMSE is the root mean square error, n is the overall number of points, and and are the predicted and observed values at point i, respectively. As the difference between predicted and observed values is smaller, the RMSE will be lower, indicating that the model is more accurate. Given the same input data, we can compare the RMSE values among different models to select the most accurate one. Translating equation (3) to the R syntax, we can calculate RMSE based on the attribute table of cv
as follows:
> sqrt(sum((-cv$residual)^2)/nrow(cv)) [1] 1.88049
The RMSE in this case is equal to 1.88049
. Note that the expression consists of the -cv$residual
vector (equivalent to cv$var1.pred-cv$observed
) squared, then summed, and divided by the total number of points. Finally, the square root of the result is extracted.
For convenience, we can wrap this expression into our own function, which calculates RMSE, given a gstat.cv
output:
> rmse = function(x) sqrt(sum((-x$residual)^2)/nrow(x)) > rmse(cv) [1] 1.88049
We will use this function later.
You may have noticed that the gstat
object we initially defined (g
) was hereby used for two distinct and independent purposes. First, to calculate predicted values (using the interpolate
function), then to perform cross-validation (using the gstat.cv
function). This is not incidental, but rather a characteristic object-oriented behavior common to all, or most, of R's statistical procedures. This approach is advantageous over the menu-based approach commonly encountered in statistical or GIS software, where we carry out a given analysis by making selections in a set of dialog boxes, and receive a set of results (some of which we do not need, while others are lacking since the tool's creator did not include them). Conversely, when a statistical function is applied in R, we usually get an object that holds all the data that is necessary to carry out the analysis (such as g
) or an object that holds a comprehensive set of results (such as cv
). From such objects, we can then derive the results we are interested in by applying specific functions. Moreover, we can create functions of our own to carry out calculations or extract meaningful data, for which no built-in methods have been defined (such as the rmse
function we defined earlier). Readers who are interested in using R as a general statistical analysis toolbox will repeatedly encounter this approach. For example, the lm
function is used to perform ordinary linear regression in R; on its returned object, of class lm
, numerous functions (such as summary
, residuals
, and plot
) may be applied to derive the information we are interested in.
In kriging, as mentioned earlier, the weight given to each measurement when calculating a predicted value (see equation (1)) is determined statistically, rather than arbitrarily (such as by choosing the value of ourselves in equation (2)). The function determining weights in kriging is called a variogram model. The variogram model is a function fitted to the empirical variogram, which in turn describes the spatial autocorrelation structure of the observed pattern. What is important to understand is that the empirical variogram describes the average degree of autocorrelation between observed values, and the variogram model is a continuous function fitted to these data. The variogram model determines, in plain terms, the importance of points nearby and further away in the calculation of predicted values. For example, if autocorrelation is high over short distances, it would make sense to let nearby points largely determine the predicted value, making the predicted surface rougher; when autocorrelation is low, the weight distribution over distance may be more uniform and the predicted surface smoother. Elaborating on the subject of variogram modeling is beyond the scope of this book; interested readers can find more information in A practical guide to geostatistical mapping of environmental variables, Hengle, T. (2007) or in any textbook on geostatistics.
An empirical variogram can be calculated using the variogram
function, once again based on a gstat
object. Let's take a look at the following example:
> ev = variogram(g)
The returned object of the gstatVariogram
class contains semivariance values (expressing the degree of correlation) for different distance bins. Note that there are numerous optional parameters controlling the way a variogram is computed, that we will not go into, such as the way the latter bins are determined (see ?variogram
). The plot
method for a gstatVariogram
object plots the empirical variogram as follows:
> plot(ev)
The following screenshot shows how the plot appears in our case:
We can see that semivariance increases with distance. This means that temperature values are more diverse when considering stations further apart. In other words, temperature is spatially autocorrelated.
As mentioned earlier, a variogram model is a function fitted to the empirical variogram. There are several specific functions that are adequate to describe the characteristic behavior of an empirical variogram (increasing semivariance with distance, usually reaching saturation at some point). These functions, such as the spherical and exponential functions, are generally used for a variogram model (type vgm()
for a list of the functions implemented in gstat
or show.vgms()
for a visual). In addition to choosing one of these functions, we need to decide on its specific parameters. There are three main approaches, not mutually exclusive, to select a variogram model: a function plus its set of parameters (based on an empirical variogram):
vgm
function from the gstat
package)eyefit
function from the geoR
package)fit.variogram
function from the gstat
package, the variofit
function from the geoR
package, or the autofitVariogram
function from the automap
package)Since we are hereby taking a simple, practical approach, we will use the third option of letting the computer select the function that follows the empirical variogram most closely. Moreover, we are going to bypass the requirement to supply the function type and initial parameter estimates (as required, for example, in fit.variogram
) and use a wrapper function named autofitVariogram
that automatically optimizes this decision for us.
The autofitVariogram
function is defined in the automap
package. It accepts the formula
and data
arguments (named input_data
) analogously for gstat
. The following expression, therefore, fits a variogram model to our data:
> library(automap) > v = autofitVariogram(formula = value ~ 1, input_data = dat)
Plotting the returned object of the autofitVariogram
class generates a plot with an empirical variogram and the fitted variogram model on top of it:
> plot(v)
The resulting output is shown in the following screenshot:
The points, once again, show an empirical variogram, while the continuous line is the fitted variogram model. In this case, for instance, the selected model was "Matern, M. Stein's parameterization" (code name "Ste"
), with the four parameter values stated in the bottom-right corner of the image. The object v
contains several components, such as the empirical variogram data points and the sums of squares between the sample variogram and the fitted variogram model. The component we are interested in is the fitted variogram model. This components' name is var_model
, and it can be accessed with the $
operator, using the expression v$var_model
(this behavior stems from the fact that an autofitVariogram
class is built upon a list
, and list elements can be accessed the same way as data.frame
columns with $
).
The important point here is that the v$var_model
component of v
is an object of class variogramModel
, containing the function type and parameters defining a variogram model. A variogramModel
object can be passed to the gstat
function to indicate a variogram model. Therefore, we can create an object containing all the necessary information to perform the OK interpolation as follows:
> g = gstat(formula = value ~ 1, data = dat, model = v$var_model) > g data: var1 : formula = value`~`1 ; data dim = 75 x 5 variograms: model psill range kappa var1[1] Nug 0.4594258 0.0 0.0 var1[2] Ste 10.9549186 626562.4 0.3
The printed output shows that the object indeed contains a variogram model. We can now use the interpolate
function to produce the predicted OK surface, and the mask
function to clip the area of interest, exactly the same way we did in IDW interpolation:
> z = interpolate(dem_spain, g) > z = mask(z, dem_spain)
Those readers who execute the above expression in R should see, this time, the message [using ordinary kriging]
on screen.
Let's plot the interpolation result with the following expressions:
> plot(z) > plot(dat, add = TRUE)
The graphical output is shown in the following screenshot:
One of the notable visual differences in predicted rasters between IDW and kriging, in general, is that the weight of a given measured point always approaches 1 in IDW as we get nearer. Therefore, the values around each measurement point tend to approach the measured value itself (see the previous screenshot); in kriging, however, this is not necessarily so.
We can now perform LOOCV of the OK prediction model in order to compare the resulting RMSE with that of IDW:
> cv = gstat.cv(g) > rmse(cv) [1] 1.680153
The RMSE is smaller in this case, which suggests that OK produces more accurate prediction than IDW interpolation.
Universal Kriging (also referred to as Kriging with External Drift or Regression Kriging) is, in fact, a general model of which OK is a special case. While both methods involve spatial autocorrelation modeling, they differ in the definition of the trend. In OK, the trend is a constant value, while the trend is a linear function of one or more covariates in UK. In other words, in OK the predicted value is some constant plus a weighted function of neighboring measurements, while it is a predicted value based on covariates plus a weighted function of neighboring measurements in UK.
For instance, it is well known that temperature is negatively correlated with elevation (it gets colder as we climb to a higher altitude). Using a simple scatterplot, we can see that this rule holds true in the present case as well:
> plot(value ~ elevation, dat@data, + xlab = "Elevation (m)", + ylab = "Temperature (degrees Celsius)")
The resulting plot clearly shows the negative relationship between elevation and temperature, although the relation is not perfect:
Sometimes the temperature for a given altitude is lower than usual and sometimes higher. Can we create a model where the temperature is predicted using elevation and then fine-tuned using nearby meteorological measurements? UK does exactly this by adding a weighted average of stations values (as OK does) to a general trend surface defined by a linear function based on covariates (in this case, elevation).
To interpolate the temperature data using UK with elevation as a covariate, we need to make two principal alterations compared to the OK procedure:
Starting with the variogram model, as opposed to OK, our formula now contains elevation as an independent variable:
> v = autofitVariogram(formula = value ~ elevation, + input_data = dat)
Plotting the new v
object will show that both the empirical variogram and the variogram model are different from what we previously had since they now concern the residuals from the elevation trend. Next, we need to create the gstat
object, which contains the information to perform UK interpolation, supplying value~elevation
as the formula and v$var_model
as the variogram model:
> g = gstat(formula = value ~ elevation, + data = dat, + model = v$var_model) > g data: var1 : formula = value`~`elevation ; data dim = 75 x 5 variograms: model psill range kappa var1[1] Nug 0.7410872 0.00 0.0 var1[2] Ste 0.9696567 35781.93 0.2
By supplying the appropriate formula in the variogram fitting stage and in the gstat
function call, we dealt with the first issue. As for the second issue, it is necessary for us to provide an elevation value for each point where we try to predict the temperature (otherwise the trend cannot be calculated). Since the points are specified in the form of a raster, it is only natural that the covariate values will come from the raster values themselves. This is exactly what the interpolate
function expects, but in order for this function to correctly identify which raster layers contain the values of the covariates, the respective layers must be named exactly the same way as the covariates are (in the formula
passed to autofitVariogram
and gstat
, as well as in the measurements point layer). In our case, there is only one covariate (elevation
) and one layer in the dem_spain
raster. Therefore, we just need to make sure that layer is named "elevation"
:
> names(dem_spain) [1] "spain_elev" > names(dem_spain) = "elevation" > names(dem_spain) [1] "elevation"
In addition, we need to make one final adjustment to the interpolate
function call—setting xyOnly=FALSE
, which instructs the interpolate
function to consider the raster values as covariates (rather than just coordinates of prediction points). The option xyOnly=FALSE
should be specified whenever we have a model with covariates:
> z = interpolate(dem_spain, g, xyOnly = FALSE)
When running the previous expression, you should see the [using universal kriging]
message on screen. Let's plot the UK predictions:
> z = mask(z, dem_spain) > plot(z) > plot(dat, add = TRUE)
The graphical output is shown in the following screenshot:
The preceding screenshot quite obviously resembles Spain's DEM. In fact, it shows the elevation profile of Spain transposed to temperature units (using a linear relation based on the data from meteorological stations) and locally calibrated using, once again, the stations data. This is what UK results commonly look like.
Let's examine the RMSE of the UK prediction model as well:
> cv = gstat.cv(g) > rmse(cv) [1] 1.455196
We see that the RMSE of UK, in this case, is even lower than that of OK. This result suggests that utilizing the value~elevation
trend has further improved prediction accuracy.