Chapter 8. Spatial Interpolation of Point Data

Spatial interpolation is an example of a geostatistical analysis technique with a wide range of applications. In this chapter, we are going to learn how spatial interpolation can be carried out in R through examples of interpolating meteorological point measurements to create annual temperature maps of Spain.

The purpose of this exercise is two-fold. First, we will see how several common interpolation methods are applied in practice in R. We will see, for instance, that specialized classes are used to represent the input data and/or the results of statistical analyses in R, and witness the advantages of such an approach. Second, we will see how, through the use of loops, we can automate complex tasks such as spatial interpolation, and perform them repeatedly in order to accomplish otherwise unfeasible tasks.

Note

In this chapter, we are going to use objects previously created in Chapter 3, Working with Tables.

In this chapter, we'll cover the following topics:

  • Spatially interpolating point data
  • Calculating an empirical variogram
  • Automatically fitting variogram models
  • Calculating the root mean square error (RMSE) of prediction

Spatially interpolating point data

Spatial interpolation is the procedure by which the behavior of a certain phenomenon of interest is predicted in locations where it has not been measured. For this purpose, we need a spatial prediction model—a set of procedures to obtain the predicted values given the calibration data. The two types of calibration data usually encountered are:

  • Field measurements: Available for a limited set of locations (usually points), for example, meteorological data from stations in Spain
  • Covariates: Available for each location within the area of interest, for example, elevation data from Spain's DEM

The spatial prediction model of our choice is calibrated using the calibration data. This model can then be used to calculate the predicted level of the phenomenon of interest in any location (usually points). The two main types of spatial interpolation methods recognized are:

  • Deterministic model: A model whose parameter values are arbitrarily determined
  • Statistical model: A model whose parameter values are objectively estimated from the calibration data

Simply put, while both types of models use the calibration data to make predictions, deterministic models are predefined using fixed formulas to calculate predicted values given the calibration data; on the other hand, statistical models use the calibration data twice, first to fine-tune the model itself and then (as in the deterministic models) to calculate the predicted values. Examples of deterministic models are inverse distance weighted (IDW) interpolation and splines. Examples of statistical models are kriging and ordinary regression.

Note

Our purpose in this chapter is to present spatial interpolation from a practical point of view. Comprehensively covering the statistical theory and equations behind spatial interpolation techniques is well beyond the scope of this book. Readers familiar with the subject of spatial interpolation will, hopefully, get a perspective from this chapter on the way statistical tools such as these are applied in R. Readers unfamiliar with the subject of spatial interpolation are referred to the not-too-technical overview in the first two chapters of Tomislav Hengl's book A Practical Guide to Geostatistical Mapping, published in 2009 (available online), for supplementary reference. Those uninterested in this particular subject may skip to the last image in this chapter, and the few paragraphs that follow, where a dataset that we are going to use in the next chapter is described.

Two different approaches, or points of view, in selecting the appropriate spatial prediction model are also worth mentioning:

  • If we wish to prepare the most accurate predictions of a certain phenomenon (for example, to produce a map), we will be mainly interested in selecting the model that minimizes prediction error (and may not care very much what kind of model that is)
  • If we are interested in better understanding the phenomenon itself, we will seek to find the best model characterizing the observed pattern and estimating its parameters and our confidence in them (while prediction error will be of secondary interest)

Here, we strictly follow the first approach; we will test several commonly used models to see which one yields the smallest prediction error, and then use it to produce continuous maps of temperature in Spain. However, we should remember that we are not investigating the mechanisms underlying the behavior of temperature in space, which would call for the second approach. Diggle and Ribeiro's book Model-based Geostatistics (2007) is a good example of a text that focuses on the latter approach, utilizing another R package (geoR) for geostatistical analysis, which we do not cover here.

In the examples in this chapter, we will be working with three data sources: the spain_stations.csv and spain_annual.csv tables that hold the annual climatic data from Spain (see Chapter 3, Working with Tables), and the spain_elev.tif raster with Spain's DEM (see the previous chapter). The first thing we will do is read these objects into memory. We will also perform a four-fold aggregation of the DEM, from a 900 meter resolution to a 3,600 meter resolution, in order to make subsequent calculations faster:

> library(raster)
> dem_spain = raster("C:\\Data\\spain_elev.tif")
> dem_spain = aggregate(dem_spain, 4)
> spain_stations = read.csv("C:\\Data\\spain_stations.csv",
+ stringsAsFactors = FALSE)
> spain_annual = read.csv("C:\\Data\\spain_annual.csv",
+ stringsAsFactors = FALSE)

As we have already seen in previous chapters, spain_stations holds spatial location information for the 96 meteorological stations of Spain, while spain_annual holds the meteorological data itself. The tables can be joined by the common station column. The meteorological data consists of average measurements of three variables (precipitation, minimum temperature, and maximum temperature) obtained in different years (1984-2013). We are going to interpolate measurements for different year/variable combinations.

Therefore, it will be useful to write a function that accepts the variable and year we are currently interested in, and returns a point layer with the required data. The code for such a function consists of procedures already familiar to the reader from the previous chapters, so we will review its contents briefly. The function accepts five arguments:

  • stations: A data.frame object with location data for stations (in our case, spain_stations)
  • annual: A data.frame object with the annual meteorological data (in our case, spain_annual)
  • year: The year we would like to get meteorological data for
  • variable: The meteorological variable we would like to get data for
  • new_proj: The CRS for the output SpatialPointsDataFrame object as a PROJ.4 string

Utilizing these arguments, the function performs the following steps:

  1. Takes the stations table and converts it to a SpatialPointsDataFrame object.
  2. Defines the CRS of stations (geographical CRS).
  3. Removes the Canary Islands stations (retaining stations east of the 10°W meridian, that is, stations with an x coordinate > -10; see the map of Spain's mainland and the Canary Islands to understand why).
  4. Subsets the meteorological variable and the required year.
  5. Joins the meteorological data with the attribute table of stations.
  6. Removes stations with NA for the respective year/variable combination.
  7. Transforms stations to the CRS defined by new_proj and returns the resulting object.

The function code is as follows:

> create_pnt = function(stations,annual,year,variable,new_proj) {
+ library(plyr)
+ # (1) Promoting stations to SpatialPointsDataFrame
+ coordinates(stations) = ~ longitude + latitude
+ # (2) Defining geographic CRS
+ proj4string(stations) = CRS("+proj=longlat +datum=WGS84")
+ # (3) Removing Canary Islands stations
+ stations = stations[coordinates(stations)[, 1] > -10, ]
+ # (4) Subsetting climatic data
+ annual = annual[
+ annual$year == year &
+ annual$variable == variable, ]
+ # (5) Joining meteorological data with stations layer
+ stations@data = join(stations@data, annual, by = "station")
+ # (6) Removing incomplete records
+ stations = stations[complete.cases(stations@data), ]
+ # (7) transforming to the required CRS
+ spTransform(stations, CRS(new_proj))
+ }

Note

Note that the stations and annual arguments are local objects that exist only when the function code is executed; they are not related to the spain_annual and spain_stations objects of the global environment, which are left unaltered (see Chapter 2, Working with Vectors and Time Series).
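
To make this behavior concrete, here is a tiny demonstration (a sketch; the function name zero_values is ours, for illustration only). Changes made to the argument inside the function act on a local copy, leaving the global spain_annual object untouched:

> zero_values = function(x) {
+ # Setting all values to 0 affects only the local copy of the table
+ x$value = 0
+ mean(x$value)
+ }
> zero_values(spain_annual)
> # The global object still holds the original values
> mean(spain_annual$value, na.rm = TRUE)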

After reading the function into memory, let's give it a try. The following function call creates, for example, a SpatialPointsDataFrame object containing average annual minimum temperature for 2002 defined in the CRS of dem_spain:

> dat = create_pnt(stations = spain_stations,
+ annual = spain_annual,
+ year = 2002,
+ variable = "mmnt",
+ new_proj = proj4string(dem_spain))

We can examine the first few rows of the attribute table of the resulting object dat to see that it, indeed, contains minimum temperature values ("mmnt") from the year 2002:

> head(dat@data)
            station elevation variable year     value
1 GHCND:SP000003195       667     mmnt 2002 10.475000
2 GHCND:SP000004452       185     mmnt 2002 10.775000
3 GHCND:SP000006155         7     mmnt 2002 14.491667
4 GHCND:SP000008027       251     mmnt 2002 11.100000
5 GHCND:SP000008181         4     mmnt 2002 11.608333
6 GHCND:SP000008202       790     mmnt 2002  5.983333

Since the Canary Island stations were removed altogether (see step 3), and stations that do not have a minimum temperature record for 2002 were subsequently filtered out (see step 6), the number of features in the point layer is always smaller than 96 and differs between year/variable combinations. For example, dat consists of 75 points:

> nrow(dat)
[1] 75

In the following sections, you are going to learn how to interpolate the dat temperature records using different methods. Afterwards, we are going to write code to automatically evaluate the prediction error for each model. Finally, we will interpolate data for several years and variables at once using a loop.

Nearest-neighbor interpolation

Before we begin our journey through the three main interpolation methods featured in this chapter (see the next three sections), it is worth pointing out the principle common to most spatial interpolation methods: that the states of the phenomenon of interest are more similar among locations nearer to each other than among locations further apart. In other words, the phenomenon is autocorrelated in space (otherwise predictions should simply be reduced to a global mean, or calculated based on a nonspatial prediction model, such as ordinary regression). With spatial autocorrelation present, it makes sense for predicted values to be more similar to measured values that are nearest to them, and less similar to measured values further apart. For example, if we have a temperature measurement of 20°C at location A, it would be reasonable to assume that the temperature in location B, which is, say, 100 meters away from A, will be fairly similar to 20°C since the air temperature is a spatially autocorrelated phenomenon. The way and the degree to which measured values affect predictions as a function of distance is the main feature that differentiates the various interpolation methods.

The simplest interpolation method based on spatial autocorrelation is nearest-neighbor interpolation. However, it is rarely used in practice in the context of mapping for reasons that will become apparent shortly. It is hereby presented only to demonstrate the concept of spatial interpolation in its simplest form.

In nearest-neighbor interpolation, each predicted location simply takes the value of the nearest measured location. For instance, returning to the previous temperature example, location B will receive a predicted value of exactly 20°C, unless there is another measured point less than 100 meters away from B (in which case that point, rather than A, will determine B's predicted value). To understand this more clearly, we will now write code that performs nearest-neighbor interpolation of the temperature data in dat. It is going to consist of the following steps:

  1. Creating a set of points for which we would like to make predictions (grid).
  2. Finding out which point in dat is the one closest to each point in grid.
  3. Assigning nearest-neighbor values of dat to the grid points.

As for the first step, prediction points are most commonly determined using a regular grid (that is, a raster) so that as a result of interpolation we would get a continuous predicted surface. Following this approach, we will use dem_spain for our grid of prediction points. However, since we are going to calculate distances between pairs of points using the gDistance function (which operates on vector layers; see Chapter 5, Working with Points, Lines, and Polygons), the raster cells should first be converted to points with rasterToPoints (see the previous chapter) as follows:

> grid = rasterToPoints(dem_spain, spatial = TRUE)

Moving on to the second step, now we can calculate the distance matrix between each point in dat and each point in grid:

> library(rgeos)
> dist = gDistance(dat, grid, byid = TRUE)

The resulting object, dist, is a matrix with 39,250 rows (corresponding to the grid features, that is, the number of cells of dem_spain that are not NA) and 75 columns (corresponding to the dat features):

> dim(dist)
[1] 39250    75

The values of this matrix are the pairwise distances, in meters, between the dat and grid features. In order to assign the value of the nearest points in dat to grid, we need to find out which point is the nearest neighbor in each case. This can be achieved with apply and which.min. The following expression yields a vector of indices indicating the minimal element in each row of dist:

> nearest_dat = apply(dist, 1, which.min)

Since nearest_dat now holds the indices of the dat features from which we need to obtain the temperature value for each predicted point in grid, the following expression assigns those values to a new column, named nn, in the attribute table of grid:

> grid$nn = dat$value[nearest_dat]

We have now completed the third step. What is left is to convert grid back to a raster, that is, to rasterize it (see the previous chapter) for easier visualization. The following expression rasterizes grid, with the field parameter set to the "nn" attribute table column, thus transferring its values to the dem_spain raster:

> grid = rasterize(grid, dem_spain, "nn")

The result is a nearest-neighbor predicted surface with the value of each pixel being the predicted temperature at the respective location. Let's see what it looks like using the following expressions:

> plot(grid)
> plot(dat, add = TRUE)

The graphical output is shown in the following screenshot:

Nearest-neighbor interpolation

The preceding screenshot clearly shows that the nearest-neighbor interpolation method has, in practice, divided the area of interest into discrete subareas. Each subarea includes all the locations that are more proximate to a given measured point (that is, a given meteorological station) than to any other. In fact, we could have also carried out nearest-neighbor interpolation by creating polygons defining such subareas and rasterized them, to produce the same result. Within the context of GIS, such polygons are known as Thiessen or Voronoi polygons (interested readers will find at least one way to create them in R; for example, using the dirichlet function of the spatstat package).
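
As a sketch of that alternative route, the following code creates the Thiessen polygons of the dat stations, assuming the spatstat and maptools packages are installed (maptools supplies the coercion from sp point layers to spatstat's ppp class); rasterizing the resulting polygons by station value would reproduce the nearest-neighbor surface shown above:

> library(spatstat)
> # maptools provides the coercion methods between sp and spatstat classes
> library(maptools)
> # Convert the stations layer to a spatstat point pattern
> dat_ppp = as(dat, "ppp")
> # Dirichlet tessellation = Thiessen/Voronoi polygons around each station
> thiessen = dirichlet(dat_ppp)
> plot(thiessen)
> plot(dat, add = TRUE)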

As mentioned earlier, nearest-neighbor interpolation is rarely useful to predict the behavior of natural phenomena in space. For example, it is clearly unrealistic to assume that discrete polygonal areas surrounding each meteorological station have uniform temperatures, with a sharp increase or decrease in temperature, along the borders between them (as the previous screenshot suggests). In reality, the temperature changes more or less gradually from place to place. To describe such gradual transitions, we need to use more elaborate methods, such as those described in the upcoming sections.

IDW interpolation

The three interpolation methods we will employ next—IDW, Ordinary Kriging (OK), and Universal Kriging (UK)—all utilize the same principal procedure, where the predicted value of a given point is determined as a weighted average of the values from the measured points. Moreover, the weight of each measured value is always a function of its distance from the point we are trying to predict (a nearby point usually having a higher weight, or influence, on the predicted value than a point further away). Within this framework, nearest-neighbor interpolation is, in fact, an extreme special case, where the weight of the nearest point is 1 while the weights of all other points are 0.

To better understand the subsequent material in this chapter, we can already state that the differences between the three methods concern two properties: the trend and the weights definition. The trend is basically an additional function, independent of the measured points values, added to the weighted average of the latter. In IDW, the trend is 0 and in OK it is a constant value, thus having no effect on the predicted pattern in either case. However, in UK, the trend is a function of additional covariates. Regarding the weights, in IDW they are arbitrarily determined (thus the method is considered deterministic); in OK and UK, conversely, they are estimated from the data itself (thus the methods are considered statistical).

As noted earlier, the predicted values in all three methods are the weighted averages of measured points:

$$\hat{Z}(s_0) = \sum_{i=1}^{n} w_i Z(s_i) \qquad (1)$$

Here, $\hat{Z}(s_0)$ is the predicted value at location 0, $Z(s_i)$ is the measured value at point $i$, and $w_i$ is the weight given to the measured value at point $i$, while $n$ is the total number of measured points.

In IDW, the weight $w_i$ is a function of the inverse distance, shown as follows:

$$w_i = \frac{d_{i0}^{-\beta}}{\sum_{j=1}^{n} d_{j0}^{-\beta}} \qquad (2)$$

Here, $d_{i0}$ is the geographic distance between the measured point $i$ and the predicted point at location 0, $\beta$ is the power to which the inverse distance is raised, and $n$ is the total number of measured points. The default value for $\beta$ is 2. This means that the importance of each measured point in determining a predicted value diminishes as a function of squared distance. When $\beta$ is smaller, the predicted surface will be smoother; when $\beta$ is larger, the predicted surface will be less smooth, giving more emphasis to the nearest neighbor.
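
To see equations (1) and (2) in action, the following sketch manually computes the IDW prediction (with $\beta = 2$) for a single grid point, assuming the dist matrix calculated in the nearest-neighbor section and the dat layer are still in memory:

> beta = 2
> # Distances from the first grid point to all 75 stations
> d = dist[1, ]
> # Inverse-distance weights (equation (2)); they sum to 1
> w = d^(-beta) / sum(d^(-beta))
> # Weighted average of the measured values (equation (1))
> sum(w * dat$value)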

We will now interpolate the temperature data in dat using the IDW method. To apply this method, and subsequent ones later in this chapter, we will use functions in the gstat package. This package provides extensive capability for univariate and multivariate geostatistical analysis, of which we will see a few examples.

Note

For an overview of gstat, see the official introductory document at http://cran.r-project.org/web/packages/gstat/vignettes/gstat.pdf.

The core of the gstat package is a function named gstat, which is used to produce objects (of a class also named gstat) that hold all necessary information to perform spatial interpolation (as well as other operations such as cross-validation). Among the numerous parameters that the gstat function accepts, there is, in fact, no parameter where we can specify the desired interpolation method. Instead, when interpolation is triggered on a gstat object, the method is determined based on the input data the object contains. In this chapter, we are going to explore only a portion of the functionality of gstat, utilizing three of its parameters:

  • formula: The formula that defines the dependent and independent variables (an object of the formula class)
  • data: The measured point data (a SpatialPointsDataFrame object; it can also be a data.frame object but we will not use this option)
  • model: The variogram model (an object of the variogramModel class)

These three parameters determine whether IDW, OK, or UK is used based on the following decision tree:

  • If there is no variogram model (model=NULL, the default), IDW is used
  • If a variogram model is passed to model, and the formula:
    • contains no independent variables, OK is used
    • contains at least one independent variable, UK is used

Our dependent variable column is value and an intercept-only model (that is, a model with no independent variables) formula in R is, by convention, specified by ~1. Therefore, the following expression creates a gstat object (named g) that holds the necessary information to predict the minimum temperature based on the measured data in dat using IDW:

> library(gstat)
> g = gstat(formula = value ~ 1, data = dat)

Printing g shows a summary of the information it contains as follows:

> print(g)
data:
var1 : formula = value`~`1 ; data dim = 75 x 5

Note

Note that this is just a summary; g, in fact, also contains the dat object itself, with the measured values and independent variables (if any), since it serves as the calibration data necessary to actually perform the interpolation.

To interpolate, what we now need is to provide the points for which we would like to make predictions. Conveniently, the raster package includes a function named interpolate that, given a raster (the object parameter) and a spatial prediction model (the model parameter), yields a new raster with predicted values. This way, we do not need to make the manual conversions to and from a raster, as we did in nearest-neighbor interpolation. For example, the following expression uses the model and calibration data held in g to make temperature predictions for the grid defined by dem_spain:

> z = interpolate(dem_spain, g)

There are two things to note here:

  • First, as you remember, text messages are omitted in this book to save space. However, readers who run the preceding expression in R will see the message [inverse distance weighted interpolation] appearing on the screen. This reassures the user that indeed the expected type of model has been chosen by gstat.
  • Second, it is important to understand that, if we do not use covariates (that is, independent variables in the model formula), the dem_spain raster serves only as a grid pointing to the locations where we would like to calculate predicted values (the raster values play no role in the interpolation). This is opposed to UK interpolation, where the raster serves both to point at predicted locations and provide covariate values for each location.

Plotting the predicted surface z will show that predicted values were generated for all raster cells (including NA cells). It is usually reasonable to make predictions for a more specific extent, such as the land area of Spain. We can use mask to remove the unnecessary predictions:

> z = mask(z, dem_spain)

Now let's plot the predicted value raster, and the station locations on top of it, with the following pair of expressions:

> plot(z)
> plot(dat, add = TRUE)

The graphical output is shown in the following screenshot:

IDW interpolation

In the preceding screenshot, we see the predicted average minimum temperature for 2002 in Spain. It seems the central-northern parts of the country were generally colder, while the south was warmer (this general pattern was also evident in the nearest-neighbor interpolation image). The effect of the station locations on the predicted surface is also notable. As you remember, the weight of a given station increases as the predicted point is nearer. Therefore, each point generates a sort of circular zone of influence around itself, especially if its value markedly differs from the values of its proximate neighbors.

To understand the role of the $\beta$ parameter (see equation (2)) more vividly, let's produce two more IDW-predicted surfaces: one with $\beta$ set to a very small value (say, 0.3) and one with $\beta$ set to a very large value (say, 30). Then, we will compare the results with the preceding graphical output (where $\beta$ was set to 2, which is the default value). The following code generates the two additional predicted surfaces and assigns them to objects z1 and z2, respectively:

> g1 = gstat(formula = value ~ 1, data = dat, set = list(idp=0.3))
> g2 = gstat(formula = value ~ 1, data = dat, set = list(idp=30))
> z1 = interpolate(dem_spain, g1)
> z2 = interpolate(dem_spain, g2)
> z1 = mask(z1, dem_spain)
> z2 = mask(z2, dem_spain)

Note that the set = list(idp = 0.3) part determines the value of $\beta$ (the idp parameter) in gstat. Using the following expressions, we can visualize the three rasters: z1, z, and z2 (the pch and cex parameters are used to control the symbol type and size, respectively, when drawing the dat points):

> plot(z1, main = c("beta = 0.3"))
> plot(dat, add = TRUE, pch = 20, cex = 0.5)
> plot(z, main = c("beta = 2"))
> plot(dat, add = TRUE, pch = 20, cex = 0.5)
> plot(z2, main = c("beta = 30"))
> plot(dat, add = TRUE, pch = 20, cex = 0.5)

The graphical outputs shown here, side by side, are produced as a result. The respective values of $\beta$ are stated above each output:

IDW interpolation

These images demonstrate the previously stated expectations. Decreasing $\beta$ (the leftmost image) results in a smoother surface due to a more homogeneous weight distribution among the measured points. On the other hand, increasing $\beta$ (the rightmost image) results in sharper boundaries between the zones of influence of each point, resembling the nearest-neighbor interpolation result in appearance.

Now that we have produced several interpolation results, how can we assess the predictive ability of this (or any other) interpolation method? The obvious way is to compare predicted values to observed ones, preferably with validation points that come from an independent dataset, that is, one that did not participate in model calibration. However, since measurements are usually scarce and valuable (we do not have an infinite number of meteorological stations), we would usually like to include each and every measurement in the model calibration, leaving no measurements for validation. The compromise approach to this problem is cross-validation, in which some of the observed points are set aside for validation, their values are predicted using the remaining points, and the predicted values are then compared to the observed ones. A special case of cross-validation is leave-one-out cross-validation (LOOCV), where every observed point is removed in turn and its value predicted based on the remaining points, so that we obtain a table with observed and predicted values for all observations. Assessing the differences between the observed and predicted values in such a table gives us a measure of the accuracy of our prediction model.
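
To make the procedure concrete, here is a minimal sketch of LOOCV performed manually with a loop, using IDW as in the previous sections (the gstat.cv function introduced next automates this kind of procedure):

> loocv_pred = rep(NA, nrow(dat))
> for(i in 1:nrow(dat)) {
+ # Calibrate an IDW model without station i
+ g_i = gstat(formula = value ~ 1, data = dat[-i, ])
+ # Predict at the held-out station
+ p = predict(g_i, newdata = dat[i, ])
+ loocv_pred[i] = p$var1.pred
+ }
> # Residuals: observed minus predicted
> loocv_resid = dat$value - loocv_pred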

The gstat.cv function can automatically perform cross-validation given an object of the gstat class. The default method is LOOCV; therefore, in the following expression, all we need to do is perform LOOCV on the IDW interpolation of temperature data:

> cv = gstat.cv(g)

This expression, in practice, triggers the execution of 75 spatial interpolation operations, each time leaving out one meteorological station and trying to predict its temperature value using the 74 remaining ones. The result is a SpatialPointsDataFrame object with the same number of features as dat and with the same spatial locations. Only the attribute table is different, now containing the cross-validation results:

> head(cv@data)
  var1.pred var1.var  observed   residual zscore fold
1  8.971923       NA 10.475000  1.5030771     NA    1
2 10.244447       NA 10.775000  0.5305532     NA    2
3 10.914591       NA 14.491667  3.5770756     NA    3
4  9.992180       NA 11.100000  1.1078202     NA    4
5 12.024766       NA 11.608333 -0.4164331     NA    5
6  7.435642       NA  5.983333 -1.4523087     NA    6

The attribute table columns of the cross-validation result refer to the following information:

  • var1.pred: Predicted value
  • var1.var: Variance (only for kriging)
  • observed: Observed value
  • residual: Residual, the difference between observed and predicted values
  • zscore: Z-score (only for kriging)
  • fold: Cross-validation count

The important column, for our purpose here, is residual—the difference between the observed and predicted values. Based on the residuals, we can assess how far predicted values are from the observed ones, or in other words, prediction accuracy. One of the simplest and most common metrics of agreement between observed and predicted values is the RMSE. Given a set of predicted-observed value pairs, the RMSE is calculated as follows:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{Z}(s_i) - Z(s_i)\right)^2} \qquad (3)$$

Here, RMSE is the root mean square error, $n$ is the overall number of points, and $\hat{Z}(s_i)$ and $Z(s_i)$ are the predicted and observed values at point $i$, respectively. The smaller the differences between predicted and observed values, the lower the RMSE, indicating a more accurate model. Given the same input data, we can compare the RMSE values among different models to select the most accurate one. Translating equation (3) to the R syntax, we can calculate the RMSE based on the attribute table of cv as follows:

> sqrt(sum((-cv$residual)^2)/nrow(cv))
[1] 1.88049

The RMSE in this case is equal to 1.88049. Note that the expression consists of the -cv$residual vector (equivalent to cv$var1.pred-cv$observed) squared, then summed, and divided by the total number of points. Finally, the square root of the result is extracted.

For convenience, we can wrap this expression into our own function, which calculates RMSE, given a gstat.cv output:

> rmse = function(x) sqrt(sum((-x$residual)^2)/nrow(x))
> rmse(cv)
[1] 1.88049

We will use this function later.

Note

You may have noticed that the gstat object we initially defined (g) was hereby used for two distinct and independent purposes. First, to calculate predicted values (using the interpolate function), then to perform cross-validation (using the gstat.cv function). This is not incidental, but rather a characteristic object-oriented behavior common to all, or most, of R's statistical procedures. This approach is advantageous over the menu-based approach commonly encountered in statistical or GIS software, where we carry out a given analysis by making selections in a set of dialog boxes, and receive a set of results (some of which we do not need, while others are lacking since the tool's creator did not include them). Conversely, when a statistical function is applied in R, we usually get an object that holds all the data that is necessary to carry out the analysis (such as g) or an object that holds a comprehensive set of results (such as cv). From such objects, we can then derive the results we are interested in by applying specific functions. Moreover, we can create functions of our own to carry out calculations or extract meaningful data, for which no built-in methods have been defined (such as the rmse function we defined earlier). Readers who are interested in using R as a general statistical analysis toolbox will repeatedly encounter this approach. For example, the lm function is used to perform ordinary linear regression in R; on its returned object, of class lm, numerous functions (such as summary, residuals, and plot) may be applied to derive the information we are interested in.
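
For example, since cv is an ordinary SpatialPointsDataFrame object, further results can be derived from it by applying additional functions; the following sketch summarizes the LOOCV residuals and maps them (bubble is defined in the sp package):

> # Numeric summary of the cross-validation residuals
> summary(cv$residual)
> # Map of the residuals, with symbol size and color reflecting their values
> bubble(cv, "residual")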

Interpolation using Ordinary Kriging

In kriging, as mentioned earlier, the weight $w_i$ given to each measurement when calculating a predicted value (see equation (1)) is determined statistically, rather than arbitrarily (such as by choosing the value of $\beta$ ourselves in equation (2)). The function determining the weights in kriging is called a variogram model. The variogram model is a function fitted to the empirical variogram, which in turn describes the spatial autocorrelation structure of the observed pattern. What is important to understand is that the empirical variogram describes the average degree of autocorrelation between observed values, and the variogram model is a continuous function fitted to these data. The variogram model determines, in plain terms, the importance of points nearby and further away in the calculation of predicted values. For example, if autocorrelation is high over short distances, it would make sense to let nearby points largely determine the predicted value, making the predicted surface rougher; when autocorrelation is low, the weight distribution over distance may be more uniform and the predicted surface smoother. Elaborating on the subject of variogram modeling is beyond the scope of this book; interested readers can find more information in A Practical Guide to Geostatistical Mapping of Environmental Variables, Hengl, T. (2007), or in any textbook on geostatistics.

An empirical variogram can be calculated using the variogram function, once again based on a gstat object. Let's take a look at the following example:

> ev = variogram(g)

The returned object, of the gstatVariogram class, contains semivariance values for different distance bins, which together characterize the degree of spatial autocorrelation in the data. Note that there are numerous optional parameters controlling the way a variogram is computed (such as the way these distance bins are determined), which we will not go into (see ?variogram). The plot method for a gstatVariogram object plots the empirical variogram as follows:

> plot(ev)

The following screenshot shows how the plot appears in our case:

Interpolation using Ordinary Kriging

We can see that semivariance increases with distance. This means that temperature values are more diverse when considering stations further apart. In other words, temperature is spatially autocorrelated.

As mentioned earlier, a variogram model is a function fitted to the empirical variogram. There are several specific functions that are adequate to describe the characteristic behavior of an empirical variogram (increasing semivariance with distance, usually reaching saturation at some point). These functions, such as the spherical and exponential functions, are generally used for a variogram model (type vgm() for a list of the functions implemented in gstat, or show.vgms() for a visual overview). In addition to choosing one of these functions, we need to decide on its specific parameters. There are three main approaches, not mutually exclusive, to selecting a variogram model (that is, a function plus its set of parameters) based on an empirical variogram:

  • Setting the function and its parameters by hand (for example, using the vgm function from the gstat package)
  • Visually fitting a function to the empirical variogram by interactively varying the function type and its parameters and seeing how its appearance changes (for example, using the eyefit function from the geoR package)
  • Statistically selecting the function and parameters that minimize some goodness-of-fit criterion (for example, using the fit.variogram function from the gstat package, the variofit function from the geoR package, or the autofitVariogram function from the automap package)

Since we are hereby taking a simple, practical approach, we will use the third option of letting the computer select the function that follows the empirical variogram most closely. Moreover, we are going to bypass the requirement to supply the function type and initial parameter estimates (as required, for example, in fit.variogram) and use a wrapper function named autofitVariogram that automatically optimizes this decision for us.
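
For comparison, here is a minimal sketch of the manual route we are bypassing, using gstat's own vgm and fit.variogram functions together with the empirical variogram ev computed earlier; the spherical function ("Sph") and the initial parameter values are illustrative guesses, not values taken from this chapter:

> # Initial variogram model, specified by hand
> v_init = vgm(psill = 10, model = "Sph", range = 300000, nugget = 1)
> # Adjust the parameters so that the model best fits the empirical variogram
> v_fit = fit.variogram(ev, model = v_init)
> # Plot the empirical variogram with the fitted model on top
> plot(ev, v_fit)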

The autofitVariogram function is defined in the automap package. It accepts formula and data arguments analogously to gstat (note that its data argument is named input_data). The following expression, therefore, fits a variogram model to our data:

> library(automap)
> v = autofitVariogram(formula = value ~ 1, input_data = dat)

Plotting the returned object of the autofitVariogram class generates a plot with an empirical variogram and the fitted variogram model on top of it:

> plot(v)

The resulting output is shown in the following screenshot:

Interpolation using Ordinary Kriging

The points, once again, show the empirical variogram, while the continuous line is the fitted variogram model. In this case, for instance, the selected model was "Matern, M. Stein's parameterization" (code name "Ste"), with the four parameter values stated in the bottom-right corner of the image. The object v contains several components, such as the empirical variogram data points and the sums of squares between the sample variogram and the fitted variogram model. The component we are interested in is the fitted variogram model. This component's name is var_model, and it can be accessed with the $ operator, using the expression v$var_model (this behavior stems from the fact that an autofitVariogram object is built upon a list, and list elements can be accessed the same way as data.frame columns, with $).

The important point here is that the v$var_model component of v is an object of class variogramModel, containing the function type and parameters defining a variogram model. A variogramModel object can be passed to the gstat function to indicate a variogram model. Therefore, we can create an object containing all the necessary information to perform the OK interpolation as follows:

> g = gstat(formula = value ~ 1, data = dat, model = v$var_model)
> g
data:
var1 : formula = value`~`1 ; data dim = 75 x 5
variograms:
        model      psill    range kappa
var1[1]   Nug  0.4594258      0.0   0.0
var1[2]   Ste 10.9549186 626562.4   0.3

The printed output shows that the object indeed contains a variogram model. We can now use the interpolate function to produce the predicted OK surface, and the mask function to clip the area of interest, exactly the same way we did in IDW interpolation:

> z = interpolate(dem_spain, g)
> z = mask(z, dem_spain)

Those readers who execute the above expression in R should see, this time, the message [using ordinary kriging] on screen.

Let's plot the interpolation result with the following expressions:

> plot(z)
> plot(dat, add = TRUE)

The graphical output is shown in the following screenshot:

Interpolation using Ordinary Kriging

One of the notable visual differences between IDW and kriging predictions, in general, is that in IDW the weight of a given measured point approaches 1 as we get nearer to it. Therefore, the values around each measurement point tend to approach the measured value itself (see the previous screenshot); in kriging, however, this is not necessarily so.

We can now perform LOOCV of the OK prediction model in order to compare the resulting RMSE with that of IDW:

> cv = gstat.cv(g)
> rmse(cv)
[1] 1.680153

The RMSE is smaller in this case, which suggests that OK produces more accurate predictions than IDW interpolation.

Using covariates in Universal Kriging interpolation

Universal Kriging (also referred to as Kriging with External Drift or Regression Kriging) is, in fact, a general model of which OK is a special case. While both methods involve spatial autocorrelation modeling, they differ in the definition of the trend. In OK, the trend is a constant value, while in UK the trend is a linear function of one or more covariates. In other words, in OK the predicted value is a constant plus a weighted function of neighboring measurements, while in UK it is a value predicted from covariates plus a weighted function of neighboring measurements.

For instance, it is well known that temperature is negatively correlated with elevation (it gets colder as we climb to a higher altitude). Using a simple scatterplot, we can see that this rule holds true in the present case as well:

> plot(value ~ elevation, dat@data, 
+ xlab = "Elevation (m)", 
+ ylab = "Temperature (degrees Celsius)")

The resulting plot clearly shows the negative relationship between elevation and temperature, although the relation is not perfect:

Using covariates in Universal Kriging interpolation

Sometimes the temperature for a given altitude is lower than usual and sometimes higher. Can we create a model where the temperature is predicted using elevation and then fine-tuned using nearby meteorological measurements? UK does exactly this by adding a weighted average of station values (as OK does) to a general trend surface defined by a linear function of covariates (in this case, elevation).
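
To make the idea concrete, the following sketch fits the elevation trend on its own with ordinary regression; roughly speaking, UK adds to such a trend a kriged surface of the station residuals (whose spatial autocorrelation the residual variogram, introduced below, describes):

> # Linear trend of temperature as a function of elevation
> trend = lm(value ~ elevation, data = dat@data)
> # Intercept and (negative) slope of the trend
> coef(trend)
> # Station residuals from the trend
> head(residuals(trend))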

To interpolate the temperature data using UK with elevation as a covariate, we need to make two principal alterations compared to the OK procedure:

  • The variogram model is constructed according to the so-called residual variogram, which portrays the spatial autocorrelation of the residuals from the chosen trend
  • The predicted values are calculated based on both the neighboring measured values (as in OK) and the covariates; thus, each point we would like to predict must be accompanied by its respective set of covariate values

Starting with the variogram model, as opposed to OK, our formula now contains elevation as an independent variable:

> v = autofitVariogram(formula = value ~ elevation,
+ input_data = dat)

Plotting the new v object will show that both the empirical variogram and the variogram model are different from what we previously had since they now concern the residuals from the elevation trend. Next, we need to create the gstat object, which contains the information to perform UK interpolation, supplying value~elevation as the formula and v$var_model as the variogram model:

> g = gstat(formula = value ~ elevation,
+ data = dat,
+ model = v$var_model)
> g
data:
var1 : formula = value`~`elevation ; data dim = 75 x 5
variograms:
        model     psill    range kappa
var1[1]   Nug 0.7410872     0.00   0.0
var1[2]   Ste 0.9696567 35781.93   0.2

By supplying the appropriate formula in the variogram fitting stage and in the gstat function call, we dealt with the first issue. As for the second issue, it is necessary for us to provide an elevation value for each point where we try to predict the temperature (otherwise the trend cannot be calculated). Since the points are specified in the form of a raster, it is only natural that the covariate values will come from the raster values themselves. This is exactly what the interpolate function expects, but in order for this function to correctly identify which raster layers contain the values of the covariates, the respective layers must be named exactly the same way as the covariates are (in the formula passed to autofitVariogram and gstat, as well as in the measurements point layer). In our case, there is only one covariate (elevation) and one layer in the dem_spain raster. Therefore, we just need to make sure that layer is named "elevation":

> names(dem_spain)
[1] "spain_elev"
> names(dem_spain) = "elevation"
> names(dem_spain)
[1] "elevation"

In addition, we need to make one final adjustment to the interpolate function call—setting xyOnly=FALSE, which instructs the interpolate function to consider the raster values as covariates (rather than just coordinates of prediction points). The option xyOnly=FALSE should be specified whenever we have a model with covariates:

> z = interpolate(dem_spain, g, xyOnly = FALSE)

When running the previous expression, you should see the [using universal kriging] message on screen. Let's plot the UK predictions:

> z = mask(z, dem_spain)
> plot(z)
> plot(dat, add = TRUE)

The graphical output is shown in the following screenshot:

Using covariates in Universal Kriging interpolation

The preceding screenshot quite obviously resembles Spain's DEM. In fact, it shows the elevation surface of Spain converted into temperature units (using a linear relation based on the data from the meteorological stations) and locally calibrated using, once again, the station data. This is what UK results commonly look like.

Let's examine the RMSE of the UK prediction model as well:

> cv = gstat.cv(g)
> rmse(cv)
[1] 1.455196

We see that the RMSE of UK, in this case, is even lower than that of OK. This result suggests that utilizing the value~elevation trend has further improved prediction accuracy.
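
As a final bookkeeping note, the three RMSE values can be collected side by side for comparison. Since the cv object was overwritten at each stage above, the names cv_idw, cv_ok, and cv_uk in the following sketch are hypothetical placeholders for the three saved cross-validation results:

> # Hypothetical: assumes the three gstat.cv results were kept under these names
> data.frame(method = c("IDW", "OK", "UK"),
+ rmse = c(rmse(cv_idw), rmse(cv_ok), rmse(cv_uk)))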
