So far, we have covered operations to transform a vector layer to a raster and vice versa. The third operation involving vector layers and rasters, and the focus of this final section, is the extraction of raster values according to vector layers. We are often interested in reducing or summarizing raster data using point, line, or polygon features, which is when this operation comes in handy. For example, we may wish to calculate the elevation profile covered by a GPS track (raster-to-line extraction) or the average NDVI of a given forest (raster to polygon extraction). In this section, we will see two examples involving extraction by points and extraction by polygons.
Extraction of raster values, according to a vector layer of any kind, can be done with the extract function. The first two parameters of this function are as follows:
x: The raster whose values are extracted
y: The vector layer defining the locations of extraction
When extracting values according to points, which is the simplest extract scenario, the returned object may be either of the following:
A numeric vector, when extracting the values of a single band raster
A matrix, when extracting the values of a multiband raster
As an example of extracting single band raster values to points, we will use the spain_stations.csv file we previously created (see Chapter 3, Working with Tables), which contains a table with the spatial location records of 96 meteorological stations in Spain. We will create a SpatialPointsDataFrame object based on this table, and then use it to extract elevation values from a DEM of Spain. The DEM will be obtained from a file (available on the book's website) named spain_elev.tif.
First, we will read the DEM file into R. This file, by the way, was created by merging SRTM tiles downloaded with getData, exactly as we did to create the Haifa DEM in the previous chapter, followed by reprojection, masking, and aggregation to a 900-meter resolution.
> dem_spain = raster("C:\\Data\\spain_elev.tif")
Now, we will read the station records from spain_stations.csv:
> stations = read.csv("C:\\Data\\spain_stations.csv",
+ stringsAsFactors = FALSE)
As you remember, the station locations are stored in the longitude and latitude columns; therefore, the data.frame object can be promoted to a SpatialPointsDataFrame object as follows:
> coordinates(stations) = ~ longitude + latitude
The station coordinates are given in degrees, according to a geographic CRS. Therefore, we will define this CRS and then reproject stations to match the CRS of dem_spain (UTM zone 30N) as follows:
> proj4string(stations) = CRS("+proj=longlat +datum=WGS84")
> stations = spTransform(stations, CRS(proj4string(dem_spain)))
Let's plot stations on top of dem_spain to see that they indeed match in location:
> plot(dem_spain)
> plot(stations, add = TRUE)
The graphical output is shown in the following screenshot:
What we see here is the elevation map of Spain (the dem_spain raster), with station locations on top (the stations point layer). The stations seem to cover the whole area of the country more or less evenly. Note that some of the stations are not shown, since they are located on the Canary Islands and are thus beyond the extent of the dem_spain raster.
To extract the elevation values underlying each station, we employ the extract function with the raster and the points as the first two arguments, respectively. Since the returned object is going to be a numeric vector with the respective elevation value for each station, we can directly assign it to a new column in stations named elev_dem:
> stations$elev_dem = extract(dem_spain, stations)
Examining the attribute table of stations shows that we now have two elevation entries per station: the elevation values originally provided along with the meteorological records (the elevation column) and the elevation values we just obtained from the DEM (the elev_dem column):
> head(stations@data)
            station elevation elev_dem
1 GHCND:SP000003195       667   651.96
2 GHCND:SP000004452       185   184.24
3 GHCND:SP000006155         7     5.80
4 GHCND:SP000008027       251   212.44
5 GHCND:SP000008181         4     3.55
6 GHCND:SP000008202       790   787.28
Examining this table more closely will show that some of the stations were assigned NA in the elev_dem column, either because they are located near Spain's border (and incidentally outside the DEM extent) or because they are located on the Canary Islands (which the dem_spain raster does not cover at all).
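Continuing the same session, a quick way to see how many stations are affected is to count the NA values in the new column (a sketch, assuming stations already holds the elev_dem column):

```r
> sum(is.na(stations$elev_dem))
```

The subset stations[is.na(stations$elev_dem), ] would then show exactly which stations these are.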
It would be interesting to see the degree of agreement between the two sources of information by plotting one vector as a function of the other:
> plot(elev_dem ~ elevation, stations,
+ xlim = c(0, 2000),
+ ylim = c(0, 2000),
+ xlab = "Elevation from station record (m)",
+ ylab = "Elevation from DEM (m)")
Note that the plot(elev_dem~elevation,stations) expression is analogous to plot(stations$elev_dem~stations$elevation). In the latter syntax, the vector names are provided explicitly, while in the former (often more convenient) syntax, the formula addresses columns from a data.frame object provided as the second argument.
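The equivalence can be demonstrated with a tiny self-contained data.frame (the df object and its values here are made up purely for illustration):

```r
> df = data.frame(elevation = c(100, 500, 900),
+ elev_dem = c(110, 480, 910))
> plot(elev_dem ~ elevation, df) # formula interface
> plot(df$elev_dem ~ df$elevation) # explicit vectors; same plot
```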
To make the assessment more convenient, we will also use the abline function, which can add a straight line to an existing plot. One way of specifying the line's location is to provide its intercept (a) and slope (b). We are going to add a 1:1 line (that is, a line with an intercept of 0 and a slope of 1) in order to see how well the elevation and elev_dem records match:
> abline(a = 0, b = 1, lty = "dotted")
The graphical output is shown in the following screenshot:
Each point corresponds to a single station, with the respective elevation and elev_dem values on the x and y axes, respectively. The agreement between the two sources of information is, unsurprisingly, very good, except for a few stations that lie slightly farther from the 1:1 line.
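The visual impression can be backed up with a numeric measure of agreement, for instance the Pearson correlation between the two columns (a sketch continuing the same session; use="complete.obs" drops the NA stations we encountered earlier):

```r
> cor(stations$elevation, stations$elev_dem, use = "complete.obs")
```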
When extracting raster values using line or polygon layers, each feature may correspond to more than one value in each raster layer (unlike with points, where each feature always corresponds to a single value). In fact, a variable number of values may be extracted for each feature, depending upon its shape. For example, the polygon delineating the Lahav forest, which we created earlier, covers 10,586 Landsat pixels and 38 MODIS pixels, while the Kramim polygon covers 7,538 Landsat pixels and 28 MODIS pixels. We have two ways to deal with this variation:
Summarizing the values of each feature into a single value, by supplying a function (with the fun parameter), in which case the returned object may be, just as with point layers, a numeric vector (for a single band raster) or a matrix (for a multiband raster)
Keeping all of the individual values (with fun=NULL, the default value), in which case the returned object will be a list (with df=FALSE, the default value) or a data.frame object (with df=TRUE)
When a function is provided with the fun parameter, the additional parameter na.rm determines whether NA cells are included in the calculation.
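As a sketch of the second alternative (assuming the ndvi_00 raster and the forests layer from earlier in this chapter are in memory), the following calls keep all of the individual pixel values instead of summarizing them:

```r
> v = extract(ndvi_00, forests) # a list: one numeric vector per polygon
> sapply(v, length) # the number of pixels in each forest
> v = extract(ndvi_00, forests, df = TRUE) # the same values as a data.frame
```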
Proceeding with the forests example, we will now complete the two remaining steps: extracting the NDVI data and arranging it in a table. We are going to extract NDVI values according to the forests layer with the first alternative, summarizing the values (specifically with fun=mean, giving a mean NDVI per forest). Our ultimate goal is to have, by the end of this chapter, a table with three ID columns (date, sat (satellite), and forest) and a fourth column (ndvi) holding the measured average NDVI values.
Starting with the assembly of NDVI images (ndvi_98, ndvi_00, ndvi_03, and r), we will read the Landsat images from 1998 and 2003 and calculate NDVI (assuming that r and ndvi_00 are already in memory, as is our custom-made ndvi function; see Chapter 4, Working with Rasters).
> l_98 = brick("C:\\Data\\landsat_15_10_1998.tif")
> l_03 = brick("C:\\Data\\landsat_11_09_2003.tif")
> ndvi_98 = calc(l_98, fun = ndvi)
> ndvi_03 = calc(l_03, fun = ndvi)
Next, we will create a Date object, named l_dates, holding the Landsat image dates (see the filenames). We will use this object later when creating a table of results.
> l_dates = as.Date(c("1998-10-15", "2000-10-04", "2003-09-11"))
We are ready to proceed with the extraction, employing the extract function on the three Landsat NDVI images to obtain the mean NDVI value per forest, per date:
> l_forests = extract(stack(ndvi_98, ndvi_00, ndvi_03),
+ forests,
+ fun = mean,
+ na.rm = TRUE)
Since we are extracting values from a multiband raster, yet employing a function (mean) to summarize those values, the returned object, assigned to l_forests, is a matrix. Its two rows correspond to the forests features, while its three columns correspond to the layers of stack(ndvi_98,ndvi_00,ndvi_03). For example, we can see that the average NDVI observed by Landsat on October 15, 1998, in the Lahav forest was 0.3053538:
> l_forests
       layer.1   layer.2  layer.3
[1,] 0.3053538 0.2487563 0.284487
[2,] 0.2840073 0.2190098 0.243326
Right now, we can already tell that, in both forests, NDVI decreased between 1998 and 2000, and then (incompletely) recovered between 2000 and 2003.
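The decrease and partial recovery can be quantified directly from the l_forests matrix (a sketch continuing the same session, while l_forests is still in its original matrix form):

```r
> l_forests[, 2] - l_forests[, 1] # change 1998-2000 (negative = decrease)
> l_forests[, 3] - l_forests[, 2] # change 2000-2003 (positive = recovery)
```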
By repeating the same procedure with r, we will create the analogous r_forests matrix:
> r_forests = extract(r,
+ forests,
+ fun = mean,
+ na.rm = TRUE)
This time, the matrix has 280 columns, since r has 280 layers:
> dim(r_forests)
[1]   2 280
Proceeding with the third step, we would like to have the information from l_forests and r_forests in a single data.frame object, with all NDVI values in a single column and additional columns characterizing the measurements (date, sat, and forest). Starting with the l_forests matrix, we will first transpose it (using the t function) and convert it to a data.frame object (using the as.data.frame function):
> l_forests = as.data.frame(t(l_forests))
> l_forests
               V1        V2
layer.1 0.3053538 0.2840073
layer.2 0.2487563 0.2190098
layer.3 0.2844870 0.2433260
Now, we can set the appropriate column names (the forest names) and create new columns for the dates (obtained from l_dates) and the satellite ("Landsat") as follows:
> colnames(l_forests) = forests$name
> l_forests$date = l_dates
> l_forests$sat = "Landsat"
The new l_forests data.frame object looks as follows:
> l_forests
            Lahav    Kramim       date     sat
layer.1 0.3053538 0.2840073 1998-10-15 Landsat
layer.2 0.2487563 0.2190098 2000-10-04 Landsat
layer.3 0.2844870 0.2433260 2003-09-11 Landsat
Exactly the same procedure is repeated for r_forests (with the acquisition dates taken from dates$date and the satellite name set to "MODIS"):
> r_forests = as.data.frame(t(r_forests))
> colnames(r_forests) = forests$name
> r_forests$date = dates$date
> r_forests$sat = "MODIS"
Now, we can combine the two data.frame objects using rbind:
> forests_ndvi = rbind(l_forests, r_forests)
The combined data.frame object, which we named forests_ndvi, contains all of the average NDVI records for the two forests from the two satellites, collectively for 283 dates (three dates from Landsat and 280 dates from MODIS). Its first few rows are printed as follows:
> head(forests_ndvi)
            Lahav    Kramim       date     sat
layer.1 0.3053538 0.2840073 1998-10-15 Landsat
layer.2 0.2487563 0.2190098 2000-10-04 Landsat
layer.3 0.2844870 0.2433260 2003-09-11 Landsat
modis.1 0.3725111 0.3416607 2000-02-18   MODIS
modis.2 0.3959158 0.3850857 2000-03-05   MODIS
modis.3 0.4102210 0.3956179 2000-03-21   MODIS
What is left to be done is to transform the data.frame object to a longer form (see Chapter 3, Working with Tables), creating another column for the forest identity and transferring the NDVI values to a designated values column. This can be performed with the melt function from the reshape2 package:
> library(reshape2)
> forests_ndvi = melt(forests_ndvi,
+ measure.vars = forests$name,
+ variable.name = "forest",
+ value.name = "ndvi")
Note that the measured variables here are "Lahav" and "Kramim", while the rest of the columns are treated as ID variables. Instead of typing the measure variable column names, we passed the forests$name vector, which already contains the necessary names:
> forests$name
[1] "Lahav"  "Kramim"
The additional parameters variable.name and value.name in the melt function call are used to specify the names of the newly created variable and value columns (to replace the default names "variable" and "value", respectively). The final table looks as follows:
> head(forests_ndvi)
        date     sat forest      ndvi
1 1998-10-15 Landsat  Lahav 0.3053538
2 2000-10-04 Landsat  Lahav 0.2487563
3 2003-09-11 Landsat  Lahav 0.2844870
4 2000-02-18   MODIS  Lahav 0.3725111
5 2000-03-05   MODIS  Lahav 0.3959158
6 2000-03-21   MODIS  Lahav 0.4102210
In Chapter 9, Advanced Visualization of Spatial Data, we are going to use this table to create a plot of the NDVI evolution over time in these two forests. A table of this form, where each variable forms a column and each observation forms a row, constitutes a so-called tidy data table (see the paper by Hadley Wickham on this subject, 2014). As we shall see, bringing our data into such a form is often required in order to use more sophisticated graphical functions, such as those in the ggplot2 package.