In this section, you will learn how to perform operations involving pairs of vector layers. These types of operations are very common in spatial data analysis. We often want to know, for instance:
All of these operations require the overlay of features from two distinct layers, although, as we have seen earlier, the result can be:
In this section, we will see examples of all three kinds of operations.
Querying relations between two layers is required when we would like to do one of the following:
TRUE
if polygon A is completely within polygon B, or FALSE
otherwise)In GIS terminology, the first type of operation is sometimes referred to as a spatial join or a join by spatial location. The way it can be done in R will be demonstrated using the airports
and county
layers. The second type of operation (examining relations) will be demonstrated using another example, involving layers of buildings and natural areas in London.
The over
function provides consistent functionality to join the attribute table data from one spatial object to another based on their intersection. All nine possible types of relations are permitted (point/line/polygon with point/line/polygon), either via the sp
package (point-point, point-polygon, and polygon-point) or rgeos
(all other combinations). The over
function accepts two spatial layers (parameters x
and y
). The function call over(x,y)
then returns—for all features in the first layer (x
)—the attribute table entries (or indices, if the layer has no attribute table) of the second layer (y
) that intersect it. The related syntax x[y,]
, when both x
and y
are vector layers, serves as a shortcut to over
when we are interested in subsetting the vector layer x
according to intersection. The latter function call returns only those features in x
that intersect with a feature in y
.
As already mentioned, our first example of querying relations between layers will involve the airports
and county
layers. The first preliminary step, as always, is to bring both layers to a common CRS. In this case, we will reproject airports
to the CRS of county
:
> airports = spTransform(airports, CRS(proj4string(county)))
We will now plot airports
on top of the county
layer to visually examine their relation so that we know what to expect later. Since we already know that all three airports are located in New Mexico, we will work with a subset of county
, called nm
, containing only the counties of New Mexico:
> nm = county[county$NAME_1 == "New Mexico", ] > plot(nm) > plot(airports, col = "red", pch = 16, add = TRUE)
Note that we used yet another parameter of the plot
function, pch
, to choose a different point shape (16
corresponds to the filled circles, while the default argument is a plus symbol +
).
The following screenshot shows the graphical output that is produced:
We can see that two of the airports fall within (thus, by definition, also intersect) a single county, while the third airports falls within a different county. Using over
with airports
and nm
(in that order) will return the following output:
> over(airports, nm) NAME_1 NAME_2 TYPE_2 FIPS area 1 New Mexico Bernalillo County 35001 3023.909 2 New Mexico Bernalillo County 35001 3023.909 3 New Mexico Santa Fe County 35049 4944.339
Indeed, we see that the first two rows are identical since the first two airports are located in the same county (Bernalillo county). To examine specifically which airport falls within each county, we can bind the result of over
with the attribute table of airports:
> cbind(airports@data, over(airports, nm)) name NAME_1 NAME_2 TYPE_2 FIPS 1 Albuquerque International New Mexico Bernalillo County 35001 2 Double Eagle II New Mexico Bernalillo County 35001 3 Santa Fe Municipal New Mexico Santa Fe County 35049 area 1 3023.909 2 3023.909 3 4944.339
Now we can tell that the Albuquerque International and Double Eagle II airports are located within Bernalillo County, while the Santa Fe Municipal airport is located within Santa Fe County. To permanently incorporate the county information into the attribute table of airports
, we can assign the combined table back to the attribute table slot of airports
, with an expression such as airports@data=cbind(airports@data,over(airports,nm))
.
Examining the opposite relation (when nm
is x
and airports
is y
), we can, for example, subset those counties that intersect with at least one airport using the expression nm[airports,]
. As noted earlier, using [
with two vector layers is in fact a shortcut used to retain those features of x
that intersect with y
. In this case, an equivalent over
expression would be nm[!is.na(over(nm,airports)$name),]
, but using nm[airports,]
is obviously more convenient.
We can plot the nm[airports,]
subset to demonstrate the behavior:
> plot(nm[airports, ]) > plot(airports, add = TRUE, col = "red", pch = 16, cex = 1.5) > text(airports, airports$name, pos = 1)
The additional pos
parameter of the text
function controls the position of the text with respect to the point coordinates (the default behavior we previously witnessed is to place the text centered on the coordinates point itself, with pos=1
the text is placed below the point; see ?text
for all options). The following screenshot shows the graphical output that is produced as a result of the three function calls:
We can see that nm[airports,]
indeed consists of only those nm
features that intersect airports.
If we try to match the airports
attribute table entries to nm
(with over(nm,airports)
), only the first airport that intersects each county will be returned in cases where there are multiple matches. If we wish to preserve all matches, we need to specify returnList=TRUE
in the over
function call. However, the returned object will be a list
object since holding sets of elements with variable lengths requires a list
rather than a data.frame
object. Many of the more advanced uses of R that require such flexibility involve list
objects. However, the subject is beyond the scope of this book. More information on lists can be found in the official R's introduction document at http://cran.r-project.org/doc/manuals/r-release/R-intro.pdf and in most introductory books on R.
As an example of querying polygon-polygon relations, we will use another example involving three polygonal Shapefiles:
The datasets were downloaded from freely accessible online resources, either from the Office of National Statistics at https://geoportal.statistics.gov.uk/ (administrative areas) or OpenStreetMap (buildings and natural areas).
First, we will read the files into R and bring them to a common CRS. We will begin with the administrative areas layer, reading it from the disk and naming it boundary
:
> boundary = readOGR("C:\Data", "CTYUA_DEC_2013_EW_BFE")
Using the proj4string
function reveals the CRS of boundary—the British National Grid. Note that the resulting string is abbreviated in the following output since it does not fit in a single line:
> proj4string(boundary) [1] "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 $
Next, we will read the two OpenStreetMap layers:
> buildings = readOGR("C:\Data", "london_buildings") > natural = readOGR("C:\Data", "london_natural")
Both these layers are defined in a geographical CRS, as the following output demonstrates:
> proj4string(buildings) [1] "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,$ > proj4string(natural) [1] "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,$
We will reproject the buildings
and natural
layers to the CRS of boundary
:
> buildings = spTransform(buildings, CRS(proj4string(boundary))) > natural = spTransform(natural, CRS(proj4string(boundary)))
Now that the preliminary preparations are complete, we can continue with the exercise. Our goal will be to assign the distance to the River Thames to each of the buildings within the City of London. For this, we will go through the following intermediate steps:
buildings
with only those in the City of London.natural
with only the riverbanks.buildings
and the riverbank.The administrative boundaries layer comes with an explanatory file (Product Specification for CTYUA_2013_EW_BFE.docx
, which is also provided on the book's website), where we can find out that the names of the different administrative areas are defined in the CTYUA13NM
column of the attribute table of this layer. Using this column, we will create a subset, named city
, containing only the polygon defining the city of the London administrative area:
> city = boundary[boundary$CTYUA13NM == "City of London", ]
Using the city
polygon, we will create a subset of only those buildings that are within the City of London. If we were interested in intersection (in other words, buildings that are completely or incompletely within the city
polygon), we could use the [
or over
methods as already shown earlier. To evaluate other types of relationships, the rgeos
package provides a dozen additional functions: gContains
, gContainsProperly
, gCovers
, gCoveredBy
, gCrosses
, gDisjoint
, gEquals
, gEqualsExact
, gIntersects
, gTouches
, gOverlaps
, and gWithin
. The names of these functions already provide a clue for the relationship they are used to evaluate; the exact definitions are provided in the respective help page of each function.
Since in our example we are interested in creating a subset of the buildings
features that are contained within city
, we are going to use the function that evaluates containment—gContains
. According to the help page of gContains
, the expression gContains(city,buildings)
will return TRUE
for buildings
features that are contained within city
. Specifying byid=TRUE
is necessary to evaluate the relation separately for each feature:
> in_city = gContains(city, buildings, byid = TRUE)
The result, in_city
, is a two-dimensional matrix with rows corresponding to the buildings
features (210,937 buildings in Greater London) and columns corresponding to the city
features (there is only one, the City of London; if we had more than one feature in city
, we would have had more columns):
> class(in_city) [1] "matrix" > head(in_city) 119 0 FALSE 1 FALSE 2 FALSE 3 FALSE 4 FALSE 5 FALSE
The first (and only) column of this matrix is a logical vector specifying the containment status of each building within the city
polygon. Using this vector as a rows index for buildings
creates a subset of only those buildings within city
:
> buildings = buildings[in_city[, 1], ]
The first step is now complete. Returning to the natural
layer, we will now subset the polygons of type "riverbank"
(which in the vicinity of the City of London corresponds to the River Thames):
> river = natural[natural$type == "riverbank", ]
As the third step, we will dissolve the separate river segments since we will be interested in the shortest distance to any section of the river, rather than the distances to its specific parts. As previously shown in the county
example, dissolving can be achieved with gUnaryUnion
. This time, the id
parameter is left unspecified so that all geometries will be dissolved into a single one:
> river = gUnaryUnion(river)
Before continuing with the fourth step (calculating distances, which we'll see later in this chapter), we will visually review the processed buildings
and river
layers we have at this point, with respect to the administrative boundaries layer boundary
:
> plot(buildings, col = "sandybrown") > plot(river, col = "lightblue", add = TRUE) > plot(boundary, border = "dimgrey", add = TRUE)
The resulting graphical output is shown in the following screenshot:
In this preceding screenshot, we see the buildings of the City of London (in brown), the dissolved riverbanks polygon (in blue), and the administrative areas boundaries (in gray).
The rgeos
package provides four functions to create new layers based on a pair of existing ones: gDifference
, gIntersection
, gSymdifference
, and gUnion
. The usage of these functions is very similar to that of functions to query relationships since their main parameters are also a pair of layers and the byid
parameter. The difference is that they do not return logical values or matched attribute table entries (based on whether the examined relationship holds), but rather a new layer. The following diagram demonstrates how new geometries are generated in each case:
Our next example will utilize two of these functions: gIntersection
and gDifference
. We will also use three new layers: buildings, natural areas, and administrative borders in Haifa. The buildings and natural areas layers originate from OpenStreetMap data, the same way as in the London example, while the administrative borders of Israel will be downloaded from a global administrative borders dataset directly through R. The buildings and natural areas layers will be named haifa_buildings
and haifa_natural
in order to not be confused with the analogous objects buildings
and natural
from the London example.
Our goal will be to create a polygon encompassing the natural areas in the vicinity of the buildings in Haifa, excluding those natural areas that are within 50 meters of the nearest building. We will follow four intermediate steps:
As a first step, we will read the Haifa buildings and natural areas layers:
> haifa_buildings = readOGR("C:\Data", "haifa_buildings") > haifa_natural = readOGR("C:\Data", "haifa_natural")
The third layer involved, the administrative boundaries, will be downloaded from the GADM database of Global Administrative Areas at http://www.gadm.org/, which is accessible using the getData
function from the raster
package:
> israel_adm = getData("GADM", country = "ISR", level = 1)
The hereby used arguments of getData
are as follows:
name
: The dataset name (for example, "GADM"
, which stands for the GADM dataset; using another dataset, "SRTM"
, will be demonstrated in the next chapter)country
(relevant for the name="GADM"
option): The country ISO3 code (a list of country codes can be obtained with getData("ISO3")
)level
(relevant for the name="GADM"
option): The level of administrative subdivision (0
: country, 1
: first subdivision, and so on)In fact, the county
layer we used earlier comes from the GADM dataset as well (only that FIPS codes have been added to its attribute table). The reason GADM was not used in the London example is that it is less accurate than the Office of National Statistics layer.
We will not need the whole israel_adm
layer, but only a subset consisting of the Haifa administrative area, which includes the city of Haifa:
> haifa_adm = israel_adm[israel_adm$NAME_1 == "Haifa", ]
Before proceeding with geometrical calculations, as usual, all three layers (haifa_adm
, haifa_buildings
, and haifa_natural
) will be reprojected to the same CRS. In this case, we are going to use the UTM Zone 36N CRS. We can obtain its parameters from the Landsat image object l_03
we read into R in one of the previous examples:
> haifa_adm = +spTransform(haifa_adm, CRS(proj4string(l_03))) > haifa_buildings = + spTransform(haifa_buildings, CRS(proj4string(l_03))) > haifa_natural = + spTransform(haifa_natural, CRS(proj4string(l_03)))
Having completed the first step, we will take a moment to plot the three layers and see what they look like:
> plot(haifa_natural, col = "lightgreen") > plot(haifa_buildings, add = TRUE) > plot(haifa_adm, add = TRUE)
The resulting graphical output (in the following screenshot) shows the Haifa administrative area border (haifa_adm
, in this case marking the Mediterranean sea coastline), the Haifa buildings (haifa_buildings
), and the natural areas (haifa_natural
, shown in green):
Proceeding with the second step, we will create a convex hull polygon, assigned to buildings_ch
, in order to define our area of interest surrounding the buildings. A convex hull is the smallest convex polygon encompassing a certain set of features (see the gray polygon in the next screenshot). A convex hull can be created using the gConvexHull
function:
> buildings_ch = gConvexHull(haifa_buildings)
The convex hull crosses the Mediterranean sea. We would like to, however, retain only those areas of buildings_ch
that are within haifa_adm
(in other words, on land). This can be achieved by using the gIntersection
function on buildings_ch
and haifa_adm
:
> buildings_ch = gIntersection(buildings_ch, haifa_adm)
Now that the bounding polygon buildings_ch
is set, we proceed to our third step. Turning to the haifa_natural
layer, we will merge all of its polygons into one polygon (since we are not interested in discerning different types of natural areas) using gUnaryUnion
, similarly to what we did in the London example:
> haifa_natural = gUnaryUnion(haifa_natural)
Then, we will use buildings_ch
to retain only those natural areas that are within our area of interest, using another gIntersection
function call:
> haifa_natural = gIntersection(haifa_natural, buildings_ch)
What remains to be done is our fourth step, which is removing the areas in haifa_natural
that are within 50 meters of the nearest building. To do this, we will first create a 50 meter buffer polygon surrounding haifa_buildings
, using the gBuffer
function (specifying the buffer size with width
):
> buildings_50m = gBuffer(haifa_buildings, width = 50)
Then, using the gDifference
function, we will calculate the area in haifa_natural
that is not within the 50 meters buffer of haifa_buildings
:
> haifa_natural = gDifference(haifa_natural, buildings_50m)
We are done. To see the resulting layers, we will plot all four of them (buildings_ch
, haifa_adm
, haifa_natural
, and haifa_buildings
) together, with the following series of plot
function calls:
> plot(buildings_ch, col = "lightgrey", border = "lightgrey") > plot(haifa_adm, add = TRUE) > plot(haifa_natural, col = "lightgreen", add = TRUE) > plot(haifa_buildings, add = TRUE)
The resulting graphical output is shown in the following screenshot:
In the preceding screenshot, we can see the bounding polygon buildings_ch
in gray, the administrative borders, as well as the buildings, in black, and natural areas (those within the bounding polygon and excluding areas within 50 meters of buildings) in green. We will continue working with the Haifa layers we have hereby created in several additional examples in subsequent chapters.
Let's now return to the London example, to complete its fourth step (which is distance calculation). Distances between each feature from one layer to each feature in a second layer can be calculated with the gDistance
function, setting byid
to TRUE
. The following expression calculates the distance from each feature in buildings
to each feature in river
:
> dist = gDistance(buildings, river, byid = TRUE)
As seen in the gContains
example, the result is a matrix:
> class(dist) [1] "matrix"
The matrix rows correspond to river
features, whereas its columns correspond to buildings
features. As the following output of dim
demonstrates, we have 1,583 buildings features in the City of London and one river
feature (remember that we dissolved the separate riverbank parts into one):
> dim(dist) [1] 1 1583
Selecting the first (and only) row of dist
will yield a numeric vector with the distances of each building to the nearest riverbank. With the following expression, we can assign the distances to a new column, named dist_river
, in the attribute table of buildings
:
> buildings$dist_river = dist[1, ]
Examining the attribute table will demonstrate that, indeed, we now have a distance-to-river entry for each of the buildings in the City of London:
> head(buildings@data) osm_id name type dist_river 16 4076420 St Brides place_of_worship 313.1239 56 4364085 Sainsbury's Head Office block 683.5640 137 4959489 30 St Mary Axe attraction 653.4159 138 4959510 Bank of England office 503.7244 139 4959544 St Paul's Cathedral cathedral 287.7122 140 4959629 Liverpool Street train_station 1009.8070
We are going to wait until Chapter 9, Advanced Visualization of Spatial Data, to graphically display this result while learning some additional visualization methods.