Elasticsearch spatial capabilities

Search servers such as Elasticsearch are usually looked at from the perspective of full-text searching. Elasticsearch, because it is marketed as part of ELK (Elasticsearch, Logstash, and Kibana), is also well known for handling large amounts of time series data. However, this is only part of the picture, and sometimes the two mentioned use cases are not enough. Imagine searching for local services. For the end user, the most important thing is the accuracy of the results. By accuracy, we mean not only proper full-text matches, but also results that are as close as possible in terms of location. In some cases this boils down to a text search on geographical names such as cities or streets, but in other cases it is very useful to be able to search on the basis of the geographical coordinates of our indexed documents. This, too, is functionality that Elasticsearch is capable of handling.

With the release of Elasticsearch 2.2, the geo_point type received a lot of changes, especially internally, where numerous optimizations were made. Prior to 2.2, a geo_point field was stored in the index as two not analyzed string values; this has changed. As of Elasticsearch 2.2, the geo_point type benefits from the improvements made in the Apache Lucene library and is now more efficient.

Mapping preparation for spatial searches

In order to discuss the spatial search functionality, let's prepare an index with a list of cities. This will be a very simple index with one type named poi (which stands for point of interest), containing the name of the city and its coordinates. The mappings are as follows:

{
  "mappings" : {
    "poi" : {
      "properties" : {
        "name" : { "type" : "string" },
        "location" : { "type" : "geo_point" }
      }
    }
  }
}

Assuming that we put this definition into the mapping1.json file, we can create an index by running the following command:

curl -XPUT localhost:9200/map -d @mapping1.json

The only new thing in the preceding mappings is the geo_point type, which is used for the location field. By using it, we can store the geographical position of our city and use spatial-based functionalities.

Example data

Our example documents1.json file with documents looks as follows:

{ "index" : { "_index" : "map", "_type" : "poi", "_id" : 1 }}
{ "name" : "New York", "location" : "40.664167, -73.938611" }
{ "index" : { "_index" : "map", "_type" : "poi", "_id" : 2 }}
{ "name" : "London", "location" : [-0.1275, 51.507222] }
{ "index" : { "_index" : "map", "_type" : "poi", "_id" : 3 }}
{ "name" : "Moscow", "location" : { "lat" : 55.75, "lon" : 37.616667 }}
{ "index" : { "_index" : "map", "_type" : "poi", "_id" : 4 }}
{ "name" : "Sydney", "location" : "-33.859972, 151.211111" }
{ "index" : { "_index" : "map", "_type" : "poi", "_id" : 5 }}
{ "name" : "Lisbon", "location" : "eycs0p8ukc7v" }

In order to perform a bulk request, we added information about the index name, type, and unique identifiers of our documents; so, we can now easily import this data using the following command:

curl -XPOST localhost:9200/_bulk --data-binary @documents1.json

One thing that we should take a closer look at is the location field. We can use various notations for the coordinates: the latitude and longitude values can be provided as a string, as a pair of numbers, or as an object. Note that the string and array methods of providing the geographical location use opposite orders for the latitude and longitude parameters. The last record shows that the coordinates can also be given as a Geohash value (the notation is described in detail at http://en.wikipedia.org/wiki/Geohash).
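For instance, all of the following snippets describe the same New York location used in the example data; the string form takes the latitude first, while the array form takes the longitude first:

"location" : "40.664167, -73.938611"
"location" : [ -73.938611, 40.664167 ]
"location" : { "lat" : 40.664167, "lon" : -73.938611 }

The Geohash form used for Lisbon encodes both values in a single string.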

Additional geo_point properties

With the release of Elasticsearch 2.2, the number of parameters that the geo_point type can accept has been reduced and is as follows:

  • geohash: Boolean parameter telling Elasticsearch whether the .geohash field should be created. Defaults to false unless geohash_prefix is used.
  • geohash_precision: Maximum size of geohash and geohash_prefix.
  • geohash_prefix: Boolean parameter telling Elasticsearch to index the geohash and its prefixes. Defaults to false.
  • ignore_malformed: Boolean parameter telling Elasticsearch to ignore badly formatted geo_point values instead of rejecting the whole document. Defaults to false, which means that badly formatted geo_point data will result in an indexation error for the whole document.
  • lat_lon: Boolean parameter telling Elasticsearch to index the spatial data in two separate fields called .lat and .lon. Defaults to false.
  • precision_step: Parameter allowing control over how our numeric geographical points will be indexed.

Keep in mind that the geohash-related and lat_lon-related properties have been kept for backward-compatibility reasons. Users can still set them; however, queries will not use the resulting fields, relying instead on the highly optimized data structure that the geo_point type builds during indexing.
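As an illustration, a mapping that enables a few of these options could look like the following (a minimal sketch based on mapping1.json; the parameter values were chosen arbitrarily):

{
  "mappings" : {
    "poi" : {
      "properties" : {
        "name" : { "type" : "string" },
        "location" : {
          "type" : "geo_point",
          "geohash_prefix" : true,
          "geohash_precision" : 10,
          "lat_lon" : true,
          "ignore_malformed" : true
        }
      }
    }
  }
}

With such a mapping, Elasticsearch would additionally store the location.geohash, location.lat, and location.lon fields, and badly formatted points would be skipped instead of causing the whole document to be rejected.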

Sample queries

Now let's look at several examples of using coordinates and solving common requirements in modern applications that require geographical data searching along with full-text searching.

Note

If you are interested in all the geospatial queries that are available for Elasticsearch users, refer to the official documentation available at https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-queries.html.

Distance-based sorting

Let's start with a very common requirement: sorting the returned results by distance from a given point. In our example, we want to get all the cities and sort them by their distances from the capital of France, Paris. To do this, we send the following query to Elasticsearch:

curl -XGET localhost:9200/map/_search?pretty -d '{
  "query" : {
    "match_all" : {}
  },
  "sort" : [{
    "_geo_distance" : {
      "location" : "48.8567, 2.3508",
      "unit" : "km"
    }
  }]
}'

If you remember the Sorting data section from Chapter 4, Extending Your Querying Knowledge, you'll notice that the format is slightly different. We use the _geo_distance key to indicate sorting by distance. We must provide the base location (the location attribute, which in our case holds the location of Paris), and we need to specify the unit in which the distances should be returned. The most commonly used values are km and mi, which stand for kilometers and miles, respectively. The result of such a query will be as follows:

{
  "took" : 5,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 5,
    "max_score" : null,
    "hits" : [ {
      "_index" : "map",
      "_type" : "poi",
      "_id" : "2",
      "_score" : null,
      "_source" : {
        "name" : "London",
        "location" : [ -0.1275, 51.507222 ]
      },
      "sort" : [ 343.17487356850313 ]
    }, {
      "_index" : "map",
      "_type" : "poi",
      "_id" : "5",
      "_score" : null,
      "_source" : {
        "name" : "Lisbon",
        "location" : "eycs0p8ukc7v"
      },
      "sort" : [ 1452.9506736367805 ]
    }, {
      "_index" : "map",
      "_type" : "poi",
      "_id" : "3",
      "_score" : null,
      "_source" : {
        "name" : "Moscow",
        "location" : {
          "lat" : 55.75,
          "lon" : 37.616667
        }
      },
      "sort" : [ 2483.837565935267 ]
    }, {
      "_index" : "map",
      "_type" : "poi",
      "_id" : "1",
      "_score" : null,
      "_source" : {
        "name" : "New York",
        "location" : "40.664167, -73.938611"
      },
      "sort" : [ 5832.645958617513 ]
    }, {
      "_index" : "map",
      "_type" : "poi",
      "_id" : "4",
      "_score" : null,
      "_source" : {
        "name" : "Sydney",
        "location" : "-33.859972, 151.211111"
      },
      "sort" : [ 16978.094780773998 ]
    } ]
  }
}

As with the other examples of sorting, Elasticsearch shows information about the value used for sorting. Let's look at the first returned record. As we can see, the distance between Paris and London is about 343 km, and if you check a traditional map, you will see that this is true.
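The sort definition also accepts the standard sorting options. For example, to get the most distant cities first and have the distances reported in miles, a query such as the following sketch should work (order is the usual sort order property):

curl -XGET localhost:9200/map/_search?pretty -d '{
  "query" : {
    "match_all" : {}
  },
  "sort" : [{
    "_geo_distance" : {
      "location" : "48.8567, 2.3508",
      "order" : "desc",
      "unit" : "mi"
    }
  }]
}'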

Bounding box filtering

The next example that we want to show is narrowing down the results to a selected area that is bounded by a given rectangle. This is very handy if we want to show results on the map or when we allow a user to mark the map area for searching. You already read about filters in the Filtering your results section of Chapter 4, Extending Your Querying Knowledge, but there we didn't mention spatial filters. The following query shows how we can filter by using the bounding box:

curl -XGET localhost:9200/map/_search?pretty -d '{
  "query" : {
    "bool" : {
      "must" : { "match_all": {}},
      "filter" : {
        "geo_bounding_box" : {
          "location" : {
            "top_left" : "52.4796, -1.903",
            "bottom_right" : "48.8567, 2.3508"
          }
        }
      }
    }
  }
}'

In the preceding example, we selected a map fragment between Birmingham and Paris by providing the top-left and bottom-right corner coordinates. These two corners are enough to specify any rectangle we want, and Elasticsearch will do the rest of the calculation for us. The following screenshot shows the specified rectangle on the map:

[Figure: the selected rectangle between Birmingham and Paris shown on the map]

As we can see, the only city from our data that meets the criteria is London. So, let's check whether Elasticsearch knows this by running the preceding query and looking at the returned results:

{
  "took" : 38,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "map",
      "_type" : "poi",
      "_id" : "2",
      "_score" : 1.0,
      "_source" : {
        "name" : "London",
        "location" : [ -0.1275, 51.507222 ]
      }
    } ]
  }
}

As you can see, again Elasticsearch agrees with the map.

Limiting the distance

The last example shows another common requirement: limiting the results to places located no farther than a defined distance from a given point. For example, if we want to limit our results to all the cities within a 500 km radius of Paris, we can use the following query:

curl -XGET localhost:9200/map/_search?pretty -d '{
  "query" : {
    "bool" : {
      "must" : { "match_all": {}},
      "filter" : {
        "geo_distance" : {
          "location" : "48.8567, 2.3508",
          "distance" : "500km"
        }
      }
    }
  }
}'

If everything goes well, Elasticsearch should return only a single record for the preceding query, and that record should again be London. However, we will leave checking this to you as an exercise.

Arbitrary geo shapes

Sometimes, using a single geographical point or a single rectangle is just not enough. In such cases something more sophisticated is needed, and Elasticsearch addresses this by giving you the possibility to define shapes. In order to show you how we can leverage custom shape-limiting in Elasticsearch, we need to modify our index or create a new one and introduce the geo_shape type. Our new mapping looks as follows (we will use this to create an index called map2):

{
  "mappings" : {
    "poi" : {
      "properties" : {
        "name" : { "type" : "string", "index": "not_analyzed" },
        "location" : { "type" : "geo_shape" }
      }
    }
  }
}

Assuming we wrote the preceding mapping definition to the mapping2.json file, we can create an index by using the following command:

curl -XPUT localhost:9200/map2 -d @mapping2.json

Note

Elasticsearch allows us to set several attributes for the geo_shape type. The most commonly used is the precision parameter. During indexing, the shapes have to be converted to a set of terms; the more accuracy is required, the more terms are generated, which is directly reflected in the index size and indexing performance. Precision can be defined using the following units: in (inch), yd (yard), mi (miles), km (kilometers), m (meters), cm (centimeters), or mm (millimeters). By default, the precision is set to 50m.
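For example, if a resolution of one kilometer is good enough for our use case, we can trade accuracy for a smaller index by setting the precision explicitly (a sketch based on the preceding mapping; the value was chosen arbitrarily):

{
  "mappings" : {
    "poi" : {
      "properties" : {
        "name" : { "type" : "string", "index": "not_analyzed" },
        "location" : { "type" : "geo_shape", "precision" : "1km" }
      }
    }
  }
}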

Next, let's change our example data to match our new index structure and create the documents2.json file with the following contents:

{ "index" : { "_index" : "map2", "_type" : "poi", "_id" : 1 }}
{ "name" : "New York", "location" : { "type": "point", "coordinates": [-73.938611, 40.664167] }}
{ "index" : { "_index" : "map2", "_type" : "poi", "_id" : 2 }}
{ "name" : "London", "location" : { "type": "point", "coordinates": [-0.1275, 51.507222] }}
{ "index" : { "_index" : "map2", "_type" : "poi", "_id" : 3 }}
{ "name" : "Moscow", "location" : { "type": "point", "coordinates": [ 37.616667, 55.75]}}
{ "index" : { "_index" : "map2", "_type" : "poi", "_id" : 4 }}
{ "name" : "Sydney", "location" : { "type": "point", "coordinates": [151.211111, -33.865143]}}
{ "index" : { "_index" : "map2", "_type" : "poi", "_id" : 5 }}
{ "name" : "Lisbon", "location" : { "type": "point", "coordinates": [-9.142685, 38.736946] }}

The structure of a field of the geo_shape type is different from that of geo_point. It follows the GeoJSON syntax (http://en.wikipedia.org/wiki/GeoJSON), which allows us to define various geographical types. Now it's time to index our data:

curl -XPOST localhost:9200/_bulk --data-binary @documents2.json

Let's sum up the shape types that we can use during indexing and querying, or at least the ones that we think are the most useful.

Point

A point is defined by an array in which the first element is the longitude and the second is the latitude. An example of such a shape is as follows:

{
  "type": "point",
  "coordinates": [-0.1275, 51.507222]
}

Envelope

An envelope defines a box given by the coordinates of its upper-left and bottom-right corners. An example of such a shape is as follows:

{
  "type": "envelope",
  "coordinates": [[ -0.087890625, 51.50874245880332 ], [ 2.4169921875, 48.80686346108517 ]]
}

Polygon

A polygon is defined by a list of points that are connected to form its outline. The first and the last point in the array must be the same so that the shape is closed. An example of such a shape is as follows:

{
  "type": "polygon",
  "coordinates": [[
    [-5.756836, 49.991408],
    [-7.250977, 55.124723],
    [1.845703, 51.500194],
    [-5.756836, 49.991408]
  ]]
}

If you look closely at the shape definition, you will find an additional level of arrays. Thanks to this, you can define more than a single ring of points. In such a case, the first ring defines the base shape and the remaining rings define the shapes that will be excluded from the base shape (holes).
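For example, the following sketch uses the same outer triangle as before, but excludes a smaller triangular area from it (the coordinates of the hole were made up for this illustration):

{
  "type": "polygon",
  "coordinates": [
    [
      [-5.756836, 49.991408],
      [-7.250977, 55.124723],
      [1.845703, 51.500194],
      [-5.756836, 49.991408]
    ],
    [
      [-4.5, 51.5],
      [-5.5, 53.0],
      [-3.0, 52.0],
      [-4.5, 51.5]
    ]
  ]
}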

Multipolygon

The multipolygon shape allows us to create a shape that consists of multiple polygons. An example of such a shape is as follows:

{
  "type": "multipolygon",
  "coordinates": [
    [[
       [-5.756836, 49.991408],
       [-7.250977, 55.124723],
       [1.845703, 51.500194],
       [-5.756836, 49.991408]
    ]], [[ 
       [-0.087890625, 51.50874245880332],
       [2.4169921875, 48.80686346108517],
       [3.88916015625, 51.01375465718826],
       [-0.087890625, 51.50874245880332]
    ]] ]
}

The multipolygon shape contains multiple polygons and follows the same rules as the polygon type. So, we can have multiple polygons and, in addition to this, each of them can include exclusion shapes.

An example usage

Now that we have our index with the geo_shape fields, we can check which cities are located in the UK. The query that will allow us to do this looks as follows:

curl -XGET localhost:9200/map2/_search?pretty -d '{
  "query" : {
    "bool" : {
      "must" : { "match_all": {}},
      "filter": {
        "geo_shape": {
          "location": {
            "shape": {
              "type": "polygon",
              "coordinates": [[
                [-5.756836, 49.991408], [-7.250977, 55.124723],
                [-3.955078, 59.352096], [1.845703, 51.500194],
                [-5.756836, 49.991408]
              ]]
            }
          }
        }
      }
    }
  }
}'

The polygon type defines the boundaries of the UK (in a very, very imprecise way), and Elasticsearch's response is as follows:

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "map2",
      "_type" : "poi",
      "_id" : "2",
      "_score" : 1.0,
      "_source" : {
        "name" : "London",
        "location" : {
          "type" : "point",
          "coordinates" : [ -0.1275, 51.507222 ]
        }
      }
    } ]
  }
}

As far as we know, the response is correct.

Storing shapes in the index

Usually, shape definitions are complex, and the defined areas don't change too often (for example, the boundaries of the UK). In such cases, it is convenient to define the shapes in the index and use them in queries. This is possible, and we will now discuss how to do it. As usual, we will start with the appropriate mapping, which is as follows:

{
  "mappings" : {
    "country": {
      "properties": {
        "name": { "type": "string", "index": "not_analyzed" },
        "area": { "type": "geo_shape" }
      }
    }
  }
}

This mapping is similar to the mapping used previously. We have only changed the field name and saved it in the mapping3.json file. Let's create a new index by running the following command:

curl -XPUT localhost:9200/countries -d @mapping3.json

The example data that we will use looks as follows (stored in the file called documents3.json):

{"index": { "_index": "countries", "_type": "country", "_id": 1 }}
{"name": "UK", "area": {"type": "polygon", "coordinates": [[ [-5.756836, 49.991408], [-7.250977, 55.124723], [-3.955078, 59.352096], [1.845703, 51.500194], [-5.756836, 49.991408] ]]}}
{"index": { "_index": "countries", "_type": "country", "_id": 2 }}
{"name": "France", "area": { "type":"polygon", "coordinates": [ [ [ 3.1640625, 42.09822241118974 ], [ -1.7578125, 43.32517767999296 ], [ -4.21875, 48.22467264956519 ], [ 2.4609375, 50.90303283111257 ], [ 7.998046875, 48.980216985374994 ], [ 7.470703125, 44.08758502824516 ], [ 3.1640625, 42.09822241118974 ] ] ] }}
{"index": { "_index": "countries", "_type": "country", "_id": 3 }}
{"name": "Spain", "area": { "type": "polygon", "coordinates": [ [ [ 3.33984375, 42.22851735620852 ], [ -1.845703125, 43.32517767999296 ], [ -9.404296875, 43.19716728250127 ], [ -6.6796875, 41.57436130598913 ], [ -7.3828125, 36.87962060502676 ], [ -2.109375, 36.52729481454624 ], [ 3.33984375, 42.22851735620852 ] ] ] }}

To index the data, we just need to run the following command:

curl -XPOST localhost:9200/_bulk --data-binary @documents3.json

As you can see in the data, each document contains a polygon type. The polygons define the area of the given countries (again, it is far from being accurate). If you remember, the first point of a shape needs to be the same as the last one so that the shape is closed. Now, let's change our query to include the shapes from the index. Our new query looks as follows:

curl -XGET localhost:9200/map2/_search?pretty -d '{
  "query" : {
    "bool" : {
      "must" : { "match_all": {}},
      "filter": {
        "geo_shape": {
          "location": {
            "indexed_shape": {
              "index": "countries",
              "type": "country",
              "path": "area",
              "id": "1"
            }
          }
        }
      }
    }
  }
}'

When comparing these two queries, we can see that the shape object has changed to indexed_shape. We need to tell Elasticsearch where to look for the shape. We can do this by defining the index (the index property, which defaults to shape), the type (the type property), and the path (the path property, which defaults to shape). The last missing piece is the id property of the shape; in our case, this is 1. However, if you want to index more shapes, we advise you to use the shape's name as its identifier.
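For example, to store the UK shape under the identifier UK instead of 1, we could index it as follows (a sketch reusing the UK polygon from documents3.json):

curl -XPUT localhost:9200/countries/country/UK -d '{
  "name": "UK",
  "area": { "type": "polygon", "coordinates": [[ [-5.756836, 49.991408], [-7.250977, 55.124723], [-3.955078, 59.352096], [1.845703, 51.500194], [-5.756836, 49.991408] ]]}
}'

The indexed_shape section of the query would then reference "id" : "UK" instead of "id" : "1", which is easier to remember and maintain.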
