Using span queries

ElasticSearch leverages Lucene span queries, which allow us to create queries that match when some tokens or phrases are placed near other tokens or phrases. The standard non-span queries are not position aware; phrase queries allow that, but only to a limited extent.

There are five span queries exposed in ElasticSearch:

  • Span term query
  • Span first query
  • Span near query
  • Span or query
  • Span not query

Before we continue with the description, let us index a new document that we will be using in order to show how span queries work. To do that, we send the following command to ElasticSearch:

curl -XPOST 'localhost:9200/library/book/5' -d '{
  "title" : "Test book",
  "author" : "Test author",
  "description" : "The world breaks everyone, and afterward, some are strong at the broken places"
}'

As you can see, we used ElasticSearch's ability to update our index structure automatically and we've added the description field. We did that to have a field that has more content than a book's title usually has.

What is a span?

A span, in our context, is a starting and ending token position in a field. For example, in our case, world breaks everyone can be a single span, and world on its own can be a single span too. As you may know, during analysis Lucene stores, in addition to each token, some additional information, such as the token's position relative to the previous token. Position information combined with terms allows us to construct spans using the ElasticSearch span queries (which are mapped to Lucene span queries). In the next few sections, we will learn how to construct spans by using different span queries and how to control which documents are matched.
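To make the idea of a span more concrete, here is a minimal Python sketch. It is only a toy model, not what Lucene does internally: the tokenize and phrase_span helpers are hypothetical names, the analyzer is assumed to simply lowercase the text and split it into words, and a span is assumed to be a (start, end) pair with an exclusive end position.

```python
import re

DESCRIPTION = ("The world breaks everyone, and afterward, "
               "some are strong at the broken places")

def tokenize(text):
    """Toy analyzer (an assumption): lowercase words paired with their positions."""
    words = re.findall(r"[a-z]+", text.lower())
    return list(enumerate(words))

def phrase_span(tokens, phrase):
    """Return (start, end) of the first occurrence of phrase, end exclusive, or None."""
    wanted = phrase.lower().split()
    terms = [term for _, term in tokens]
    for start in range(len(terms) - len(wanted) + 1):
        if terms[start:start + len(wanted)] == wanted:
            return (start, start + len(wanted))
    return None

tokens = tokenize(DESCRIPTION)
print(tokens[:4])                                    # first four (position, term) pairs
print(phrase_span(tokens, "world breaks everyone"))  # a span covering three terms
print(phrase_span(tokens, "world"))                  # a span covering a single term
```

In this model, world breaks everyone occupies positions 1 to 3, so its span is (1, 4), while world alone is the span (1, 2). A real analyzer may tokenize differently, so the exact positions are illustrative only.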

Span term query

A span term query is similar to the already-discussed term query. On its own, it works just like the term query: it matches a term. Its definition is simple and looks as follows (some parts of the query are omitted on purpose, because we will discuss them shortly):

{
  "query" : {
    ...
    {
      "span_term" : {
        "description" : {
          "value" : "world",
          "boost" : 5.0
        }
      }
    }
  }
}

As you can see, this query is very similar to a term query. The preceding query is run against the description field and returns the documents that contain the term world. We also specified a boost, which is allowed here just as in a term query. As with a term query, we can use a simplified version if we don't want to use a boost:

{
  "query" : {
    ...
    {
      "span_term" : {
        "description" : "world"
      }
    }
  }
}
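Seen as a producer of spans, a span term query yields one single-term span for every occurrence of the term. The following Python sketch illustrates that under simplifying assumptions: span_term is a hypothetical helper, the text is split into lowercase words instead of being run through a real analyzer, and spans are (start, end) pairs with an exclusive end. Boosts only affect scoring, so they are left out.

```python
import re

DESCRIPTION = ("The world breaks everyone, and afterward, "
               "some are strong at the broken places")

# Toy analysis (an assumption): lowercase words, positions counted from 0.
TERMS = re.findall(r"[a-z]+", DESCRIPTION.lower())

def span_term(term):
    """Return one (start, end) span, end exclusive, per occurrence of term."""
    return [(pos, pos + 1) for pos, t in enumerate(TERMS) if t == term]

print(span_term("world"))    # a single occurrence gives a single span
print(span_term("the"))      # two occurrences give two spans
print(span_term("missing"))  # no occurrence, so no spans and no match
```

A document matches the toy query whenever the returned list is not empty. The other span queries work on such lists of spans.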

Span first query

The span first query matches only documents in which the wrapped span query matches within the first positions of the field. To define a span first query, we need to nest another span query inside it, for example, the span term query that we already know. So, let's find documents that have the term world in the first two positions of the description field. We do that by sending the following query:

{
  "query" : {
    "span_first" : {
      "match" : {
        "span_term" : { "description" : "world" }
      },
      "end" : 2
    }
  }
}

In the results, we will get the document that we indexed at the beginning of this section. The match section of the span first query should contain a span query, and its match has to end no later than the position specified by the end parameter.

So, if we understand everything well, setting the end parameter to 1 should no longer return our document, because world is the second word in the description field. Let's check it by sending the following query:

{
  "query" : {
    "span_first" : {
      "match" : {
        "span_term" : { "description" : "world" }
      },
      "end" : 1
    }
  }    
}

The response to the preceding query will be as follows:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

So it's working as expected, hurrah!
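The difference between end set to 2 and end set to 1 can be reproduced with a small Python sketch. It is a toy model under stated assumptions: span_term and span_first are hypothetical helpers, words are split naively instead of using a real analyzer, and a span is a (start, end) pair whose exclusive end must not exceed the end parameter.

```python
import re

DESCRIPTION = ("The world breaks everyone, and afterward, "
               "some are strong at the broken places")

# Toy analysis (an assumption): lowercase words, positions counted from 0.
TERMS = re.findall(r"[a-z]+", DESCRIPTION.lower())

def span_term(term):
    """Return one (start, end) span, end exclusive, per occurrence of term."""
    return [(pos, pos + 1) for pos, t in enumerate(TERMS) if t == term]

def span_first(spans, end):
    """Keep only the spans that finish at or before the given end position."""
    return [span for span in spans if span[1] <= end]

world = span_term("world")
print(span_first(world, 2))  # world ends at position 2, so the span is kept
print(span_first(world, 1))  # with end = 1 the span is dropped
```

In this model, world is the span (1, 2), so it survives end = 2 and is filtered out by end = 1, which agrees with the two responses we have seen.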

Span near query

The span near query allows us to match documents in which several spans occur near each other. It is also a compound query that wraps other span queries. For example, if we want to find documents that have the term world near the term everyone, we can run the following query:

{
  "query" : {
    "span_near" : {
      "clauses" : [
        { "span_term" : { "description" : "world" } },
        { "span_term" : { "description" : "everyone" } }
      ],
      "slop" : 0,
      "in_order" : true
    }
  }
}

As you can see, we specified our queries in the clauses section of the span near query. It is an array of other span queries. The slop parameter specified in the preceding query is similar to the one used in phrase queries: it controls the number of terms allowed between spans. The in_order parameter can be used to limit the matches to those documents that match our queries in the same order in which they were defined, so in our case, we will get documents that have world followed by everyone, but not everyone followed by world, in the description field.

Now let's get back to our query, which returns 0 results. If you look at our example document, you will notice that an additional term, breaks, is present between the terms world and everyone. Because we set the slop parameter to 0 (slop was discussed during the description of the phrase query in Chapter 2, Searching Your Data), that extra term prevents a match. If we increase slop to 1, we will get our result. To test it, let's send the following query:

{
  "query" : {
    "span_near" : {
      "clauses" : [
        { "span_term" : { "description" : "world" } },
        { "span_term" : { "description" : "everyone" } }
      ],
      "slop" : 1,
      "in_order" : true
    }
  }
}

And the results returned by ElasticSearch are as follows:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.095891505,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "5",
      "_score" : 0.095891505,
      "_source" : {
        "title" : "Test book",
        "author" : "Test author",
        "description" : "The world breaks everyone, and afterward, some are strong at the broken places"
      }
    } ]
  }
}

As you can see, it works!
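The effect of slop and in_order can be illustrated with a short Python sketch. It is a simplified model, not Lucene's algorithm: span_term and span_near are hypothetical helpers, only two non-overlapping clauses are handled, words are split naively, and slop is assumed to be the number of positions that may lie between the two spans.

```python
import re

DESCRIPTION = ("The world breaks everyone, and afterward, "
               "some are strong at the broken places")

# Toy analysis (an assumption): lowercase words, positions counted from 0.
TERMS = re.findall(r"[a-z]+", DESCRIPTION.lower())

def span_term(term):
    """Return one (start, end) span, end exclusive, per occurrence of term."""
    return [(pos, pos + 1) for pos, t in enumerate(TERMS) if t == term]

def span_near(first, second, slop, in_order=True):
    """Combine spans from two clauses that lie at most slop positions apart."""
    matches = []
    for a in first:
        for b in second:
            if a[1] <= b[0]:                     # a comes before b
                gap = b[0] - a[1]
            elif b[1] <= a[0] and not in_order:  # b before a, allowed when unordered
                gap = a[0] - b[1]
            else:
                continue
            if gap <= slop:
                matches.append((min(a[0], b[0]), max(a[1], b[1])))
    return matches

world, everyone = span_term("world"), span_term("everyone")
print(span_near(world, everyone, slop=0))                  # breaks is in the way
print(span_near(world, everyone, slop=1))                  # one term in between is allowed
print(span_near(everyone, world, slop=1))                  # wrong order, no match
print(span_near(everyone, world, slop=1, in_order=False))  # order ignored
```

With slop set to 0 the toy model finds nothing, and with slop set to 1 it produces the span (1, 4) covering world breaks everyone, which mirrors the two results above.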

Span or query

The span or query allows us to wrap other span queries and aggregate the matches of all of them; in other words, it makes a union of span queries. It also uses the clauses array to specify the span queries whose matches should be aggregated. For example, if we want to get the documents that have the term world in the first two positions of the description field, or the ones that have the term world no further than a single position from the term everyone, we send the following query:

{
  "query" : {
    "span_or" : {
      "clauses" : [
        {
          "span_first" : {
            "match" : {
              "span_term" : { "description" : "world" }
            },
            "end" : 2
          }
        },
        {
          "span_near" : {
            "clauses" : [
               { "span_term" : { "description" : "world" } },
               { "span_term" : { "description" : "everyone" } }
            ],
            "slop" : 1,
            "in_order" : true
          }
        }
      ]
    }
  }
}

The result of the preceding query should be our example document that we indexed at the beginning:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.16608895,
    "hits" : [ {
      "_index" : "library",
      "_type" : "book",
      "_id" : "5",
      "_score" : 0.16608895,
      "_source" : {
        "title" : "Test book",
        "author" : "Test author",
        "description" : "The world breaks everyone, and afterward, some are strong at the broken places"
      }
    } ]
  }
}
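The union behaviour of the span or query can be sketched in a few lines of Python. This is a toy model only: span_or is a hypothetical helper, spans are (start, end) pairs with an exclusive end, and the input spans are written by hand on the assumption that the description is split into words counted from 0, so world is at position 1 and everyone at position 3.

```python
def span_or(*clauses):
    """Return the union of the spans produced by all clauses, sorted by position."""
    merged = set()
    for spans in clauses:
        merged.update(spans)
    return sorted(merged)

# Hand-written spans for the example document (assumed word positions):
world_in_first_two = [(1, 2)]   # world, ending within the first two positions
world_near_everyone = [(1, 4)]  # world ... everyone with one term in between

print(span_or(world_in_first_two, world_near_everyone))  # spans from both clauses
print(span_or(world_in_first_two, []))                   # one matching clause is enough
print(span_or([], []))                                   # no clause matches, no spans
```

A document matches as soon as any clause contributes at least one span, which is why our example document is returned by the preceding query.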

Span not query

The last type of span query, the span not query, allows us to specify two sections of queries. The first is the include section, which specifies the span query that should be matched, and the second section, exclude, specifies the span query whose matches shouldn't overlap with the first one. To keep it simple, if the query from the exclude section matches the same span (or a part of it) as the query from the include section, that span is not treated as a match, and a document with no remaining matching spans won't be returned by the span not query. Each of those sections takes a span query, which can itself be a compound query that wraps several others.

To illustrate that, let's create a query that returns all the documents that have a span consisting of the single term breaks in the description field. At the same time, let's exclude that span whenever it overlaps a span in which the terms world and everyone are at most a single position apart:

{
  "query" : {
    "span_not" : {
      "include" : {
        "span_term" : { "description" : "breaks" }
      },
      "exclude" : {
        "span_near" : {
            "clauses" : [
              { "span_term" : { "description" : "world" } },
              { "span_term" : { "description" : "everyone" } }
            ],
            "slop" : 1
        }
      }
    }
  }
}

And the result is as follows:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

As you may have noticed, the result of the query is as we expected: our document wasn't found because the span matched by the query from the exclude section overlaps the span matched by the query from the include section.
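The overlap rule is easy to check by hand with a small Python sketch. It is a toy model under stated assumptions: overlaps and span_not are hypothetical helpers, spans are (start, end) pairs with an exclusive end, and the positions are written by hand assuming the description is split into words counted from 0 (world at 1, breaks at 2, everyone at 3, broken at 11).

```python
def overlaps(a, b):
    """Two end-exclusive spans overlap when each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]

def span_not(include, exclude):
    """Keep the include spans that do not overlap any exclude span."""
    return [span for span in include
            if not any(overlaps(span, other) for other in exclude)]

# Hand-written spans for the example document (assumed word positions):
breaks = [(2, 3)]               # the single term breaks
broken = [(11, 12)]             # the single term broken
world_near_everyone = [(1, 4)]  # world ... everyone with slop 1

print(span_not(breaks, world_near_everyone))  # breaks lies inside the excluded span
print(span_not(broken, world_near_everyone))  # broken is far away, so it is kept
print(span_not(breaks, []))                   # nothing excluded, breaks is kept
```

The span for breaks sits inside the span for world breaks everyone, so nothing is left and the document is not returned. A term outside that range, such as broken, would still match.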

Performance considerations

A few words at the end of the discussion about span queries—remember that they are more costly when it comes to processing power, because not only do terms have to be matched, but also positions have to be calculated and checked.
