Elasticsearch leverages Lucene span queries, which let us match documents in which some tokens or phrases occur near other tokens or phrases. We can call them position-aware queries. The standard non-span queries are not position aware; phrase queries allow this, but only to a limited extent. So, for a standard query in Elasticsearch and the underlying Lucene, it doesn't matter whether a term is at the beginning of the sentence, at the end, or near another term. When using span queries, it does matter.
The following span queries are exposed in Elasticsearch:
span term query
span first query
span near query
span or query
span not query
span within query
span containing query
span multi query
Before we continue with the description, let's index a document to a completely new index that we will use to show how span queries work. To do this, we use the following command:
curl -XPUT 'localhost:9200/spans/book/1' -d '{ "title" : "Test book", "author" : "Test author", "description" : "The world breaks everyone, and afterward, some are strong at the broken places" }'
A span, in our context, is a starting and ending token position in a field. For example, in our case, the world breaks everyone could be a single span, and world alone can be a single span too. As you may know, during analysis, Lucene records additional attributes along with each token, such as its position in the token stream. Position information combined with the terms allows us to construct spans using Elasticsearch span queries (which are mapped to Lucene span queries). In the next few pages, we will learn how to construct spans using different span queries and how to control which documents are matched.
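To make token positions more concrete, here is a minimal Python sketch that approximates what analysis produces for our description field. The tokenize and span_text helpers are hypothetical names introduced only for illustration, and the sketch assumes simple lowercase word splitting with no stop word removal and positions counted from 0, which is only an approximation of a real analyzer.

```python
import re

DESCRIPTION = ("The world breaks everyone, and afterward, "
               "some are strong at the broken places")

def tokenize(text):
    """Return (position, token) pairs; a rough stand-in for analysis."""
    return list(enumerate(re.findall(r"\w+", text.lower())))

def span_text(text, start, end):
    """Return the tokens covered by the span [start, end)."""
    tokens = [token for _, token in tokenize(text)]
    return " ".join(tokens[start:end])

if __name__ == "__main__":
    # Print each token together with its position in the token stream.
    for position, token in tokenize(DESCRIPTION):
        print(position, token)
    # A span is just a start and an end position over those tokens.
    print(span_text(DESCRIPTION, 0, 4))
```

In this model, a span is simply a pair of positions, so the span from 0 to 4 covers the first four tokens of the field.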
The span_term query is the basic building block for the other span queries. A span_term query is similar to the already discussed term query. On its own, it works just like the term query: it matches a term. Its definition is simple and looks as follows (we omitted some parts of the query on purpose, because we will discuss them later):
{ "query" : { ... "span_term" : { "description" : { "value" : "world", "boost" : 5.0 } } } }
As you can see, it is very similar to the standard term query. The preceding query is run against the description field and returns the documents that contain the term world. We also specified the boost, which is allowed as well.
One thing to remember is that the span_term query, like the standard term query, is not analyzed.
The span first query allows us to match documents that have matches only in the first positions of the field. To define a span first query, we need to nest another span query inside it; for example, the span term query we already know. So, let's find the document that has the term world in the first two positions in the description field. We do that by sending the following query:
{ "query" : { "span_first" : { "match" : { "span_term" : { "description" : "world" } }, "end" : 2 } } }
In the results, we will get the document that we indexed at the beginning of this section. The match section of the span first query must contain a span query, and its match has to end no later than the position specified by the end parameter.
To check our understanding: if we set the end parameter to 1, the previous query shouldn't return our document. Let's verify that by sending the following query:
{ "query" : { "span_first" : { "match" : { "span_term" : { "description" : "world" } }, "end" : 1 } } }
The response to the preceding query will be as follows:
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 0, "max_score" : null, "hits" : [ ] } }
So it is working as expected. This is because the first term in the description field is the term the, and not the term world, which we searched for.
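The effect of the end parameter can be sketched in a few lines of Python. This is a simplified model, not Lucene's implementation: it assumes lowercase word splitting, positions counted from 0, and that a one-term span starting at position p ends at p + 1. The term_spans and span_first functions are hypothetical helpers named after the queries they imitate.

```python
import re

DESCRIPTION = ("The world breaks everyone, and afterward, "
               "some are strong at the broken places")

def term_spans(text, term):
    """Spans [start, end) for each occurrence of term (end = start + 1)."""
    tokens = re.findall(r"\w+", text.lower())
    return [(pos, pos + 1) for pos, tok in enumerate(tokens) if tok == term]

def span_first(spans, end):
    """Keep only the spans that finish at or before the end position."""
    return [span for span in spans if span[1] <= end]

if __name__ == "__main__":
    world = term_spans(DESCRIPTION, "world")
    print(span_first(world, 2))  # world occupies position 1 and ends at 2
    print(span_first(world, 1))  # only the first token fits when end is 1
```

Under these assumptions, world is the second token, so it fits when end is 2 but not when end is 1, which agrees with the empty response shown above.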
The span near query allows us to match documents that have spans near each other. We can call this query a compound query, as it wraps other span queries. For example, if we want to find documents that have the term world near the term everyone, we will run the following query:
{ "query" : { "span_near" : { "clauses" : [ { "span_term" : { "description" : "world" } }, { "span_term" : { "description" : "everyone" } } ], "slop" : 0, "in_order" : true } } }
As you can see, we specify our queries in the clauses section of the span near query. It is an array of other span queries. The slop parameter defines the allowed number of terms between the spans. The in_order parameter can be used to limit the matches only to those documents that match our queries in the same order that they were defined in. So, in our case, we will get documents that have world everyone, but not everyone world, in the description field.
Let's get back to our query. Right now, it would return 0 results. If you look at our example document, you will notice that an additional term is present between the terms world and everyone, and we set the slop parameter to 0 (slop was discussed during the phrase query description). If we increase it to 1, we will get our result. To test it, let's send the following query:
{ "query" : { "span_near" : { "clauses" : [ { "span_term" : { "description" : "world" } }, { "span_term" : { "description" : "everyone" } } ], "slop" : 1, "in_order" : true } } }
The results returned by Elasticsearch are as follows:
{ "took" : 6, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 0.10848885, "hits" : [ { "_index" : "spans", "_type" : "book", "_id" : "1", "_score" : 0.10848885, "_source" : { "title" : "Test book", "author" : "Test author", "description" : "The world breaks everyone, and afterward, some are strong at the broken places" } } ] } }
As we can see, the altered query successfully returned our indexed document.
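The role of slop and in_order can be illustrated with a small Python model. It is a sketch under stated assumptions rather than Lucene's algorithm: it handles only two clauses, skips overlapping spans, and measures the gap as the number of positions between the end of one span and the start of the other. The names term_spans and span_near are hypothetical.

```python
import re

DESCRIPTION = ("The world breaks everyone, and afterward, "
               "some are strong at the broken places")

def term_spans(text, term):
    """Spans [start, end) for each occurrence of term (end = start + 1)."""
    tokens = re.findall(r"\w+", text.lower())
    return [(pos, pos + 1) for pos, tok in enumerate(tokens) if tok == term]

def span_near(first, second, slop, in_order=True):
    """Combine two span lists when the gap between them is at most slop."""
    matches = []
    for a in first:
        for b in second:
            if a[1] <= b[0]:                     # a comes before b
                gap = b[0] - a[1]
            elif b[1] <= a[0] and not in_order:  # b before a, order ignored
                gap = a[0] - b[1]
            else:                                # wrong order or overlap
                continue
            if gap <= slop:
                matches.append((min(a[0], b[0]), max(a[1], b[1])))
    return matches

if __name__ == "__main__":
    world = term_spans(DESCRIPTION, "world")
    everyone = term_spans(DESCRIPTION, "everyone")
    print(span_near(world, everyone, slop=0))
    print(span_near(world, everyone, slop=1))
```

In this model, the term breaks sits between world and everyone, so the gap is 1: a slop of 0 yields no span, while a slop of 1 yields a single span covering all three terms.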
The span or query allows us to wrap other span queries and aggregate the matches of all those that we've wrapped. Similar to the span_near query, the span_or query uses the array of clauses to specify other span queries. For example, if we want to get the documents that have the term world in the first two positions of the description field, or the ones that have the term world not further than a single position from the term everyone, we will send the following query to Elasticsearch:
{ "query" : { "span_or" : { "clauses" : [ { "span_first" : { "match" : { "span_term" : { "description" : "world" } }, "end" : 2 } }, { "span_near" : { "clauses" : [ { "span_term" : { "description" : "world" } }, { "span_term" : { "description" : "everyone" } } ], "slop" : 1, "in_order" : true } } ] } } }
The result of the preceding query will return our indexed document.
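Conceptually, the span or query produces the union of the spans matched by its clauses. The following Python sketch shows that idea on hand-written spans; the span_or function is a hypothetical helper, the spans use [start, end) positions counted from 0, and removing duplicates with a set is a simplification chosen for this illustration.

```python
def span_or(*clause_spans):
    """Return the union of the spans matched by each clause, sorted."""
    merged = set()
    for spans in clause_spans:
        merged.update(spans)
    return sorted(merged)

# Assumed positions: "the"=0, "world"=1, "breaks"=2, "everyone"=3.
FIRST_MATCHES = [(1, 2)]  # world within the first two positions
NEAR_MATCHES = [(1, 4)]   # world ... everyone with a slop of 1

if __name__ == "__main__":
    print(span_or(FIRST_MATCHES, NEAR_MATCHES))
    # A document matches as soon as any single clause produces a span.
    print(span_or([], NEAR_MATCHES))
```

A document is a match when the resulting list is not empty, so it is enough for any one of the wrapped queries to match.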
The span not query allows us to specify two sections of queries. The first is the include section, which specifies the span query that should be matched, and the second is the exclude section, which specifies the span query whose matches must not overlap the ones from the include section. To keep it simple, if the query from the exclude section matches the same span (or a part of it) as the query from the include section, such a document won't be returned as a match for the span not query. Each of these sections takes a span query, which can itself be a compound one that wraps several span queries.
So, to illustrate that query, let's make a query that will return all the documents that have the span constructed from a single term and which have the term breaks in the description field. Let's also exclude the documents that have a span which matches the terms world and everyone at the maximum of a single position from each other, when such a span overlaps the one defined in the first span query:
{ "query" : { "span_not" : { "include" : { "span_term" : { "description" : "breaks" } }, "exclude" : { "span_near" : { "clauses" : [ { "span_term" : { "description" : "world" } }, { "span_term" : { "description" : "everyone" } } ], "slop" : 1 } } } } }
The following is the result:
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 0, "max_score" : null, "hits" : [ ] } }
As you may have noticed, the result of the query is what we expected. Our document wasn't found because the span query from the exclude section overlapped the span from the include section.
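The overlap rule can be written down as a short Python sketch. It is an illustrative model only: spans are [start, end) pairs with positions counted from 0, the positions in the constants are worked out by hand from our example document, and the overlaps and span_not names are hypothetical.

```python
def overlaps(a, b):
    """True when the spans [start, end) a and b share a position."""
    return a[0] < b[1] and b[0] < a[1]

def span_not(include, exclude):
    """Keep the include spans that no exclude span overlaps."""
    return [span for span in include
            if not any(overlaps(span, other) for other in exclude)]

# Assumed positions: "the"=0, "world"=1, "breaks"=2, "everyone"=3.
INCLUDE = [(2, 3)]  # the single-term span for breaks
EXCLUDE = [(1, 4)]  # the span from world to everyone (slop 1)

if __name__ == "__main__":
    # breaks lies inside the world ... everyone span, so nothing is left.
    print(span_not(INCLUDE, EXCLUDE))
    # A span outside the excluded range would survive.
    print(span_not([(12, 13)], EXCLUDE))
```

Because the span for breaks falls inside the excluded span, no include span remains, which is why the document is not returned.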
The span_within query allows us to find documents that have a span enclosed in another span. We define two sections in the span_within query: the little and the big. The little section defines a span query that needs to be enclosed by the span query defined using the big section.
For example, if we would like to find a document that has the term world near the term breaks, and those terms should be inside a span that is bound by the terms world and afterward not more than 10 terms from each other, the query that does that will look as follows:
{ "query" : { "span_within" : { "little" : { "span_near" : { "clauses" : [ { "span_term" : { "description" : "world" } }, { "span_term" : { "description" : "breaks" } } ], "slop" : 0, "in_order" : false } }, "big" : { "span_near" : { "clauses" : [ { "span_term" : { "description" : "world" } }, { "span_term" : { "description" : "afterward" } } ], "slop" : 10, "in_order" : false } } } } }
The span_containing query can be seen as the opposite of the span_within query we just discussed. It allows us to match spans that enclose other spans. Again, we use two sections with span queries: the little and the big. The little section defines a span query that needs to be enclosed by the span query defined using the big section; the difference is that span_containing returns the matching spans from the big section, while span_within returns the ones from the little section.
We can use the same example. If we would like to find a document that has the term world near the term breaks, and those terms should be inside a span that is bound by the terms world and afterward not more than 10 terms from each other, the query that does that will look as follows:
{ "query" : { "span_containing" : { "little" : { "span_near" : { "clauses" : [ { "span_term" : { "description" : "world" } }, { "span_term" : { "description" : "breaks" } } ], "slop" : 0, "in_order" : false } }, "big" : { "span_near" : { "clauses" : [ { "span_term" : { "description" : "world" } }, { "span_term" : { "description" : "afterward" } } ], "slop" : 10, "in_order" : false } } } } }
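The relationship between the two queries can be sketched in Python. This is a simplified model, not the Lucene implementation: spans are [start, end) pairs with positions counted from 0, the positions in the constants are worked out by hand from our example document, and the function names simply mirror the query names.

```python
def encloses(big, little):
    """True when the span big fully covers the span little."""
    return big[0] <= little[0] and little[1] <= big[1]

def span_within(little_spans, big_spans):
    """Return the little spans that lie inside some big span."""
    return [l for l in little_spans
            if any(encloses(b, l) for b in big_spans)]

def span_containing(little_spans, big_spans):
    """Return the big spans that cover some little span."""
    return [b for b in big_spans
            if any(encloses(b, l) for l in little_spans)]

# Assumed positions: "world"=1, "breaks"=2, "afterward"=5.
LITTLE = [(1, 3)]  # world breaks
BIG = [(1, 6)]     # world ... afterward

if __name__ == "__main__":
    print(span_within(LITTLE, BIG))      # the enclosed (little) span
    print(span_containing(LITTLE, BIG))  # the enclosing (big) span
```

Both functions accept the same input and match the same document here; they differ only in which span is reported, which matters when the query is nested inside another span query.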
The last type of span query that Elasticsearch supports is the span_multi query. It allows us to wrap any multi term query that we've discussed (the range query, the wildcard query, the regexp query, the fuzzy query, or the prefix query) as a span query.
For example, if we want to find documents that have a term starting with the prefix wor in the description field, we can do that by sending the following query:
{ "query" : { "span_multi" : { "match" : { "prefix" : { "description" : { "value" : "wor" } } } } } }
There is one thing to remember: the multi term query that we want to use needs to be enclosed in the match section of the span_multi query.
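One way to picture what the span_multi query does is to imagine the wrapped multi term query being expanded into the indexed terms it matches, with each of them becoming a span term clause. The Python sketch below models that idea for a prefix query. It assumes this kind of expansion into a span_or of span_term clauses; the expand_prefix function and the sample term set are hypothetical and used only for illustration.

```python
def expand_prefix(field, prefix, indexed_terms):
    """Build a span_or body from the indexed terms that share a prefix."""
    clauses = [{"span_term": {field: term}}
               for term in sorted(indexed_terms)
               if term.startswith(prefix)]
    return {"span_or": {"clauses": clauses}}

# A hypothetical set of terms present in the description field.
TERMS = {"the", "world", "breaks", "everyone", "work"}

if __name__ == "__main__":
    # Only the terms that begin with the prefix become clauses.
    print(expand_prefix("description", "wor", TERMS))
```

Seen this way, a prefix that matches many indexed terms produces many clauses, which is worth keeping in mind when wrapping broad prefixes or wildcards.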
A final word about span queries. Remember that they are costlier when it comes to processing power, because not only do the terms have to be matched, but their positions also have to be calculated and checked. This means that Lucene, and thus Elasticsearch, will need more CPU cycles to calculate all the information needed to find matching documents. You can expect span queries to be slower than the queries that don't take positions into account.