Using span queries

Elasticsearch exposes Lucene span queries, which let us match documents in which some tokens or phrases occur near other tokens or phrases. In other words, they are position-aware queries. The standard non-span queries are mostly not position aware; phrase queries are the exception, but they offer only limited control over positions. For those standard queries, it doesn't matter to Elasticsearch and the underlying Lucene whether a term is at the beginning of the field, at the end, or near another term. With span queries, it does matter.

The following span queries are exposed in Elasticsearch:

  • span term query
  • span first query
  • span near query
  • span or query
  • span not query
  • span within query
  • span containing query
  • span multi query

Before we continue with the description, let's index a document to a completely new index that we will use to show how span queries work. To do this, we use the following command:

curl -XPUT 'localhost:9200/spans/book/1' -d '{
 "title" : "Test book",
 "author" : "Test author",
 "description" : "The world breaks everyone, and afterward, some are strong at the broken places"
}'

A span

A span, in our context, is a pair of starting and ending token positions in a field. For example, in our case, the world breaks everyone could be a single span, and world alone can be a single span too. As you may know, during analysis Lucene stores additional information with each token, such as its position in the token stream. Position information combined with the terms allows us to construct spans using Elasticsearch span queries (which are mapped to Lucene span queries). In the next few pages, we will learn how to construct spans using different span queries and how to control which documents are matched.
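To make the idea of positions and spans concrete, here is a minimal Python sketch. It is not how Lucene stores positions; the analyze function is a hypothetical stand-in that only lowercases the text and splits it into word tokens, and spans are written as (start, end) pairs with an exclusive end position.

```python
import re

DESCRIPTION = ("The world breaks everyone, and afterward, "
               "some are strong at the broken places")

def analyze(text):
    """Rough stand-in for an analyzer: lowercase word tokens, in order."""
    return re.findall(r"\w+", text.lower())

tokens = analyze(DESCRIPTION)

# Map each term to the positions at which it occurs (0-based).
positions = {}
for pos, token in enumerate(tokens):
    positions.setdefault(token, []).append(pos)

def span_text(start, end):
    """Tokens covered by the span [start, end); the end position is exclusive."""
    return " ".join(tokens[start:end])

print(positions["world"])   # [1]
print(span_text(0, 4))      # the world breaks everyone
print(span_text(1, 2))      # world
```

The same term can occur at several positions (the appears twice in our description), which is why each term maps to a list of positions.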

Span term query

The span_term query is the basic building block for the other span queries. It is similar to the already discussed term query and, on its own, works just like it: it matches a term. Its definition is simple and looks as follows (we omitted some parts of the query on purpose, because we will discuss them later):

{
 "query" : {
  ...
  "span_term" : {
   "description" : {
    "value" : "world",
    "boost" : 5.0
   }
  }
 }
}

As you can see, it is very similar to the standard term query. The preceding query is run against the description field and returns the documents that have the term world in that field. We also specified the boost, which is allowed as well.

One thing to remember is that the span_term query, similar to the standard term query, is not analyzed.
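The practical consequence of a query not being analyzed can be sketched in a few lines of Python. This is only an illustration under an assumption: the hypothetical analyze function below lowercases tokens at index time, roughly as a standard analyzer would, while the query value is compared exactly as given.

```python
import re

DESCRIPTION = ("The world breaks everyone, and afterward, "
               "some are strong at the broken places")

def analyze(text):
    """Rough stand-in for index-time analysis: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

# Terms as they would end up in the index after analysis.
indexed_terms = set(analyze(DESCRIPTION))

def term_matches(value):
    """Exact lookup: the query value is compared as-is, with no analysis."""
    return value in indexed_terms

print(term_matches("world"))  # True
print(term_matches("World"))  # False
```

In other words, the value passed to span_term should already be in the form in which the term was indexed.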

Span first query

The span first query allows us to match documents that have matches only in the first positions of the field. To define a span first query, we nest another span query inside it; for example, the span term query we already know. So, let's find the document that has the term world in the first two positions of the description field. We do that by sending the following query:

{
 "query" : {
  "span_first" : {
   "match" : {
    "span_term" : { "description" : "world" }
   },
   "end" : 2
  }
 }
}

In the results, we will get the document that we indexed at the beginning of this section. The match section of the span first query takes a single span query, and its matches must end no later than the position specified by the end parameter.

To check our understanding: if we set the end parameter to 1, the previous query shouldn't return our document. Let's verify it by sending the following query:

{
 "query" : {
  "span_first" : {
   "match" : {
    "span_term" : { "description" : "world" }
   },
   "end" : 1
  }
 }
}

The response to the preceding query will be as follows:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

So it is working as expected. This is because the first term in the description field is the, not world, which we searched for. The term world is the second token, so its span ends beyond the limit set by an end value of 1.
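The effect of the end parameter can be reproduced with a small Python model. This is a sketch under assumptions rather than the Lucene implementation: analyze, span_term, and span_first are hypothetical helpers, positions are 0-based, and each span is a (start, end) pair with an exclusive end.

```python
import re

DESCRIPTION = ("The world breaks everyone, and afterward, "
               "some are strong at the broken places")

def analyze(text):
    """Rough stand-in for an analyzer: lowercase word tokens, in order."""
    return re.findall(r"\w+", text.lower())

def span_term(tokens, term):
    """One (start, end) span per occurrence of term; end is exclusive."""
    return [(pos, pos + 1) for pos, token in enumerate(tokens) if token == term]

def span_first(spans, end):
    """Keep only the spans that end at or before position `end`."""
    return [span for span in spans if span[1] <= end]

tokens = analyze(DESCRIPTION)
world = span_term(tokens, "world")

print(span_first(world, 2))  # [(1, 2)]
print(span_first(world, 1))  # []
```

An empty list corresponds to a document that does not match, which mirrors the zero hits we got for end set to 1.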

Span near query

The span near query allows us to match documents that have other spans near each other. We can call it a compound query, as it wraps other span queries. For example, if we want to find documents that have the term world near the term everyone, we will run the following query:

{
 "query" : {
  "span_near" : {
   "clauses" : [
    { "span_term" : { "description" : "world" } },
    { "span_term" : { "description" : "everyone" } }
   ],
   "slop" : 0,
   "in_order" : true
  }
 }
}

As you can see, we specify our queries in the clauses section of the span near query. It is an array of other span queries. The slop parameter defines the allowed number of terms between the spans. The in_order parameter can be used to limit the matches only to those documents that match our queries in the same order that they were defined in. So, in our case, we will get documents that have world everyone, but not everyone world in the description field.

Let's get back to our query: right now, it returns 0 results. If you look at our example document, you will notice that an additional term is present between the terms world and everyone, and we set the slop parameter to 0 (slop was discussed during the phrase query description). If we increase it to 1, we will get our result. To test it, let's send the following query:

{
 "query" : {
  "span_near" : {
   "clauses" : [
    { "span_term" : { "description" : "world" } },
    { "span_term" : { "description" : "everyone" } }
   ],
   "slop" : 1,
   "in_order" : true
  }
 }
}

The results returned by Elasticsearch are as follows:

{
  "took" : 6,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.10848885,
    "hits" : [ {
      "_index" : "spans",
      "_type" : "book",
      "_id" : "1",
      "_score" : 0.10848885,
      "_source" : {
        "title" : "Test book",
        "author" : "Test author",
        "description" : "The world breaks everyone, and afterward, some are strong at the broken places"
      }
    } ]
  }
}

As we can see, the altered query successfully returned our indexed document.
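The role of slop and in_order can be illustrated with a simplified Python model of an ordered span near over two clauses. The helper names are hypothetical, and the model only covers the two-clause, in_order case used in our example; it is a sketch, not the Lucene algorithm.

```python
import re

DESCRIPTION = ("The world breaks everyone, and afterward, "
               "some are strong at the broken places")

def analyze(text):
    """Rough stand-in for an analyzer: lowercase word tokens, in order."""
    return re.findall(r"\w+", text.lower())

def span_term(tokens, term):
    """One (start, end) span per occurrence of term; end is exclusive."""
    return [(pos, pos + 1) for pos, token in enumerate(tokens) if token == term]

def span_near_ordered(first, second, slop):
    """Spans where a `first` span is followed by a `second` span
    with at most `slop` positions between them."""
    matches = []
    for start1, end1 in first:
        for start2, end2 in second:
            gap = start2 - end1  # number of positions between the two spans
            if 0 <= gap <= slop:
                matches.append((start1, end2))
    return matches

tokens = analyze(DESCRIPTION)
world = span_term(tokens, "world")
everyone = span_term(tokens, "everyone")

print(span_near_ordered(world, everyone, 0))  # []
print(span_near_ordered(world, everyone, 1))  # [(1, 4)]
print(span_near_ordered(everyone, world, 1))  # []
```

The last call swaps the clauses and finds nothing, which is the behavior described for in_order set to true.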

Span or query

The span or query allows us to wrap other span queries and aggregate matches of all those that we've wrapped. Similar to the span_near query, the span_or query uses the array of clauses to specify other span queries. For example, if we want to get the documents that have the term world in the first two positions of the description field, or the ones that have the term world not further than a single position from the term everyone, we will send the following query to Elasticsearch:

{
 "query" : {
  "span_or" : {
   "clauses" : [
    {
     "span_first" : {
      "match" : {
       "span_term" : { "description" : "world" }
      },
      "end" : 2
     }
    },
    {
     "span_near" : {
      "clauses" : [
       { "span_term" : { "description" : "world" } },
       { "span_term" : { "description" : "everyone" } }
      ],
      "slop" : 1,
      "in_order" : true
     }
    }
   ]
  }
 }
}

The preceding query will return our indexed document.
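Conceptually, a span or query collects the spans matched by any of its clauses. The following Python sketch models that for our example, using hypothetical helpers and (start, end) spans with an exclusive end; it is an illustration of the idea, not the actual implementation.

```python
import re

DESCRIPTION = ("The world breaks everyone, and afterward, "
               "some are strong at the broken places")

def analyze(text):
    """Rough stand-in for an analyzer: lowercase word tokens, in order."""
    return re.findall(r"\w+", text.lower())

def span_term(tokens, term):
    """One (start, end) span per occurrence of term; end is exclusive."""
    return [(pos, pos + 1) for pos, token in enumerate(tokens) if token == term]

def span_first(spans, end):
    """Keep only the spans that end at or before position `end`."""
    return [span for span in spans if span[1] <= end]

def span_near_ordered(first, second, slop):
    """Two-clause ordered near: at most `slop` positions between the spans."""
    return [(s1, e2) for s1, e1 in first for s2, e2 in second
            if 0 <= s2 - e1 <= slop]

def span_or(*clauses):
    """Union of the spans produced by all clauses, in position order."""
    return sorted({span for clause in clauses for span in clause})

tokens = analyze(DESCRIPTION)
world = span_term(tokens, "world")
everyone = span_term(tokens, "everyone")

matches = span_or(span_first(world, 2),
                  span_near_ordered(world, everyone, 1))
print(matches)  # [(1, 2), (1, 4)]
```

A document matches as soon as at least one clause contributes a span, so here either clause alone would have been enough.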

Span not query

The span not query allows us to specify two sections of queries. The first is the include section, which specifies the span query that should be matched; the second is the exclude section, which specifies the span query whose matches must not overlap the matches of the first one. To keep it simple, if the query from the exclude section matches the same span (or a part of it) as the query from the include section, that span is discarded, and a document with no remaining spans won't be returned as a match for the span not query. Each section takes a single span query, but that query can be a compound one, such as span_near or span_or, that wraps several others.

To illustrate, let's write a query that returns all the documents that have a span consisting of the single term breaks in the description field. Let's also exclude the documents in which that span overlaps a span that matches the terms world and everyone at most a single position from each other.

{
 "query" : {
  "span_not" : {
   "include" : {
    "span_term" : { "description" : "breaks" }
   },
   "exclude" : {
    "span_near" : {
     "clauses" : [
      { "span_term" : { "description" : "world" } },
      { "span_term" : { "description" : "everyone" } }
     ],
     "slop" : 1
    }
   }
  }
 }
}

The following is the result:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

As you can see, the result is what we expected. Our document wasn't found because the span matched by the query in the exclude section overlaps the span matched by the query in the include section.
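The overlap check behind this result can be modeled in Python. The sketch below uses hypothetical helpers and assumes the near clause is evaluated in order; spans are (start, end) pairs with an exclusive end, and two spans overlap when they share at least one position.

```python
import re

DESCRIPTION = ("The world breaks everyone, and afterward, "
               "some are strong at the broken places")

def analyze(text):
    """Rough stand-in for an analyzer: lowercase word tokens, in order."""
    return re.findall(r"\w+", text.lower())

def span_term(tokens, term):
    """One (start, end) span per occurrence of term; end is exclusive."""
    return [(pos, pos + 1) for pos, token in enumerate(tokens) if token == term]

def span_near_ordered(first, second, slop):
    """Two-clause ordered near: at most `slop` positions between the spans."""
    return [(s1, e2) for s1, e1 in first for s2, e2 in second
            if 0 <= s2 - e1 <= slop]

def overlaps(a, b):
    """True when spans a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]

def span_not(include, exclude):
    """Keep the include spans that overlap no exclude span."""
    return [inc for inc in include
            if not any(overlaps(inc, exc) for exc in exclude)]

tokens = analyze(DESCRIPTION)
breaks = span_term(tokens, "breaks")
world = span_term(tokens, "world")
everyone = span_term(tokens, "everyone")

# slop 1: the exclude span (1, 4) covers "breaks" at (2, 3), so nothing is left.
print(span_not(breaks, span_near_ordered(world, everyone, 1)))  # []
# slop 0: the exclude clause matches nothing, so the include span survives.
print(span_not(breaks, span_near_ordered(world, everyone, 0)))  # [(2, 3)]
```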

Span within query

The span_within query allows us to find documents that have a span enclosed in another span. We define two sections in the span_within query: the little and the big. The little section defines a span query that needs to be enclosed by the span query defined using the big section.

For example, let's find a document that has the term world next to the term breaks, with those terms inside a span that is bounded by the terms world and afterward, which are not more than 10 terms from each other. The query that does that looks as follows:

{
 "query" : {
  "span_within" : {
   "little" : {
    "span_near" : {
     "clauses" : [
      { "span_term" : { "description" : "world" } },
      { "span_term" : { "description" : "breaks" } }
     ],
     "slop" : 0,
     "in_order" : false
    }
   },
   "big" : {
    "span_near" : {
     "clauses" : [
      { "span_term" : { "description" : "world" } },
      { "span_term" : { "description" : "afterward" } }
     ],
     "slop" : 10,
     "in_order" : false
    }
   }
  }
 }
}
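To see which spans such a query works with, here is a simplified Python model of it. The helpers are hypothetical, the unordered near handles only two clauses, and spans are (start, end) pairs with an exclusive end; treat the output as what this model produces, not as a captured Elasticsearch response.

```python
import re

DESCRIPTION = ("The world breaks everyone, and afterward, "
               "some are strong at the broken places")

def analyze(text):
    """Rough stand-in for an analyzer: lowercase word tokens, in order."""
    return re.findall(r"\w+", text.lower())

def span_term(tokens, term):
    """One (start, end) span per occurrence of term; end is exclusive."""
    return [(pos, pos + 1) for pos, token in enumerate(tokens) if token == term]

def span_near_unordered(a, b, slop):
    """Two-clause near in any order, with at most `slop` uncovered positions."""
    matches = []
    for start1, end1 in a:
        for start2, end2 in b:
            start, end = min(start1, start2), max(end1, end2)
            # Positions inside the combined span not covered by either clause.
            gap = (end - start) - (end1 - start1) - (end2 - start2)
            if 0 <= gap <= slop:
                matches.append((start, end))
    return matches

def span_within(little, big):
    """Return the little spans that are enclosed by some big span."""
    return [small for small in little
            if any(b[0] <= small[0] and small[1] <= b[1] for b in big)]

tokens = analyze(DESCRIPTION)
world = span_term(tokens, "world")
little = span_near_unordered(world, span_term(tokens, "breaks"), 0)
big = span_near_unordered(world, span_term(tokens, "afterward"), 10)

print(little)                    # [(1, 3)]
print(big)                       # [(1, 6)]
print(span_within(little, big))  # [(1, 3)]
```

In this model the little span lies inside the big one, so the sample document would be considered a match.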

Span containing query

The span_containing query can be seen as the opposite of the span_within query we just discussed. It allows us to match spans that enclose other spans. Again, we use two sections with span queries: the little and the big. The little section defines a span query whose matches need to be enclosed by the matches of the span query defined in the big section; the difference is that span_containing returns the matching spans from the big section, while span_within returns those from the little section.

We can use the same example. If we would like to find a document that has the term world near the term breaks, and those terms should be inside a span that is bound by the terms world and afterward not more than 10 terms from each other, the query that does that will look as follows:

{
 "query" : {
  "span_containing" : {
   "little" : {
    "span_near" : { 
     "clauses" : [
      { "span_term" : { "description" : "world" } },
      { "span_term" : { "description" : "breaks" } }
     ],
     "slop" : 0,
     "in_order" : false
    }
   },
   "big" : {
    "span_near" : {
     "clauses" : [
      { "span_term" : { "description" : "world" } },
      { "span_term" : { "description" : "afterward" } }
     ],
     "slop" : 10,
     "in_order" : false
    }
   }
  }
 }
}
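The difference between the two queries lies in which spans are returned, and a short Python sketch makes that visible. The span values below are assumptions: they are the (start, end) positions, with an exclusive end, that world breaks and world ... afterward would occupy in our sample description if it were split into lowercase word tokens. The function names are hypothetical.

```python
# Assumed spans for the sample description, as (start, end) with exclusive end.
little = [(1, 3)]  # "world breaks"
big = [(1, 6)]     # "world breaks everyone and afterward"

def encloses(outer, inner):
    """True when the outer span fully covers the inner span."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def span_within(little, big):
    """Return the little spans that are enclosed by some big span."""
    return [inner for inner in little
            if any(encloses(outer, inner) for outer in big)]

def span_containing(little, big):
    """Return the big spans that enclose some little span."""
    return [outer for outer in big
            if any(encloses(outer, inner) for inner in little)]

print(span_within(little, big))      # [(1, 3)]
print(span_containing(little, big))  # [(1, 6)]
```

Both queries select the same documents here; the returned span only matters when the query is nested inside another span query, such as span_first or span_not.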

Span multi query

The last type of span query that Elasticsearch supports is the span_multi query. It allows us to wrap any multi term query that we've discussed (the range query, the wildcard query, the regexp query, the fuzzy query, or the prefix query) as a span query.

For example, if we want to find documents that have a term starting with the prefix wor in the description field, we can do that by sending the following query:

{
 "query" : {
  "span_multi" : { 
   "match" : {
    "prefix" : {
     "description" : { "value" : "wor" }
    }
   }
  }
 }
}

There is one thing to remember – the multi term query that we want to use needs to be enclosed in the match section of the span_multi query.
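One way to picture a wrapped prefix query is as an expansion of the prefix into the indexed terms that start with it, followed by collecting the spans of those terms. The Python sketch below models that idea with hypothetical helpers; it is a simplification and does not reflect how the expansion is limited or scored internally.

```python
import re

DESCRIPTION = ("The world breaks everyone, and afterward, "
               "some are strong at the broken places")

def analyze(text):
    """Rough stand-in for an analyzer: lowercase word tokens, in order."""
    return re.findall(r"\w+", text.lower())

def span_term(tokens, term):
    """One (start, end) span per occurrence of term; end is exclusive."""
    return [(pos, pos + 1) for pos, token in enumerate(tokens) if token == term]

def span_multi_prefix(tokens, prefix):
    """Expand the prefix to matching terms and collect their spans."""
    terms = sorted({token for token in tokens if token.startswith(prefix)})
    spans = sorted(span for term in terms for span in span_term(tokens, term))
    return terms, spans

tokens = analyze(DESCRIPTION)

print(span_multi_prefix(tokens, "wor"))  # (['world'], [(1, 2)])
print(span_multi_prefix(tokens, "br"))   # (['breaks', 'broken'], [(2, 3), (11, 12)])
```

Because the result is a set of spans, a span_multi query can be nested inside other span queries in the same way as a span_term query.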

Performance considerations

One last remark about span queries: they are costlier in terms of processing power, because not only do the terms have to be matched, but their positions also have to be read and checked. This means that Lucene, and thus Elasticsearch, will need more CPU cycles to find matching documents. You can expect span queries to be slower than queries that don't take positions into account.
