7. Full-Text Search

Dejan Miličić¹

(1)

Novi Sad, Serbia

We already mentioned the full-text search capabilities of RavenDB in Chapter 3, and this chapter will expand on the concepts introduced there. We will show basic search capabilities, including operators, wildcards, and ranking. You will learn what lies under the hood and how RavenDB indexes process text internally to provide all these capabilities. Finally, we will demonstrate how you can take more control over the indexing process and apply advanced techniques with static indexes.

Basics of Full-Text Search

Looking at standard features in modern applications, you will quickly realize that the ability to perform a full-text search is high on the list. Almost every application needs it. With a large amount of data, the ability to search becomes crucial; information that is not readily retrievable is essentially unusable.

In previous chapters, we saw how to perform filtering based on the exact match – you would specify the property name and value, and the database would return one or more documents where a property has that value.

The beauty of full-text search is partial matching – you can search for a particular term that is part of a text. For example, you can locate all books with a title containing “London” or all products with “chocolate” in the name. You can also specify prefixes or suffixes and get all matching documents. It is possible to pass multiple terms, and the database will return a union of documents containing any of those terms.

Let’s look at various ways you can search text with RavenDB.

Single Term

As you can see in Figure 7-1, product names in the sample database consist of one or more words.

You can search for tofu products:

from "Products"

where search(Name, 'Tofu')

This query will return products with the names Tofu and Longlife Tofu. A term you search for can be at the beginning, the end, or anywhere inside the product name; RavenDB will easily match any position.

Note that the following query will produce identical results.

from "Products"

where search(Name, 'tofu')

With default settings, the full-text search feature in RavenDB is case insensitive.

Before performing a search, RavenDB will normalize the term by removing special characters and punctuation. As a result, all of these terms will produce identical results:

“tofu”
“ tofu ”
“-tofu,”
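The effect of this normalization can be approximated in a few lines of JavaScript. This is only a sketch of the idea, not RavenDB's analyzer: the helper name normalizeTerm and the exact character rules are assumptions made for illustration.

```javascript
// Hypothetical helper (not part of RavenDB): lowercases a term and strips
// everything that is not a letter or a digit, mimicking the described behavior.
function normalizeTerm(term) {
  return term
    .toLowerCase()
    .replace(/[^\p{L}\p{N}]+/gu, " ") // punctuation and special characters become spaces
    .trim();                          // surrounding whitespace is dropped
}

const variants = ["tofu", " tofu ", "-tofu,", "Tofu"];
console.log(variants.map(normalizeTerm)); // every entry normalizes to "tofu"
```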

Multiple Terms

You can search for multiple terms:

from "Products"

where search(Name, 'tofu vegie')

This query will produce results you can see in Figure 7-2.

Figure 7-2
Search Results for the Terms “tofu vegie”

This query will return all products with “tofu” or “vegie” in the name. You can expand this further, adding additional terms:

from "Products"

where search(Name, 'tofu vegie chocolade')

thus including products containing any of these terms in the Name property.

Searching over Complex Objects

RavenDB can search not just over simple text fields but also over compound ones. An employee's address is a nested JSON object, like the one in Listing 7-1.

"Address": {

"Line1": "4726 - 11th Ave. N.E.",

"Line2": null,

"City": "Seattle",

"Region": "WA",

"PostalCode": "98105",

"Country": "USA",

"Location": {

"Latitude": 47.66416419999999,

"Longitude": -122.3160148

}

}

Listing 7-1

Nested structure of the Address property of an Employee document

You can search the employee's Address property to find everyone living in Seattle:

from Employees where search(Address, "Seattle")

In this case, RavenDB correctly processed complex nested structures by flattening them and indexing all properties from various levels.
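To make the idea of flattening concrete, here is a minimal sketch that walks a nested object and collects every primitive value into a flat list that could then be tokenized. The function flattenValues is a hypothetical helper, and the way RavenDB actually flattens complex fields may differ in detail.

```javascript
// Hypothetical sketch: gather all primitive values from a nested structure.
function flattenValues(value, out = []) {
  if (value === null || value === undefined) return out; // skip nulls such as Line2
  if (Array.isArray(value)) {
    value.forEach((item) => flattenValues(item, out));
  } else if (typeof value === "object") {
    Object.values(value).forEach((item) => flattenValues(item, out));
  } else {
    out.push(String(value)); // numbers are indexed as text in this sketch
  }
  return out;
}

// The Address object from Listing 7-1.
const address = {
  Line1: "4726 - 11th Ave. N.E.",
  Line2: null,
  City: "Seattle",
  Region: "WA",
  PostalCode: "98105",
  Country: "USA",
  Location: { Latitude: 47.66416419999999, Longitude: -122.3160148 },
};

const values = flattenValues(address);
console.log(values.includes("Seattle")); // true
```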

Searching over collections is supported as well. Orders in our sample database contain a collection of order lines. You can search for all orders containing fried products with the following query:

from Orders

where search(Lines.ProductName, "Fried")

Wildcards

Partial matching of full-text search provides a way to specify a word for searching within the text. Wildcards can increase the power of partial searching – you can replace one or more letters in situations where the beginning or end of a search term is unknown.

Hence, instead of searching for all employees named Anne

from "Employees"

where search(FirstName, 'Anne')

you can use a wildcard to specify just a portion of the search term. The following query will search for any first names with the prefix An.

from "Employees"

where search(FirstName, 'An*')

Executing this query will return Andrew and Anne.

Accordingly, you can perform a suffix search:

from "Employees"

where search(LastName, '*an')

This query will return Steven Buchanan and Laura Callahan from our sample dataset.

Finally, combining the previous two approaches, we can execute an infix search:

from "Employees"

where search(FirstName, '*an*')

This infix full-text search will return Anne, Nancy, Andrew, and Janet.

When using wildcards, one more critical factor to consider is performance. Using a leading wildcard drastically slows down searching. Of course, this slowdown will not be significant on small datasets (like the sample dataset we are using). Still, it may become a factor as the number of documents in your database increases. Hence, you should evaluate every scenario that requires a leading wildcard (suffix or infix search), both for justification and for potential negative impact.

If you nevertheless determine that your application needs this type of searching, there are a couple of alternatives:

Create a static index where you would index reversed text, thus transforming leading wildcard searches (search by suffix) into trailing wildcard searches (search by prefix).
Create a static index with a non-default analyzer.

Later in this chapter, we will cover the second technique.
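The first alternative, indexing reversed text, can be sketched as follows. The names reverse and searchBySuffix are hypothetical, and the in-memory array merely stands in for index terms; the point is only that a suffix match on the original text becomes a prefix match on the reversed text.

```javascript
// Hypothetical sketch of the reversed-text trick.
const reverse = (s) => [...s].reverse().join("");

// Sample last names; at indexing time each one is stored lowercased and reversed.
const lastNames = ["Buchanan", "Callahan", "Fuller"];
const reversedTerms = lastNames.map((name) => ({
  original: name,
  reversed: reverse(name.toLowerCase()),
}));

// A suffix search for "*an" turns into a prefix search for "na*".
function searchBySuffix(suffix) {
  const prefix = reverse(suffix.toLowerCase());
  return reversedTerms
    .filter((t) => t.reversed.startsWith(prefix))
    .map((t) => t.original);
}

console.log(searchBySuffix("an")); // [ 'Buchanan', 'Callahan' ]
```

Prefix lookups can take advantage of sorted index terms, which is why this transformation tends to be cheaper than scanning for a leading wildcard.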

Suggestions

Sometimes, the search will return no results. For example, none of the products contain the word “chaig” in the name, which can be verified by executing the following query:

from Products

where search('Name', 'chaig')

In such situations, RavenDB offers the suggestions feature. You can use it by calling the suggest() function, as in Listing 7-2.

from Products

select suggest('Name', 'chaig')

Listing 7-2

Selecting suggestions

Executing this query will return suggestions shown in Figure 7-3.

Figure 7-3
Suggestions for the Word “chaig”

The suggest() function will find words similar to the term passed, based on a string distance computed between that term and the index terms. This is handy if you want to implement Google-style “Did you mean?” suggestions.
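As an illustration of distance-based suggestions, the sketch below ranks candidate words by Levenshtein (edit) distance. Levenshtein is assumed here as one common choice; the source does not say which algorithm RavenDB applies by default, and the dictionary of terms is hypothetical.

```javascript
// Classic dynamic-programming edit distance between two strings.
function levenshtein(a, b) {
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      current[j] = Math.min(
        previous[j] + 1,       // deletion
        current[j - 1] + 1,    // insertion
        previous[j - 1] + cost // substitution
      );
    }
    previous = current;
  }
  return previous[b.length];
}

// Hypothetical helper: keep words within maxDistance, closest first.
function suggest(term, dictionary, maxDistance = 2) {
  return dictionary
    .map((word) => ({ word, distance: levenshtein(term, word) }))
    .filter((s) => s.distance <= maxDistance)
    .sort((x, y) => x.distance - y.distance)
    .map((s) => s.word);
}

// Hypothetical index terms, chosen only for this example.
const indexTerms = ["chai", "chang", "chocolade", "tofu"];
console.log(suggest("chaig", indexTerms)); // [ 'chai', 'chang' ]
```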

Operators

We already showed that you can use more than one term for a full-text search, and you will get all documents satisfying one or the other. Essentially, RavenDB will implicitly apply the logical or operator. This means that the query

from Employees where search(Address, "Seattle London")

is equivalent to

from Employees where search(Address, "Seattle London", or)

and will return all employees living either in Seattle or London.

Applying the and operator

from Employees where search(Address, "Seattle London", and)

will return no results since there is no employee with both cities in the address.

You can combine these operators further:

from 'Employees'

where search(Notes, "Spanish Portuguese", and)

or search(Notes, "Manager")

This query will return employees who either are managers or speak both Spanish and Portuguese languages.
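The semantics of the two operators can be modeled over sets of tokens: or requires at least one search term to be present, while and requires all of them. The following sketch uses a hypothetical search helper and invented Notes values; it illustrates the logic only, not how RavenDB evaluates queries internally.

```javascript
// Split text into lowercase tokens on any non-alphanumeric character.
const tokenize = (text) =>
  text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);

// Hypothetical helper mirroring search(field, terms, operator).
function search(text, query, operator = "or") {
  const docTokens = new Set(tokenize(text));
  const queryTokens = tokenize(query);
  return operator === "and"
    ? queryTokens.every((t) => docTokens.has(t)) // all terms must match
    : queryTokens.some((t) => docTokens.has(t)); // any term may match
}

// Invented Notes values for three employees.
const notes = {
  a: "Fluent in Spanish and Portuguese.",
  b: "Sales Manager, reads French.",
  c: "Speaks Spanish.",
};

const matches = Object.keys(notes).filter(
  (id) =>
    search(notes[id], "Spanish Portuguese", "and") ||
    search(notes[id], "Manager")
);
console.log(matches); // [ 'a', 'b' ]
```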

What Happens Inside?

Looking at the various ways to perform partial searching demonstrated in the previous section, you might think that many complex things are going on. Even though full-text search is not a simple mechanism, most of its inner implementation concepts are relatively simple. As we already saw with filtering, ordering, and aggregations, it all comes down to using appropriate data structures. In addition, to avoid computation at query time, which is always dangerous since the query’s execution time would then depend on the dataset’s size, RavenDB prepares index entries upfront. As a result, full-text search is efficient and fast. A certain amount of work is unavoidable, but doing it at most once, ahead of time, and storing the outcome for reuse is the key to performant queries.

Text Analysis

During indexing, RavenDB will use an analyzer to separate text into segments. These segments are called tokens, and they will form index terms. Later, when you perform a full-text search, your search term is matched against tokens contained in an index. Hence, this approach will transform a partial match of a search term against text into an exact match against the collection of tokens produced during the analysis of the text. Of course, this statement simplifies the whole process but conceptually describes the matching mechanism well.
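A common data structure behind this kind of matching is an inverted index, which maps each token to the documents containing it. The sketch below is a simplified illustration under that assumption; the document ids are invented, and the actual on-disk structures used by RavenDB are not described in this chapter.

```javascript
// Hypothetical sketch of an inverted index: token -> set of document ids.
function buildInvertedIndex(docs) {
  const index = new Map();
  for (const [id, text] of Object.entries(docs)) {
    for (const token of text.toLowerCase().split(/\s+/).filter(Boolean)) {
      if (!index.has(token)) index.set(token, new Set());
      index.get(token).add(id);
    }
  }
  return index;
}

// Invented ids; the names are product names used earlier in the chapter.
const products = {
  p1: "Tofu",
  p2: "Longlife Tofu",
  p3: "Chai",
};

const index = buildInvertedIndex(products);
// "Name contains tofu" is now an exact lookup of a single token.
console.log([...index.get("tofu")]); // [ 'p1', 'p2' ]
```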

Primarily, the analyzer performs tokenization of the text. Tokenization breaks text into lexical units, also called tokens. Tokens are the shortest searchable units, and the tokenization process converts input text into a token stream.

Additionally, tokens produced by a tokenizer are passed through one or more filters. Filters will examine the token stream and may leave tokens intact, modify them, discard them, or even create new ones. Alterations may include normalizing characters to lowercase or to a version without diacritics. Punctuation and stop words like the and is are often removed, as are other unhelpful tokens that might impact the search quality.

Tokenizers and filters are usually combined into a pipeline (sometimes called a chain) where the output of one is the input for another. An analyzer is simply a term for a sequence of tokenizers and filters that takes text as input and produces a set of tokens.

Standard Analyzer

RavenDB comes with several analyzers, with Standard Analyzer as the default one. Standard Analyzer consists of Standard Tokenizer and two filters – LowercaseToken Filter and StopToken Filter.

Standard Tokenizer will perform segmentation of text by treating whitespace, newlines, punctuation, and other special characters as token boundaries. The resulting stream of tokens is passed through LowercaseToken Filter, which will normalize them to lowercase. Finally, StopToken Filter will remove English stop words from the token stream. Examples are a, the, and is – these are so-called function words, which are ambiguous or have little lexical meaning in the context of full-text search.

As an example, the following sentence

A quick Fox jumps over the lazy Dog!

passed through Standard Analyzer will be transformed into a stream of tokens:

[quick], [fox], [jumps], [over], [lazy], [dog]

This process will remove exclamation marks and common words like a and the. Additionally, all tokens are lowercased.
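The tokenizer-plus-filters pipeline can be imitated in a few lines. This is a rough sketch rather than the real Standard Analyzer: the stop-word list is a small illustrative subset, and the splitting rule is a simplification.

```javascript
// Illustrative subset of English stop words; the real list is longer.
const STOP_WORDS = new Set(["a", "an", "and", "is", "the"]);

// Tokenizer: split on anything that is not a letter or a digit.
const tokenizer = (text) => text.split(/[^\p{L}\p{N}]+/u).filter(Boolean);
// Filters: each one takes a token array and returns a new token array.
const lowercaseFilter = (tokens) => tokens.map((t) => t.toLowerCase());
const stopFilter = (tokens) => tokens.filter((t) => !STOP_WORDS.has(t));

// The analyzer chains the tokenizer with the filters, in order.
function analyze(text) {
  return [lowercaseFilter, stopFilter].reduce(
    (tokens, filter) => filter(tokens),
    tokenizer(text)
  );
}

console.log(analyze("A quick Fox jumps over the lazy Dog!"));
// [ 'quick', 'fox', 'jumps', 'over', 'lazy', 'dog' ]
```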

As we have seen in previous chapters, running various queries will trigger RavenDB to create appropriate indexes to serve those queries efficiently. Likewise, RavenDB will create an automatic full-text search index when you run a full-text search query. Standard Analyzer will be applied to one or more searched fields. Their content will be tokenized, and a set of index entries will be generated.

An interesting fact is that the analyzer is applied not only to input text during indexing but also to search terms. So, when executing the query

from "Employees"

where search(FirstName, 'Andrew Anne')

RavenDB will apply Standard Analyzer to the string “Andrew Anne”, transforming it into the token stream [andrew], [anne], and only then match these tokens against index terms. This rule has one exception: an analyzer is not applied to search terms containing wildcards, so such terms are used as written.

Besides Standard, RavenDB comes with additional analyzers:

Keyword Analyzer
LowerCase Whitespace Analyzer
NGram Analyzer
Simple Analyzer
Stop Analyzer
Whitespace Analyzer

You can use some of these to populate your full-text search index with differently shaped tokens. We will cover one such case later in this chapter.

Finally, RavenDB supports custom analyzers. Depending on the circumstances, you may have specific needs for tokenization and filtering. In such situations, you can write your own analyzer and supply it to RavenDB. A typical example is the analysis of content in different languages. As you can guess, the set of stop words is language-dependent. For example, the English word car is among the stop words in the French language. Custom analyzers are out of the scope of this book, but it is good to be aware of the highly customizable nature of RavenDB.

Ranking

For full-text search results to be helpful and for users of your application to be satisfied, the most relevant results should be ranked at the top, followed by less relevant ones. How does RavenDB determine relevancy?

Let’s look at some examples.

Running the query

from Employees where search(Address, "Seattle Redmond")

select Address.City

you will see that two employees from Seattle are positioned first, followed by a single one from Redmond. If we change the order of cities to

from Employees where search(Address, "Redmond Seattle")

select Address.City

you will notice results are sorted in reverse order compared to the previous query, following the order of search terms – first employee from Redmond and then two from Seattle.

As you can observe, this ordering is not a random one. RavenDB orders search results based on relevancy, so users are more likely to see more relevant results at the top of the list. When searching for “Redmond Seattle,” RavenDB will consider the first search term more important than the second one, so employees from Redmond will be ranked before employees from Seattle. You can introduce additional search terms as we did in Listing 7-3.

from Employees where search(Address, "London Seattle Redmond")

select Address.City

Listing 7-3

Searching for multiple terms

This query will rank London employees first and then Seattle employees, and finally, any Redmond employees will be at the bottom of the list.

For every full-text search match, RavenDB computes the indexing score and uses it to determine the ranking of the results. This score is contained in the @index-score property within the @metadata of every search result. You can inspect it by previewing any results from a query from Listing 7-3, as shown in Figure 7-4. @index-score is the last property of @metadata.

Figure 7-4
The Indexing Score Located Within the Metadata

The larger the @index-score is, the more relevant the result will be. The indexing score is exposed via the score() function. Hence, the query from Listing 7-3 is functionally equivalent to the following query:

from Employees where search(Address, "London Seattle Redmond")

order by score() desc

select Address.City

You can use the score() function to reverse ranking and show the least relevant results first if you need:

from Employees where search(Address, "London Seattle Redmond")

order by score() asc

select Address.City

RavenDB can also give you a detailed explanation of how it calculates the score. Along with the query, include the explanations() function, as in the following query:

from Employees where search(Address, "Redmond Seattle")

select Address.City

include explanations()

This time, query results are accompanied by ranking scores in an additional tab, as shown in Figure 7-5.

Figure 7-5
Explanation Column with Ranking Scores

Clicking on the Show icon will give you a detailed explanation of how the indexing score was computed, as shown in Figure 7-6.

Figure 7-6
A Detailed Explanation of the Indexing Score

This decomposition can help you determine why specific results ordering is the way it is.

Boosting

Not all search terms are created equal. Sometimes, you would like to value specific search terms more than others. Boosting is a process of altering weight factors – so you will be able to make some search terms more relevant compared to others. As we saw in a previous section, RavenDB will assign a weight factor to each word and use it to calculate the index score for every matched document.

Each search term can also be associated with a boosting factor. The higher the boost factor, the more relevant the search term will be. This feature can improve the accuracy of the results, ordering documents so that more relevant ones are at the top of the result list.

For example, you might search for companies in Paris, London, or Seattle but prioritize Paris over the two other cities and London over Seattle. To write such a query, you can leverage the boost() function, which accepts a boosting factor as its second argument, as shown in Listing 7-4.

from Companies

where

boost(search(Address.City, 'paris'), 15) or

boost(search(Address.City, 'london'), 10) or

boost(search(Address.City, 'seattle'), 5)

select Address.City

Listing 7-4

Full-text search query with boost() function applied

After executing the query from Listing 7-4, RavenDB will return companies from these three cities, ranking as in Figure 7-7.

Figure 7-7
Result Ordering Modified with boost() Function
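The effect of boosting on ordering can be illustrated with a deliberately simplified toy model in which every matching clause contributes exactly its boost factor to the score. This is not the formula RavenDB uses (the real score also depends on other factors visible in the explanations), and the list of cities is invented for the example.

```javascript
// Toy model only: each matching clause adds its boost factor to the score.
const clauses = [
  { term: "paris", boost: 15 },
  { term: "london", boost: 10 },
  { term: "seattle", boost: 5 },
];

function toyScore(city) {
  return clauses
    .filter((c) => c.term === city.toLowerCase())
    .reduce((sum, c) => sum + c.boost, 0);
}

// Invented input: cities of four hypothetical companies.
const cities = ["Seattle", "Paris", "London", "Berlin"];
const ranked = cities
  .map((city) => ({ city, score: toyScore(city) }))
  .filter((r) => r.score > 0)          // documents matching no clause are not returned
  .sort((x, y) => y.score - x.score);  // equivalent of order by score() desc

console.log(ranked.map((r) => r.city)); // [ 'Paris', 'London', 'Seattle' ]
```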

You can expand the query from Listing 7-4 with explanations, as shown in Listing 7-5.

from Companies

where

boost(search(Address.City, 'paris'), 15) or

boost(search(Address.City, 'london'), 10) or

boost(search(Address.City, 'seattle'), 5)

select Address.City

include explanations()

Listing 7-5

Boosting query with included explanations

With such an expanded query, you will have an overview of scores visible in Figure 7-8.

Figure 7-8
Explanation Tab with Computed Scores

You can click on the Show icon for a further breakdown of how these scores were calculated.

One more possibility to leverage boosting is to make specific fields more relevant. For example, we might need to locate all employees with managerial capabilities. Good candidates for searching are the Title and Notes fields. However, these two fields are substantially different – while Title contains the current position with the company, Notes contains an employee’s description that may mention various skills and previous jobs. Employees in current managerial positions are more relevant than those who held managerial functions in the past.

Hence, we can express that in a query from Listing 7-6.

from "Employees"

where boost(search(Title, 'manager'), 2)

or search(Notes, 'manager')

Listing 7-6

Boosting of Title over Notes

Results of execution are visible in Figure 7-9.

Figure 7-9
Results of Boosting Title over Notes

As you can see, Steven is ranked higher than Andrew since he has the word Manager in the Title field. Using this approach, you can fine-tune ranking and provide better accuracy.

Static Index: One Field

We can take more control over the indexing process by switching from automatic indexes to static ones. You already learned how to create static indexes in Chapters 5 and 6. Listing 7-7 shows a simple index for searching over product names.

map("Products", function(product) {

return {

Name: product.Name

}

})

Listing 7-7

Products/ByName index

Before saving the definition of this new index, there is one more step you need to take: marking the index field Name as a full-text search field rather than an ordinary one.

Click on the Add field button. The field definition panel will open. Enter Name as the name of the field, and switch the Full-Text Search property to Yes. As you can see in Figure 7-10, among the Advanced options, there is an Analyzer property that contains Standard Analyzer as the default analyzer for full-text search fields.

After saving the definition of this new index, RavenDB will process all documents from the Products collection, extract the value of their Name property, and apply Standard Analyzer to tokenize product names. You can see index terms in Figure 7-11.

Figure 7-11
Standard Analyzer Index Terms

Looking at raw index entries for products, you can observe an array of tokens generated for every product by Standard Analyzer, as shown in Figure 7-12.

We can now search for all lager beers:

from index 'Products/ByName'

where search(Name, "lager")

all product names starting with cha:

from index 'Products/ByName'

where search(Name, "cha*")

or product names where any of the words ends with ra:

from index 'Products/ByName'

where search(Name, "*ra")

Static Index: Different Analyzers

In one of the previous sections, we mentioned that using wildcard prefixes can have a significant impact on performance. The static index gives us total flexibility, so we can configure the indexing process to overcome challenges like this.

One possible solution would be to alter the way index terms are calculated. We can change tokenization - instead of Standard Analyzer, we can use NGram Analyzer.

But what does the word NGram mean? An NGram tokenizer is a sliding window that moves across the text and produces tokens of the specified length. For example, the word “brown” can be split into the following 3-grams: [bro], [row], [own]. Similarly, applying 2-gram tokenization to “jumps” results in the stream [ju], [um], [mp], [ps].
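The sliding window itself is straightforward to express in code. The function below is a minimal sketch of the idea, with ngrams being a hypothetical helper name rather than anything exposed by RavenDB.

```javascript
// Slide a window of length n across the word and collect each slice.
function ngrams(word, n) {
  const grams = [];
  for (let i = 0; i + n <= word.length; i++) {
    grams.push(word.slice(i, i + n));
  }
  return grams;
}

console.log(ngrams("brown", 3)); // [ 'bro', 'row', 'own' ]
console.log(ngrams("jumps", 2)); // [ 'ju', 'um', 'mp', 'ps' ]
```

A word shorter than the window size produces no tokens, which is why short words contribute only the smaller n-grams.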

The NGram tokenizer available in RavenDB will slide windows of sizes 2, 3, 4, 5, and 6 to produce tokens of various sizes. Along with this tokenizer, the NGram analyzer will use StopWords and Lowercase filters.

For example, for the sentence

The quick brown fox jumped over the lazy dogs.

the NGram analyzer will produce the following 3-grams:

[azy] [bro] [dog] [fox] [ick] [jum] [laz] [mpe] [ogs] [ove] [own] [ped] [qui] [row] [uic] [ump] [ver]

Both occurrences of “the” are eliminated by the StopWords filter, and the full stop is dropped as well.

You can now open the index Products/ByName defined in the previous section and change this index’s analyzer. Scroll down to Fields and locate settings for the Name field. Click on Standard Analyzer, and you will be presented with a predefined list of analyzers, as shown in Figure 7-13.

After selecting NGram Analyzer, save the redefined index.

Since you changed the definition of the index, RavenDB will build it from scratch by reading all documents from the Products collection, passing them through NGram Analyzer, and populating index terms. Figure 7-14 shows how index terms look now.

These index terms are different compared with tokens generated by Standard Analyzer. Index terms produced by the NGram analyzer are the union of 2-grams, 3-grams, 4-grams, 5-grams, and 6-grams over product names.

Finally, with the updated version of the Products/ByName index, we can replace the wildcard query:

from index 'Products/ByName'

where search(Name, '*late*')

with a wildcard-free variant

from index 'Products/ByName'

where search(Name, 'late')

which will return the same result set but without any performance penalties.
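Why the wildcard-free query works can be seen by generating the union of 2-grams through 6-grams for a product name and checking that the search term is one of them. The sketch below assumes a product named “Teatime Chocolate Biscuits” and omits stop-word filtering for brevity; ngramTerms is a hypothetical helper.

```javascript
// Build the set of all n-grams (sizes minSize..maxSize) for every word in the text.
function ngramTerms(text, minSize = 2, maxSize = 6) {
  const terms = new Set();
  const words = text.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
  for (const word of words) {
    for (let n = minSize; n <= maxSize; n++) {
      for (let i = 0; i + n <= word.length; i++) {
        terms.add(word.slice(i, i + n));
      }
    }
  }
  return terms;
}

// Assumed sample product name.
const nameTerms = ngramTerms("Teatime Chocolate Biscuits");

// The infix "late" is itself an index term, so no wildcard is needed.
console.log(nameTerms.has("late")); // true
console.log(nameTerms.has("tofu")); // false
```

The trade-off is index size: every name now contributes many more terms than it would with Standard Analyzer.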

Static Index: Multiple Fields

We defined and configured an index covering just one field in the previous two sections. This section will show how you can expand it to process multiple fields from documents belonging to one or more collections.

Indexing Property from Multiple Collections

We can expand Products/ByName into an index that will provide a way to search not only products by name but also Suppliers and Companies by name. This way, we will create one index that will cover the content of the documents from three collections.

Start by cloning the Products/ByName index – click the Clone button and change the name to Search/ByName.

Add the following two maps:

map("Companies", function(company) {

return {

Name: company.Name

}

})

and

map("Suppliers", function(supplier) {

return {

Name: supplier.Name

}

})

After you are done, the index definition will look as in Figure 7-15.

Since we started by cloning the existing index Products/ByName, your newly created index supports full-text indexing on the Name field out of the box.

Now you can execute search queries that will span content from three collections.

Searching for

from index 'Search/ByName'

where search(Name, "cho")

will return products and companies with the trigram cho in the Name, as shown in Figure 7-16.

Figure 7-16
Search Results Consisting of Products and Companies

Similarly, the following query

from index 'Search/ByName'

where search(Name, "ost")

will return products, companies, and suppliers with the 3-gram ost in the Name field.

Indexing Multiple Fields from a Single Collection

It can be handy to provide a way to index (and later search) multiple properties of the documents from the same collection. When coding static indexes, you can construct an array that will hold several values. RavenDB will correctly process such arrays, creating multiple index terms per document.

Listing 7-8 shows an implementation of one such index. Do not forget to configure the Query field as a full-text search field.

map("Employees", function(employee) {

return {

Query: [employee.FirstName, employee.LastName]

}

})

Listing 7-8

Employees/ByFirstNameByLastName index

Figure 7-17 shows that first and last names are extracted and inserted as index terms.

Figure 7-17
Indexing Terms Consisting of First and Last Names of Employees

With this index, you are now able to search by the first name:

from index 'Employees/ByFirstNameByLastName'

where search(Query, "Andrew")

or by the last name:

from index 'Employees/ByFirstNameByLastName'

where search(Query, "fuller")

Indexing Multiple Fields from Multiple Collections

The approach demonstrated in the previous section can be expanded further. Since you can load referenced documents during the indexing process, you can collect information from different collections to offer omni-search capabilities.

We could use arrays here to pack multiple properties together, but we will show an alternative approach, where we declare a JavaScript array and populate it from the code. Listing 7-9 shows the definition of the Orders/Search index. Create it in your database, and do not forget to mark the Query field as a full-text search field.

map("Orders", function(order) {

var query = [];

var company = load(order.Company, 'Companies');

query.push(company.Name);

var employee = load(order.Employee, 'Employees');

query.push(employee.FirstName, employee.LastName);

order.Lines.forEach(line => {

var product = load(line.Product, 'Products');

query.push(product.Name);

var supplier = load(product.Supplier, 'Suppliers');

query.push(supplier.Name);

})

return { Query: query }

})

Listing 7-9

Orders/Search index

Let’s analyze this index.

The first line specifies the collection that will be processed – Orders. In the second line of the index definition, we are declaring an empty array:

var query = [];

which will be populated by the code. At the very end, our JavaScript code will return a field named Query with this very same array as its content:

return { Query: query }

Furthermore, the referenced company is loaded for every order processed, and its name is added to the query array. The same is done for the referenced employee, whose first and last names are both added. Finally, we iterate over order lines, load the product for each line, and load the supplier for every product, adding their names to the array.

So, looking at the nesting of indexing levels, we have the following structure:

Order
- Company
- Employee
- OrderLine
  - Product
    - Supplier

We are descending three levels of references, loading referenced documents, and indexing their properties. Indexing terms will include names of companies, employees, products, and suppliers. With this index, you can search orders by various criteria.
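To see what this map function produces, you can run the same logic outside the database against an in-memory stand-in. In the sketch below, the db object, the load stub, all ids, and the company and supplier names are invented; unlike Listing 7-9, it also guards against references that cannot be resolved, which is an added precaution rather than something the listing does.

```javascript
// Hypothetical in-memory stand-in for the database; ids and names are invented.
const db = {
  "companies/1": { Name: "Example Traders" },
  "employees/1": { FirstName: "Nancy", LastName: "Example" },
  "products/1": { Name: "Tofu", Supplier: "suppliers/1" },
  "suppliers/1": { Name: "Example Foods" },
};

// Stub for load(id, collection); the collection argument is ignored here.
const load = (id, collection) => db[id] || null;

function mapOrder(order) {
  const query = [];
  const company = load(order.Company, "Companies");
  if (company) query.push(company.Name);
  const employee = load(order.Employee, "Employees");
  if (employee) query.push(employee.FirstName, employee.LastName);
  for (const line of order.Lines) {
    const product = load(line.Product, "Products");
    if (!product) continue; // skip lines whose product cannot be resolved
    query.push(product.Name);
    const supplier = load(product.Supplier, "Suppliers");
    if (supplier) query.push(supplier.Name);
  }
  return { Query: query };
}

const entry = mapOrder({
  Company: "companies/1",
  Employee: "employees/1",
  Lines: [{ Product: "products/1" }, { Product: "products/404" }],
});
console.log(entry.Query);
// [ 'Example Traders', 'Nancy', 'Example', 'Tofu', 'Example Foods' ]
```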

For example, you can search for all orders created by Nancy:

from index 'Orders/Search'

where search(Query, "nancy")

All orders containing tofu can be fetched by executing the following query:

from index 'Orders/Search'

where search(Query, "tofu")

To see all orders related to the supplier Lyngbysild, run:

from index 'Orders/Search'

where search(Query, "Lyngbysild")

Hence, you now have a single index that can serve queries by various criteria.

Summary

In this chapter, we covered the full-text search features of RavenDB. Besides introducing essential elements, we explained the inner workings of full-text search indexes. Finally, you learned about advanced indexing options and how to take more control over indexing with manually written static indexes.
