Handling files

The next use case we will discuss is searching the contents of files. The most obvious method is adding logic to an application that will be responsible for fetching files, extracting valuable information from them, building JSON objects, and indexing them in ElasticSearch.

Of course, the previously mentioned method is valid and you can go this way, but there is another way we would like to show you. We can send whole documents to ElasticSearch and let it handle content extraction and indexing. This requires us to install an additional plugin. Note that we will describe plugins in Chapter 7, Administrating Your Cluster, so we'll skip the detailed description here. For now, just run the following command to install the attachments plugin:

bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/1.6.0

After restarting ElasticSearch, it miraculously gains new skills!

Let's begin with preparing a new index with the following mappings:

{
 "mappings" : {
  "file" : {
   "properties" : {
    "note" : { "type" : "string", "store" : "yes" },
    "book" : { 
     "type" : "attachment",
     "fields" : {
      "file" : { "store" : "yes", "index" : "analyzed" },
      "date" : { "store" : "yes" },
      "author" : { "store" : "yes" },
      "keywords" : { "store" : "yes" },
      "content_type" : { "store" : "yes" },
      "title" : { "store" : "yes" }
     }
    }
   }
  }
 } 
}

As we can see, we have the file type with the book field, which we will use to store the contents of our file. In addition to that, we've defined some nested fields as follows:

  • file: The file content itself
  • date: The file creation date
  • author: The author of the file
  • keywords: The additional keywords connected with the document
  • content_type: The MIME type of the document
  • title: The title of the document

These fields will be extracted from files, if available. In our example, we marked all fields as stored; this allows us to see their values in the search results. In addition, we defined the note field. This is an ordinary field, which will be used not only by the plugin but by us as well.
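Before indexing anything, this mapping has to be sent to ElasticSearch when the index is created. As a sketch, assuming Python with only the standard library (the mapping.json file name and the curl call in the comment are our choices, not part of the example above), the request body can be built like this:

```python
import json

# The same mapping as above: an ordinary "note" field plus the
# "book" attachment field with its stored sub-fields.
mapping = {
    "mappings": {
        "file": {
            "properties": {
                "note": {"type": "string", "store": "yes"},
                "book": {
                    "type": "attachment",
                    "fields": {
                        "file": {"store": "yes", "index": "analyzed"},
                        "date": {"store": "yes"},
                        "author": {"store": "yes"},
                        "keywords": {"store": "yes"},
                        "content_type": {"store": "yes"},
                        "title": {"store": "yes"},
                    },
                },
            }
        }
    }
}

# Serialize the mapping; the result is the request body for, e.g.,
#   curl -XPUT 'localhost:9200/media' -d @mapping.json
body = json.dumps(mapping, indent=1)
```

Sending this body while creating the media index gives us the mapping used in the rest of the example.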

Now we should prepare our document. Look at the example placed in the index.json file:

{
 "book" : "UEsDBBQABgAIAAAAIQDpURCwjQEAAMIFAAATAAgCW0NvbnRlbnRfVHlwZXNdLnhtbCCiBAIooAA…",
 "note" : "just a note"
}

As you can see, we have some strange content in the book field. This is the content of the file encoded with the base64 algorithm (please note that this is only a small part of it; for clarity we omitted the rest of this field). Because file contents can be binary and thus cannot be easily included in the JSON structure, ElasticSearch requires us to encode them with the mentioned algorithm. On the Linux operating system, the base64 command from GNU coreutils does the job; the -w 0 switch disables line wrapping, so the output forms the single long string that the JSON field needs:

base64 -w 0 example.docx > example.docx.base64
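If you prefer to prepare the whole index.json body programmatically, the encoding step can be sketched in Python with only the standard library (the build_attachment_doc helper name is ours):

```python
import base64
import json

def build_attachment_doc(path, note):
    """Read a file, base64-encode its bytes, and return the JSON
    document body expected by our attachment mapping."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({"book": encoded, "note": note})

# For example, to produce index.json for example.docx:
#   with open("index.json", "w") as out:
#       out.write(build_attachment_doc("example.docx", "just a note"))
```

base64-encoding the raw bytes is exactly what the command-line step above does; the helper merely wraps the result in the JSON structure in one go.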

We will assume that you successfully created a proper base64 version of our document. Now we can index this document by running the following command:

curl -XPUT 'localhost:9200/media/file/1?pretty' -d @index.json

It was simple. In the background, ElasticSearch decoded the file, extracted its contents and created proper entries in the index. Now, let's create the query (we've placed it in the query.json file):

{
  "fields" : ["title", "author", "date", "keywords", "content_type", "note"],
  "query" : {
    "term" : { "book" : "example" }
  }
}

If you have read the previous chapters carefully, the preceding query should be simple to understand: we search for the word example in the book field. Our example document contains the text This is an example document for "ElasticSearch Server" book, so this document should be found. In addition, we requested all the stored fields to be returned in the results. Let's execute our query:

curl -XGET 'localhost:9200/media/_search?pretty' -d @query.json

If everything goes well, we should see something like the following:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.13424811,
    "hits" : [ {
      "_index" : "media",
      "_type" : "file",
      "_id" : "1",
      "_score" : 0.13424811,
      "fields" : {
        "book.content_type" : "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "book.title" : "ElasticSearch Server",
        "book.author" : " Rafał Kuć, Marek Rogoziński",
        "book.keywords" : "ElasticSearch, search, book",
        "book.date" : "2012-10-08T17:54:00.000Z",
        "note" : "just a note"
      }
    } ]
  }
}

Looking at the result, you can see the content type application/vnd.openxmlformats-officedocument.wordprocessingml.document. You can guess that our document was created in Microsoft Office and probably had the .docx extension. We can also see additional fields, such as the author and the date, extracted from the document. And again, everything works!
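To consume such a response programmatically, here is a small Python sketch (standard library only) that parses a trimmed copy of the response above and pulls out the flattened stored fields:

```python
import json

# A trimmed version of the search response shown above.
response = json.loads("""
{
  "hits" : {
    "total" : 1,
    "hits" : [ {
      "_id" : "1",
      "fields" : {
        "book.content_type" : "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "book.title" : "ElasticSearch Server",
        "note" : "just a note"
      }
    } ]
  }
}
""")

# The stored fields requested in the query come back flattened,
# prefixed with the name of the attachment field ("book.").
for hit in response["hits"]["hits"]:
    fields = hit["fields"]
    print(fields["book.title"], "-", fields["book.content_type"])
```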

Additional information about a file

When we are indexing files, an obvious requirement is the ability to return the filename in the result list. Of course, we can add the filename as another field in the document, but ElasticSearch allows us to store this information within the file object. We can just add the _name field to the document in the following manner:

{
 "book" : "UEsDBBQABgAIAAAAIQDpURCwjQEAAMIFAAATAAgCW0NvbnRlbnRfVHlwZXNdLnhtbCCiBAIooAA…",
 "_name" : "example.docx",
 "note" : "just a note"
}

Thanks to this, the filename will be available in the result list as a part of the _source field. But if you use the fields option in the query, don't forget to add _source to that array.

And finally, you can use the content_type field to provide information about the MIME type, just as you use the _name field for the filename.
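Putting these two notes together, a hypothetical Python sketch (the query and hit shapes mirror the examples above) of reading the filename back from a hit when the fields option is used:

```python
# A query using the fields option; _source is added to the array so
# that the filename stored in the "_name" field comes back with hits.
query = {
    "fields": ["title", "note", "_source"],
    "query": {"term": {"book": "example"}},
}

# A hit shaped like the responses above would then carry the source
# document, from which the filename can be read:
hit = {
    "_id": "1",
    "_source": {"_name": "example.docx", "note": "just a note"},
}
filename = hit["_source"]["_name"]
```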
