Handling files

The next use case we will discuss is searching the contents of files. The most obvious method is adding logic to an application that will be responsible for fetching files, extracting valuable information from them, building JSON objects, and indexing them in ElasticSearch.

Of course, the previously mentioned method is valid and you can go this way, but there is another way we would like to show you. We can send whole documents to ElasticSearch and let it handle content extraction and indexing. This requires us to install an additional plugin. Note that we will describe plugins in Chapter 7, Administrating Your Cluster, so we'll skip the detailed description here. For now, just run the following command to install the attachments plugin:

bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/1.6.0

After restarting ElasticSearch, it miraculously gains new skills!

Let's begin with preparing a new index with the following mappings:

{
 "mappings" : {
  "file" : {
   "properties" : {
    "note" : { "type" : "string", "store" : "yes" },
    "book" : { 
     "type" : "attachment",
     "fields" : {
      "file" : { "store" : "yes", "index" : "analyzed" },
      "date" : { "store" : "yes" },
      "author" : { "store" : "yes" },
      "keywords" : { "store" : "yes" },
      "content_type" : { "store" : "yes" },
      "title" : { "store" : "yes" }
     }
    }
   }
  }
 } 
}

As we can see, we have the file type with the book field, which we will use to store the contents of our file. In addition to that, we've defined some nested fields as follows:

  • file: The file content itself
  • date: The file creation date
  • author: The author of the file
  • keywords: The additional keywords connected with the document
  • content_type: The MIME type of the document
  • title: The title of the document

These fields will be extracted from files, if available. In our example, we marked all fields as stored; this allows us to see their values in the search results. In addition, we defined the note field. This is an ordinary field, which will be used not only by the plugin but by us as well.
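Before indexing anything, this mapping has to be sent to ElasticSearch when the index is created. As a sketch, assuming Python with only the standard library (the mapping.json file name and the curl call in the comment are our choices, not part of the example above), the request body can be built like this:

```python
import json

# The same mapping as above: an ordinary "note" field plus the
# "book" attachment field with its stored sub-fields.
mapping = {
    "mappings": {
        "file": {
            "properties": {
                "note": {"type": "string", "store": "yes"},
                "book": {
                    "type": "attachment",
                    "fields": {
                        "file": {"store": "yes", "index": "analyzed"},
                        "date": {"store": "yes"},
                        "author": {"store": "yes"},
                        "keywords": {"store": "yes"},
                        "content_type": {"store": "yes"},
                        "title": {"store": "yes"},
                    },
                },
            }
        }
    }
}

# Serialize the mapping; the result is the request body for, e.g.,
#   curl -XPUT 'localhost:9200/media' -d @mapping.json
body = json.dumps(mapping, indent=1)
```

Sending this body while creating the media index gives us the mapping used in the rest of the example.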

Now we should prepare our document. Look at the example placed in the index.json file:

{
 "book" : "UEsDBBQABgAIAAAAIQDpURCwjQEAAMIFAAATAAgCW0NvbnRlbnRfVHlwZXNdLnhtbCCiBAIooAA…",
 "note" : "just a note"
}

As you can see, we have some strange content in the book field. This is the content of the file encoded with the base64 algorithm (please note that this is only a small part of it; for clarity we omitted the rest of this field). Because file contents can be binary and thus cannot be easily included in the JSON structure, ElasticSearch requires us to encode them with the mentioned algorithm. On the Linux operating system, the base64 command from GNU coreutils does the job; the -w 0 switch disables line wrapping, so the output forms the single long string that the JSON field needs:

base64 -w 0 example.docx > example.docx.base64
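If you prefer to prepare the whole index.json body programmatically, the encoding step can be sketched in Python with only the standard library (the build_attachment_doc helper name is ours):

```python
import base64
import json

def build_attachment_doc(path, note):
    """Read a file, base64-encode its bytes, and return the JSON
    document body expected by our attachment mapping."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return json.dumps({"book": encoded, "note": note})

# For example, to produce index.json for example.docx:
#   with open("index.json", "w") as out:
#       out.write(build_attachment_doc("example.docx", "just a note"))
```

base64-encoding the raw bytes is exactly what the command-line step above does; the helper merely wraps the result in the JSON structure in one go.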

We will assume that you successfully created a proper base64 version of our document. Now we can index this document by running the following command:

curl -XPUT 'localhost:9200/media/file/1?pretty' -d @index.json

It was simple. In the background, ElasticSearch decoded the file, extracted its contents and created proper entries in the index. Now, let's create the query (we've placed it in the query.json file):

{
  "fields" : ["title", "author", "date", "keywords", "content_type", "note"],
  "query" : {
    "term" : { "book" : "example" }
  }
}

If you have read the previous chapters carefully, the preceding query should be simple to understand: we search for the word example in the book field. Our example document contains the text This is an example document for "ElasticSearch Server" book, so this document should be found. In addition, we requested all the stored fields to be returned in the results. Let's execute our query:

curl -XGET 'localhost:9200/media/_search?pretty' -d @query.json

If everything goes well, we should see something like the following:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.13424811,
    "hits" : [ {
      "_index" : "media",
      "_type" : "file",
      "_id" : "1",
      "_score" : 0.13424811,
      "fields" : {
        "book.content_type" : "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "book.title" : "ElasticSearch Server",
        "book.author" : " Rafał Kuć, Marek Rogoziński",
        "book.keywords" : "ElasticSearch, search, book",
        "book.date" : "2012-10-08T17:54:00.000Z",
        "note" : "just a note"
      }
    } ]
  }
}

Looking at the result, you can see the content type application/vnd.openxmlformats-officedocument.wordprocessingml.document. You can guess that our document was created in Microsoft Office and probably had the .docx extension. We can also see additional fields, such as the author and the date, extracted from the document. And again, everything works!
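To consume such a response programmatically, here is a small Python sketch (standard library only) that parses a trimmed copy of the response above and pulls out the flattened stored fields:

```python
import json

# A trimmed version of the search response shown above.
response = json.loads("""
{
  "hits" : {
    "total" : 1,
    "hits" : [ {
      "_id" : "1",
      "fields" : {
        "book.content_type" : "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "book.title" : "ElasticSearch Server",
        "note" : "just a note"
      }
    } ]
  }
}
""")

# The stored fields requested in the query come back flattened,
# prefixed with the name of the attachment field ("book.").
for hit in response["hits"]["hits"]:
    fields = hit["fields"]
    print(fields["book.title"], "-", fields["book.content_type"])
```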

Additional information about a file

When we are indexing files, an obvious requirement is the ability to return the filename in the result list. Of course, we can add the filename as another field in the document, but ElasticSearch allows us to store this information within the file object. We can just add the _name field to the document in the following manner:

{
 "book" : "UEsDBBQABgAIAAAAIQDpURCwjQEAAMIFAAATAAgCW0NvbnRlbnRfVHlwZXNdLnhtbCCiBAIooAA…",
 "_name" : "example.docx",
 "note" : "just a note"
}

Thanks to this, the filename will be available in the result list as a part of the _source field. But if you use the fields option in the query, don't forget to add _source to that array.

And finally, you can use the content_type field to provide information about the MIME type, just as you use the _name field for the filename.
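Putting these two notes together, a hypothetical Python sketch (the query and hit shapes mirror the examples above) of reading the filename back from a hit when the fields option is used:

```python
# A query using the fields option; _source is added to the array so
# that the filename stored in the "_name" field comes back with hits.
query = {
    "fields": ["title", "note", "_source"],
    "query": {"term": {"book": "example"}},
}

# A hit shaped like the responses above would then carry the source
# document, from which the filename can be read:
hit = {
    "_id": "1",
    "_source": {"_name": "example.docx", "note": "just a note"},
}
filename = hit["_source"]["_name"]
```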
