The content analytics architecture

Before getting started with any new tool, it's wise to spend some time understanding how the tool is put together and how it works. In this section, we'll go over the general architecture of IBM Watson, provide a short description of each architectural component, and describe the flow between those components.

The main components

Watson is built using a robust content analytics architecture made up of the following components:

  • Crawlers
  • Document processors
  • Indexers
  • A runtime search engine
  • A content analytics miner
  • An administration console

Crawlers

As the name implies, crawlers crawl through what is known as a crawl space (one or more defined data sources) and extract content from those sources. A Watson administrator can define rules that direct crawling behavior, that is, how much of the system's resources a crawler may consume and which data sources it crawls.

You can start and stop crawlers manually, or you can schedule a crawler by specifying when it should run for the first time and at what interval it should run thereafter.
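
In practice, crawler settings are defined through the administration console rather than in code, but the following sketch illustrates the idea of a crawler definition: a crawl space, a resource limit, and a schedule with a first run time and a repeat interval. All names and values here are illustrative assumptions, not actual Watson APIs.

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class CrawlerDefinition:
        """Illustrative model of a crawler: what it reads, how hard it may
        work, and when it runs. Not an actual Watson API."""
        name: str
        crawl_space: list                    # the data sources the crawler reads
        max_threads: int = 2                 # limits impact on system resources
        first_run: datetime = datetime(2024, 1, 15, 1, 0)
        repeat_every: timedelta = timedelta(hours=24)

        def next_run_after(self, last_run: datetime) -> datetime:
            """A scheduled crawler runs at a fixed interval after its first run."""
            return last_run + self.repeat_every

    # Example: crawl two file shares once a day, starting at 01:00.
    crawler = CrawlerDefinition(
        name="claims-crawler",
        crawl_space=["/data/claims/2023", "/data/claims/2024"],
    )
    print(crawler.next_run_after(crawler.first_run))   # 2024-01-16 01:00:00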

Document processors

The document processor component processes crawled documents and prepares them for indexing by applying various text analytics to them. At a high level, each text analytic applied to a document annotates the document with additional information (inferred from the document), which describes what the document contains and makes it more valuable and easier to index.

The process of applying these text analytics can be thought of as a pipeline: crawled documents enter, they are parsed, and then a prearranged sequence of (text analytics) annotators processes each document and extracts what is needed from it.
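
Conceptually, that pipeline is a parse step followed by a fixed sequence of annotators, each of which adds information to the document. The sketch below shows the shape of such a pipeline; the annotators (language detection, keyword tagging) and data structures are illustrative assumptions, not Watson's actual annotator set.

    # A minimal sketch of a document-processing pipeline: parse, then run a
    # prearranged sequence of annotators that each add information to the document.
    def parse(raw_text: str) -> dict:
        """Turn crawled content into a structured document."""
        return {"text": raw_text, "annotations": {}}

    def detect_language(doc: dict) -> dict:
        # Illustrative annotator: a real one would infer this from the text.
        doc["annotations"]["language"] = "en"
        return doc

    def tag_keywords(doc: dict) -> dict:
        # Illustrative annotator: mark capitalized words as candidate keywords.
        doc["annotations"]["keywords"] = [w for w in doc["text"].split() if w.istitle()]
        return doc

    ANNOTATORS = [detect_language, tag_keywords]   # the prearranged pipeline

    def process(raw_text: str) -> dict:
        doc = parse(raw_text)
        for annotate in ANNOTATORS:
            doc = annotate(doc)
        return doc                                  # ready for the indexer

    print(process("Watson analyzes Medical claims from Ohio"))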

Indexers

The indexer component takes the parsed (or annotated) documents and builds an index on your content to improve performance during text mining and analysis. Once you start an indexer, it will automatically index each document (after the document is processed by the document processors).

Note that changes made to crawled documents can be included in the index either by manually rebuilding the index or by setting options on the collection so that changes are automatically retrieved and incorporated into the index.
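
As a rough sketch of what indexing buys you, the following toy example builds an inverted index over processed documents and adds a changed document incrementally, which is the effect of letting the collection pick up changes automatically rather than rebuilding. Watson's real index is far richer (it also stores the annotations added by the document processors); this is only an illustration.

    from collections import defaultdict

    # A toy inverted index: maps each term to the set of document IDs that
    # contain it, so searches don't have to scan every document.
    index = defaultdict(set)

    def index_document(doc_id: str, text: str) -> None:
        """Add (or re-add) one processed document to the index incrementally."""
        for term in text.lower().split():
            index[term].add(doc_id)

    index_document("doc-1", "engine overheating reported by customer")
    index_document("doc-2", "customer reported brake noise")

    # A changed document can be re-indexed without rebuilding everything.
    index_document("doc-2", "customer reported brake noise and vibration")

    print(sorted(index["customer"]))   # ['doc-1', 'doc-2']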

Search engine

The search engine is a server-based component that services all user search and analytics requests. The content analytics miner (explained in the next section) is an example of a client application that makes requests to the search engine. Depending on various factors, such as the size and user base of an environment, more than one search engine may be used.
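
At runtime, the search engine's job is to evaluate a client's query against the collection's index and return the matching documents. The sketch below shows that lookup against a toy inverted index like the one above; the index contents and matching logic are illustrative assumptions, not the product's query semantics.

    # A minimal sketch of what the runtime search engine does with a client
    # query: look each term up in the index and intersect the results.
    INDEX = {
        "brake":    {"doc-2", "doc-7"},
        "noise":    {"doc-2"},
        "customer": {"doc-1", "doc-2", "doc-7"},
    }

    def search(query: str) -> set:
        """Return the IDs of documents that match every term in the query."""
        term_sets = [INDEX.get(term, set()) for term in query.lower().split()]
        if not term_sets:
            return set()
        result = term_sets[0]
        for s in term_sets[1:]:
            result = result & s
        return result

    print(search("customer brake noise"))   # {'doc-2'}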

Miner (content analytics)

The content analytics miner is a browser-based interface used to perform content analysis. Requests made in the miner are sent to the search engine, which carries them out against your analytics collections.

Administration console

The administration console (like the miner) is a browser-based component that is used to administer your collections, monitor system activities and logs, and set up users, the search engine, and the miner.

The flow of data

IBM Watson uses highly indexed document content, called collections, for text mining and content analysis. Administrators create, configure, and manage the content analytics collections so that analysts can use the miner to analyze the data in those collections. The flow through the components described earlier in this chapter, from the creation of a collection to the point where its content is available for analysis, is as follows:

  1. A collection is created.
  2. Crawlers are configured for the collection.
  3. Crawlers routinely extract data from the defined data sources and store the data as documents (in a cache on the disk).
  4. All documents are read by the document processors.
  5. The document processors run text analytics against the content to prepare the content for indexing.
  6. Documents are saved in the content collection.

Steps 3 to 6 run continuously until all the documents read by the crawlers have been stored in the index of the specified collection.

Once you have built a content collection, it is immediately available for analysis.
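
The whole flow can be pictured as one loop per crawl cycle: crawl, cache, process, index. The sketch below strings steps 3 to 6 together; every function is a stand-in for the corresponding Watson component, not an actual API.

    # An end-to-end sketch of steps 3 to 6: crawlers cache raw documents,
    # document processors annotate them, and the indexer adds them to the
    # collection. All functions are illustrative stand-ins.
    def crawl(sources: list) -> list:
        """Step 3: extract data from the sources and cache it as documents."""
        return [{"id": f"{src}/doc-{i}", "text": f"sample text {i}"}
                for src in sources for i in range(2)]

    def process(doc: dict) -> dict:
        """Steps 4-5: run text analytics and annotate the document."""
        doc["annotations"] = {"tokens": doc["text"].split()}
        return doc

    def index(collection: dict, doc: dict) -> None:
        """Step 6: save the processed document in the content collection."""
        collection[doc["id"]] = doc

    collection = {}                                 # step 1: the collection
    sources = ["/data/surveys", "/data/call-logs"]  # step 2: the crawl space

    for raw_doc in crawl(sources):                  # steps 3-6 repeat until done
        index(collection, process(raw_doc))

    print(len(collection), "documents ready for analysis")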

Exiting the flow

Along the flow, there are three points at which extracts can be created to export documents for importing into external systems (the last of these is sketched after the list):

  • After documents have been crawled and stored (in binary form)
  • After indexing (in binary form, with annotations added by the text analysis annotators)
  • After a search has been completed (from the miner, you can export the current search results set)
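
As an illustration of that last export point, the sketch below writes a result set out as a CSV file that an external system could import. The field names and file layout are assumptions for the example; the actual export formats and destinations are configured in the product.

    import csv

    # Illustrative only: dump a search result set to a CSV file that an
    # external system can import. Field names are assumptions.
    results = [
        {"id": "doc-2", "source": "/data/call-logs", "keywords": "brake;noise"},
        {"id": "doc-7", "source": "/data/surveys",   "keywords": "brake"},
    ]

    with open("export.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "source", "keywords"])
        writer.writeheader()
        writer.writerows(results)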

Deep inspection

This feature must be enabled for a collection by the administrator and is similar to exporting documents. You might use it when the number of keywords is so large that analyzing them would impact the system's performance while using the analytics miner. Deep inspections cannot be run through the miner.
