Before getting started with any new tool, it's wise to spend some time understanding how the tool is constructed and how it works. In this section, we'll go over the general architecture of IBM Watson and provide a short description of each architectural component, along with the flow between those components.
Watson is built using a robust content analytics architecture made up of the following components:
As the name implies, crawlers crawl through what is known as a crawl space (one or more defined data sources) and extract content from those sources. A Watson administrator can define rules that direct crawling behavior, that is, how a crawler affects system resources and which data sources it uses.
You can start and stop crawlers manually, or you can schedule a crawler by specifying when it should run for the first time and at what interval thereafter.
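The "first run plus interval" schedule described above can be sketched in a few lines. This is an illustrative sketch only, not Watson's scheduler; the `next_run` function and its parameters are hypothetical names chosen for this example.

```python
import math
from datetime import datetime, timedelta

def next_run(first_run, interval, now):
    """Return the next scheduled crawl time at or after `now`,
    given the first run time and the repeat interval."""
    if now <= first_run:
        return first_run
    # Count how many whole intervals have elapsed, rounding up,
    # then project forward from the first run.
    periods = math.ceil((now - first_run) / interval)
    return first_run + periods * interval

first = datetime(2024, 1, 1, 2, 0)   # first run: Jan 1 at 02:00
daily = timedelta(hours=24)          # repeat every 24 hours
print(next_run(first, daily, datetime(2024, 1, 3, 5, 0)))  # 2024-01-04 02:00:00
```

The same arithmetic covers both cases: before the first run the scheduler simply waits for it, and afterward it snaps forward to the next interval boundary.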
The document processor component processes crawled documents and prepares them for indexing by applying various text analytics to them. At a high level, each text analytic applied to a document annotates it with additional information (inferred from the document's content), which helps describe what the document contains and can make it more valuable and more effectively indexable.
The process of applying these text analytics can be thought of as a pipeline: crawled documents enter, they are parsed, and then a configured sequence of text analytics annotators processes each document and extracts the needed information from it.
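The pipeline idea can be sketched as follows. This is a minimal illustration of the parse-then-annotate pattern, not Watson's actual document processor; the `parse` function and both annotators are hypothetical examples.

```python
import re

def parse(raw_doc):
    # A real parser would tokenize and extract structure; here we
    # just wrap the text with an empty annotation list.
    return {"text": raw_doc, "annotations": []}

def date_annotator(doc):
    # Annotate ISO-style dates found in the text (illustrative only).
    for match in re.findall(r"\b\d{4}-\d{2}-\d{2}\b", doc["text"]):
        doc["annotations"].append(("date", match))
    return doc

def upper_keyword_annotator(doc):
    # Annotate all-uppercase words as keywords (illustrative only).
    for word in doc["text"].split():
        if word.isupper() and len(word) > 2:
            doc["annotations"].append(("keyword", word))
    return doc

PIPELINE = [date_annotator, upper_keyword_annotator]

def process(raw_doc):
    doc = parse(raw_doc)
    for annotator in PIPELINE:   # annotators run in configured order
        doc = annotator(doc)
    return doc

result = process("IBM reported results on 2013-04-18.")
print(result["annotations"])  # [('date', '2013-04-18'), ('keyword', 'IBM')]
```

The key point the sketch captures is that each annotator only adds information; the document that leaves the pipeline carries everything the original contained plus the inferred annotations, which is what makes it more indexable.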
The indexer component takes the parsed (or annotated) documents and builds an index on your content to improve performance during text mining and analysis. Once you start an indexer, it automatically indexes each document after the document processors have processed it.
Note that changes made to crawled documents can be included in the index either by manually rebuilding the index or by configuring the collection so that changes are automatically retrieved and incorporated into the index.
The search engine is a server-based component that handles all user search and analytics requests. The content analytics miner (explained in the next section) is an example of a client application that makes requests to the search engine. Depending on factors such as the size and user base of an environment, more than one search engine may be deployed.
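When more than one search engine is deployed, client requests must be spread across the instances. A minimal sketch of one common approach, round-robin dispatch, is below; the server names and the `dispatch` function are hypothetical, and a real client would send the query over the network rather than return a tuple.

```python
from itertools import cycle

# Hypothetical pool of search-engine instances (illustrative names).
servers = cycle(["search-1", "search-2"])

def dispatch(query):
    # Pick the next instance in round-robin order and record which
    # instance would handle the query.
    server = next(servers)
    return (server, query)

print(dispatch("warranty claims"))  # ('search-1', 'warranty claims')
print(dispatch("defect reports"))   # ('search-2', 'defect reports')
```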
IBM Watson uses indexed bodies of document content, called collections, for text mining and content analysis. Administrators create, configure, and manage the content analytics collections so that analysts can use the content analytics miner to analyze the data in those collections. The flow, through the components described earlier in this chapter, from the creation of a collection to the content's availability for analysis is as follows:
Steps 3 through 6 repeat continuously until all of the documents have been read by the crawlers and stored in the index of the specified collection.
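The repeating crawl, process, and index loop can be sketched end to end. This is an illustrative composition of the components described above, not Watson's implementation; all function names here are hypothetical.

```python
def crawl(sources):
    # Each source yields raw documents from the crawl space.
    for source in sources:
        yield from source

def process(doc):
    # Stand-in for the document-processor pipeline.
    return {"text": doc, "terms": set(doc.lower().split())}

def index(collection, doc_id, doc):
    # Stand-in for the indexer: add the document's terms to the
    # collection's inverted index.
    for term in doc["terms"]:
        collection.setdefault(term, set()).add(doc_id)

collection = {}
sources = [["warranty claim filed", "claim denied"]]
for i, raw in enumerate(crawl(sources)):   # the loop repeats per document
    index(collection, f"doc{i}", process(raw))

print(sorted(collection["claim"]))  # ['doc0', 'doc1']
```

Once the loop has consumed every document the crawlers can reach, the collection's index is complete and ready for search and analysis.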
Once you have built a content collection, it is immediately available for analysis.
Along this flow, there are three points where extracts can be created to export documents for import into external systems: