Designing a production imaging system
This chapter addresses production imaging system design using IBM Production Imaging Edition. The primary focus is on document capture, specifically the tools for handling discovery, requirements gathering, and functional design. This chapter also explores various design alternatives and weighs the advantages and disadvantages of each.
The other areas of the solution, primarily managing images and automating the business process, are equally important. Because these topics are explored in other IBM Redbooks publications, this chapter points you to resources that explore these areas in detail.
By the end of this chapter, we will have established the design for the sample application that we develop later in the book.
This chapter includes the following sections:
5.1, “Design goal of the production imaging system”
5.2, “Capture system design”
5.1 Design goal of the production imaging system
Exploring potential production imaging projects begins with identifying the goals of the business. At a high level, business goals can include the following examples:
Reduce costs.
Shorten the business process cycle time.
Improve service.
To achieve these goals, we design a system that uses the document capture, imaging repository, and business process management aspects of a production imaging system.
As indicated previously, this chapter explores the details of the document capture design. To gain a deeper understanding of the document repository and business process management design, see the following IBM Redbooks publications:
IBM FileNet P8 Platform and Architecture, SG24-7667
IBM FileNet Content Manager Implementation Best Practices and Recommendations, SG24-7547
Introducing IBM FileNet Business Process Manager, SG24-7509
If the current process uses paper documents, you can eliminate or improve many manual tasks, such as the following tasks, in the current business process to achieve the business goals:
Receiving documents, logging, counting, batching, date-stamping
Sorting documents for filing and distribution
Preparing file folders
Filing documents
Distributing documents for processing
Photocopying for distribution
Manual typing of data
Retrieving files from file cabinets
Searching through files to find documents
Matching documents against exceptions reports
Refiling documents and files
Pending and suspense file management
Keeping calendars or diaries to track follow-up documents
Searching for misplaced and lost files
Reconstructing lost files
Purging files and removing selected documents for disposition
Transporting documents to and from storage rooms or off-site storage
Filing internal forms or copies of correspondence
In addition to eliminating and improving manual tasks, eliminating paper offers many other potential savings, such as the following examples:
Storage-space savings from eliminating file storage areas
Office space savings (including lighting, heating, furniture, and so on)
Reduced on-site and off-site archive filing costs
Reduced workstation equipment and support costs
Reduced filing equipment costs
Reduced number of microfilm, cameras, processors, viewers, and consumables
Reduced number of photocopiers
Reduced maintenance for all equipment types listed previously
5.2 Capture system design
You must consider the requirements that are unique to the area of capture, including the following requirements:
Identifying how and where documents will be acquired, such as scanning, faxes, email messages, and imported documents
Identifying the types of documents the application will process and the page types associated with each document type
Deciding which data you want to capture from each page and which data might be manually typed or obtained by using database lookup
Specifying the business rules that determine whether the captured data is valid
Determining how to handle documents that are structurally invalid, pages that are not recognized, data that does not meet the business rules, or characters that are not recognized with high confidence
Deciding how you want to export the data and documents at the end of the workflow
Before starting implementation, you must define the business requirements through collaboration with the various stakeholders. Initially this task involves examining the documents that you want to process, determining which fields you need to capture, and deciding what to do with the data and document after you capture it.
If you process various document types, you must decide whether the documents are presorted or processed as a mixed batch. If they are presorted, you might be able to simplify implementation by processing each type independently, with a separate application, workflow, or job for each type. However, if the documents are mixed batches, you need a more sophisticated system of page identification and document assembly.
Although the goal is to create a fully automated system, inevitably manual intervention is required at some points. The business requirements must specify how to determine if the information is accurate and how to handle exceptions with the data or the process.
In the early stages of a deployed capture system, it is common to review documents even when they pass validation, to ensure that the system is doing what is expected in a production environment. As confidence in the new capture system grows, the review can be reduced to only the pages with issues.
One way to look at the design is to consider three categories: document hierarchy, processing tasks, and capture workflow. The document hierarchy defines the structure of the content that we process. The processing tasks perform the work of the capture system, such as scanning, identifying pages, and recognizing data. The capture workflow sequences the tasks into processes that handle the needs of different functional areas or input channels.
The Taskmaster Application Development Guide, SC19-3251, provides extensive tutorials for each of these areas. It is included in the Production Imaging Edition product documentation. Read this guide before you design and implement any capture system.
5.2.1 Document hierarchy
Taskmaster rules operate on batches, documents, pages, and fields. In Taskmaster, this structure is called the Document Hierarchy (DCO). The DCO is a core element of the design of the capture system. In addition to defining structure, the DCO provides the information that the system needs to assemble documents. It also enforces the integrity of the batches, documents, pages, and fields by using the information in the DCO. An application can have many DCOs to accommodate applications that require different classes of document structures.
The fictional company in this book, Fictional Insurance Company A, has an application with a simple document hierarchy. It has three document types: auto claims, estimates, and invoices. Within these document types, specific types of pages might occur only in a specific sequence. Because an auto claim can have more than one page, the company can define the first page as a unique page type. This way, the system can determine where an auto claim document begins and can enforce integrity. If the pages are reordered so that the wrong type of page is first, the document is flagged as invalid.
Beneath the batch level, the document hierarchy defines the following information:
The document types that the application can process. You may have only one type, or you might have multiple types. For example, the auto claims application processes claims, accident reports, invoices, estimates, and police reports document types.
The page types within each document type. Each document might have only one page type, or it might have multiple types. The accident report document type includes three types of pages: front page, diagram page, and signature page.
The number and order of pages within each document type. Pages can be required or optional. The accident report document has three pages. An invoice can contain one or many pages.
The data fields within each page type. Data fields can also be required or optional. The invoice document has single fields such as invoice number, invoice date, and vendor number. It also includes repeating line items with fields such as item number, item description, quantity, unit cost, and total cost.
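The hierarchy described above can be sketched as plain data. This is an illustrative model only, not the actual Datacap Studio configuration format; the type and field names are taken from the examples in this section.

```python
# Minimal sketch of a document hierarchy (DCO) as nested Python data.
# Structure: document types -> page types -> data fields.
document_hierarchy = {
    "Accident Report": {
        "pages": ["Front Page", "Diagram Page", "Signature Page"],
        "fields": {},
    },
    "Invoice": {
        # An invoice can contain one or many pages of the same type.
        "pages": ["Invoice Page"],
        "fields": {
            "Invoice Page": ["Invoice Number", "Invoice Date", "Vendor Number"],
        },
    },
}

def page_types(doc_type):
    """Return the page types defined for a document type."""
    return document_hierarchy[doc_type]["pages"]
```

A lookup such as `page_types("Accident Report")` returns the three page types listed above, which is the information the system later uses for page identification and document assembly.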
In this scenario, after gathering information about the document types and their properties, we design a document hierarchy and enter it into the system by using Datacap Studio. For details, see Chapter 6, “Implementing the capture solution” on page 179.
5.2.2 Capture processing tasks
In many instances, documents are acquired as a stream of pages where little information is known about the structure or content of the pages. Initially, the type of document and the processes that need to occur to correctly handle the document are unknown. For example, when documents are scanned, the input to the capture system might only provide a series of image files and the type of batch. The job of the capture system is to make sense of the images and perform a series of tasks that process them appropriately.
What tasks are involved in the capture process? Typically, the tasks include extracting useful data from the input, validating the input, formatting documents, and outputting the data and documents to business systems. Poorly scanned images and documents of various types that require human review might be mixed together. These factors introduce exceptions and variations that must be detected and processed effectively.
The capture process includes the following essential elements, which are almost always incorporated into the capture system design in the order shown:
1. Document acquisition
Documents are input into the system by scanning, faxing, importing, emailing, or web services.
2. Image enhancement
Images can be enhanced to improve recognition and readability and to reduce file size. This enhancement can be done at a scanner by using the built-in capabilities of the hardware or driver. Alternatively, it can be done by using the Taskmaster image enhancement features.
3. Page identification
The type of each page must be identified (classified), automatically or manually. For example, a barcode can be used to automatically identify a page. A document often consists of a specific type of leading page followed by one or many trailing pages.
4. Document assembly
The capture system assembles multiple images into documents because a single scanned batch or fax transmission can contain multiple documents. Information such as the page types, number of pages, and order of the pages provides the basis for automating document assembly. The document type is typically determined automatically by using the document creation function.
5. Recognition
Recognition includes using Optical Character Recognition (OCR), Intelligent Character Recognition (ICR), Optical Mark Recognition (OMR), barcode recognition, or database lookups to lift data and supplement the data with additional information.
6. Fingerprinting
Fingerprinting is commonly used to differentiate between multiple formats of the same page type. Fingerprinting matches the best variation on a page type and captures the offset that is needed to adjust an image for locating data accurately.
7. Locating data
Data in the text on a page can be located in zones, by using keyword searches, or through regular expressions.
8. Validation
Extracting data by using any of the recognition methods has inherent limitations for many reasons. Examples of such reasons include a damaged source document, poorly scanned document, poorly printed document, and inaccurately entered data. Validation of the data is essential to obtain accurate results by using such techniques as check digits, length checks, format checks, cross-totaling calculations, value comparison, and data lookups.
9. Routing
When exceptions occur, routing is used to queue batches or documents for exception handling. For example, a document that is missing a page or poorly captured might need to be fixed at a scanning workstation.
10. Verification
Often a design goal of the capture system is to reduce or eliminate manual verification. However, when low confidence results or validation errors exist, they might need to be handled by human operators. Correct results need to be confirmed, and errors need to be resolved. Verification can include typing from recognition, typing from image when recognition is not used, and typing from documents when the documents are not digitized.
11. Export
The system transfers documents and the data to external systems (for example, IBM FileNet Content Manager) where they are stored and processed by the business. Extracted data is exported to XML files or databases to update applications.
 
Sequence of the design elements: Although most capture system design follows the previous order, the process can be done in multiple ways. For example, you can use fingerprinting (step 6 on page 144) as page identification (step 3 on page 144) before recognition (step 5 on page 144). Although fingerprinting is not a preferred way of page identification, it can be used.
Page identification: Multiple methods of page identification (PageID) are possible. Using fingerprinting as mentioned previously is one such method.
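As an illustration of the validation techniques in step 8, the following sketch shows a check-digit test (using the Luhn algorithm as an assumed example) and a format check. These are generic routines, not Taskmaster validation actions; the claim number format is hypothetical.

```python
import re

def luhn_valid(number: str) -> bool:
    """Check-digit validation using the Luhn algorithm."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Double every second digit from the right; subtract 9 if the result > 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def valid_claim_number(value: str) -> bool:
    """Format check for a hypothetical claim number: CLM- plus 6 digits."""
    return re.fullmatch(r"CLM-\d{6}", value) is not None
```

A recognized field that fails either check would be routed to a human operator for verification rather than exported as-is.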
With Taskmaster, these elements are implemented as rules. Rules are run by the Taskmaster Rulerunner service. This method provides a flexible way to implement all of the variations and exceptions that are encountered when capturing scanned or electronic documents.
In this scenario, many processing tasks and rules are already defined in Taskmaster. We only need to adjust these tasks to meet the specific document structure. Taskmaster unifies the task definitions with the document hierarchy. We configure rules and tasks with the same tool, Datacap Studio. For more information about using Datacap to configure rules and tasks, see Chapter 6, “Implementing the capture solution” on page 179.
5.2.3 Capture workflow
During the data capture process, documents go through a workflow that consists of several tasks. Some tasks require operator intervention, whereas others run automatically. A workflow job consists of a series of tasks and defines a way to process a particular batch of documents. Because tasks can be reused in multiple jobs, we can add as many jobs as we need to handle the processing scenarios in this book. The design must include workflow jobs that specify and execute the capture process.
For example, we might have several input channels, such as scan, fax, and email, for the same types of documents. We can construct three workflow jobs, one for each input channel, and have each job share tasks for recognition, data extraction, and export.
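The shared-task arrangement can be sketched as follows. The job and task names are illustrative placeholders, not actual Taskmaster task names.

```python
# Three input-channel jobs that share the same downstream tasks.
SHARED_TASKS = ["PageID", "Recognize", "Validate", "Export"]

JOBS = {
    "Scan Job":  ["Scan"]   + SHARED_TASKS,
    "Fax Job":   ["FaxIn"]  + SHARED_TASKS,
    "Email Job": ["MailIn"] + SHARED_TASKS,
}

def tasks_for(job_name):
    """Return the ordered task list for a workflow job."""
    return JOBS[job_name]
```

Only the acquisition task differs per channel; a change to a shared task, such as a new validation rule, applies to all three jobs at once.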
In addition to defining the process flow, the workflow also implements functional security so that we can determine who can access the work in progress and who can perform specific tasks with particular types of documents. For example, processing claims for motorcycles might be similar to handling claims for automobiles. However, the people who verify the documents might be in a different department. A separate workflow might be used to accommodate this difference.
For a definition of the capture workflow for the scenario in this book, see Chapter 6, “Implementing the capture solution” on page 179.
5.2.4 Capture design considerations
This section highlights the areas to consider when designing a capture system and indicates the alternatives that are available. You can select from multiple options depending on your business and technical requirements.
Document acquisition
Taskmaster services various input channels that deliver documents in several formats. Channels or methods of capturing documents include scan, fax, email, file import, and a web service. Some of the considerations for each channel are provided in the following sections.
Scanning
Direct scanning is typically done by internal users. Both thick and thin client options are available.
Thick client scanning is used in centralized scanning operations that use mid-range to high-volume scanners that have heavy-duty cycles. These types of scanners support continuous operation in multiple shifts. Even though a lower-cost scanner might have a high scan rate, it might not be designed for continuous operation.
The Taskmaster scanning user interface is production-oriented to support highly efficient operation of the scanner. In this environment, scanners are operated nonstop. Scan operators occasionally check the scan quality of images. The goal is to maximize the throughput of the production-level scanners.
Thin client scanning can also be used in centralized scanning operations but is more commonly deployed for distributed capture. Common deployment models are dedicated scanning stations connected to mid-volume production scanners and user workstations connected to low-volume scanners.
Multifunction devices (MFD) have integrated scanner, printer, copier, and fax capabilities. Production-level MFDs can be operated as stand-alone devices without being connected to a workstation. In this mode, the MFD control panel is used to control the scanner. Images can be transferred to a well-defined storage location by using the network filing, File Transfer Protocol (FTP), or email functions of the device. Taskmaster can import and process the documents by using its virtual scanning and email import actions.
One area of common confusion is the difference between thin client scanning and operating an MFD directly. If you scan with an MFD by using thin client scanning, the MFD is connected to a workstation by using a TWAIN driver and the web user interface on the workstation provides the scanning control panel. This method is used with lower-end desktop MFDs and is not used with higher-end production MFDs.
Consider the following additional factors:
Many current generation devices include image enhancement features that are run within the scanner hardware or in the scanner driver. In either case, the resulting image might have improved readability, improved recognition results, and reduced file size.
Scanners are available for specialized purposes, such as remittance scanning and large format document scanning. These devices might not have Image and Scanner Interface Specification (ISIS) or TWAIN drivers. Therefore, they interface with Taskmaster by using import features.
If you expect an MFD to be used full-time as a scanning device, consider using a dedicated scanner instead. Production scanning can handle larger scan jobs that might occupy an MFD that needs to be shared by a workgroup.
Fax
Production Imaging Edition software works with fax server products so that documents that are sent to a fax server can be imported into the capture system and processed in the same manner as scanned documents. The Production Imaging Edition software does not include a fax server.
Fax is typically used by external users. The trend in many organizations is to reduce the internal use of fax for capturing documents. This trend is due to the lower quality of the image and the greater time needed to send a fax compared to remote scanning. However, because fax requires low bandwidth, its use is common in situations where only dial-up connections are feasible.
The primary disadvantage of fax is low image quality. The quality of the equipment varies, resulting in inconsistent image quality, and fax image resolution is low. Standard mode scans at 200 or 204 lines per inch horizontally and 100 or 98 lines per inch vertically. Fine mode scans at 200 or 204 lines per inch horizontally and 200 or 196 lines per inch vertically.
Each fax transmission is received as a single TIFF or PDF file that contains multiple images. Taskmaster can burst the file into individual image pages for processing by the system. The image enhancement actions improve the ability of the system to recognize text. Taskmaster can normalize the dimensions of the image so that all the images are 200 dpi in both dimensions. It can also compress images to the TIFF Group 4 format.
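The dimension normalization described above is simple arithmetic on the pixel dimensions. The following sketch is not the Taskmaster API; it only shows the target size that resampling a standard-mode fax (204 x 98 dpi) to a uniform 200 dpi would produce.

```python
def normalized_size(width_px, height_px, x_dpi, y_dpi, target_dpi=200):
    """Return the pixel dimensions after resampling an image to target_dpi
    in both directions."""
    new_w = round(width_px * target_dpi / x_dpi)
    new_h = round(height_px * target_dpi / y_dpi)
    return new_w, new_h
```

For example, a letter-size standard-mode fax page is 1734 x 1078 pixels (8.5 in x 204 dpi by 11 in x 98 dpi); normalized to 200 dpi, it becomes 1700 x 2200 pixels with square pixels.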
Email
Taskmaster can capture and process email messages and their attached files. In addition to scanned images, Taskmaster can accept various electronic formats, such as word-processing documents and spreadsheets. Electronic documents can be converted to TIFF by Taskmaster so that they can be processed as images for data extraction and export.
Consider the following common scenarios for using email:
Trailing documents can be received directly from customers or other external parties. In the scenario in this book about the insurance company, customers who register a claim can be allowed to send supporting documents by email to a service email account.
Email can be used as a replacement for fax as a way to transmit scanned or electronic documents.
Email can be used to interface with MFDs.
File import
File import is a common method for inputting files into the system. The virtual scan (VScan) features of Taskmaster are used to import files. File import can be done in an attended or unattended mode. In an attended mode, a user starts the virtual scan by using the thin- or thick-client user interface. In an unattended mode, the virtual scan is run by the Rulerunner service, which runs as a Microsoft Windows service.
Consider the following common scenarios for using file import:
Receiving images from an external party. For example, a financial institution might receive loan file images as part of the process for purchasing loans from another financial institution.
Receiving images scanned by a scanning service. For example, large quantities of documents might be scanned by a third-party service as part of a backfile conversion.
Interfacing with fax or MFDs.
Interfacing with a scanner that does not have a TWAIN or ISIS driver. Some specialized scanners operate in this fashion.
Web service
Taskmaster exposes the document processing capabilities as a web service. The web service can run the background document processing tasks. This method is used by software applications that need to process documents.
Consider the following common scenarios for using a web service:
Processing previously scanned and stored documents that were never processed for recognition. A bank that stored loan documents when a loan was originated might want to perform data extraction on the same documents years later when the loan is modified.
Providing a service where documents can be processed in an ad hoc manner. An organization might provide a service to upload documents for recognition and transformation through a web application or portal.
Centralized capture
With centralized capture, dedicated staff and equipment process documents in a factory-like setting. Documents are mailed or delivered to the central location where documents are prepared into controlled batches. Batches are scanned on high-speed scanners. Other tasks, such as indexing, data entry, and fixup, are performed on separate workstations so that each task is optimized and labor and other resources are used efficiently at the central location.
Centralized capture offers the following advantages:
Economy of scale
Standardized processes
Dedicated trained personnel who only do capture-related tasks
Easier to maintain image quality controls
Availability of original documents to verify authenticity
Centralized capture has the following disadvantages:
Documents must be delivered to a central location.
Users understand less about the documents.
Corrections might require returning documents to the sender or interacting with remote users to correct problems.
Decentralized capture
With decentralized capture, remote offices or individuals scan, fax, and process documents, but they do not send the paper to a central location. Staff is not dedicated to performing capture activities. Capture might be done directly by the customer or by an external business partner.
Decentralized capture has the following advantages:
Documents do not need to be mailed or shipped to a central location.
Documents are stored into the repository more quickly.
Users can correct errors immediately.
Users understand the documents and can more accurately enter and correct data.
Work can be offloaded to a partner or customer by using self-service.
Decentralized capture has the following disadvantages:
Equipment is needed at each location.
It is harder to maintain standardized processes.
More users need to be trained.
Users do not perform capture functions all the time and, therefore, do not handle the tasks as efficiently.
Image quality varies, and image quality issues are more difficult to correct.
Authenticity is more difficult to verify.
In many instances, organizations use a blend of these models. The capture system needs to accommodate the constraints and demands of the business. Organizations might have multiple applications that require one model, the other, or both. Fortunately, in this scenario, the production imaging system accommodates both models, so we can decide which method works best for each application.
We must also consider the network capacity to determine if it is sufficient to handle the required load. In some locations with low bandwidth, networks might need additional bandwidth to accommodate higher volumes of imaging network traffic.
In either scenario, the background processing of documents is handled centrally by using the Rulerunner service. Background processing includes image enhancement, OCR or ICR, format conversion from input or for export, and export. Because these are processor-intensive activities, they are handled most effectively on servers or high-end workstations. In this manner, client workstations do not need software installed to perform these functions.
In the scenario in this book, we use both centralized and decentralized capture. Policy holders can send signed claim forms and follow-up documents remotely by email or fax directly to the capture system. Repair shops can submit invoices by fax or email. In both cases, the documents are verified and processed centrally. Regional offices and agents capture paper documents with web-based scanners and MFDs, scanning and verifying the documents remotely by using Taskmaster Web. Each party can still choose to send paper documents to the capture center for central processing.
Image enhancement
Images can be enhanced to improve recognition and readability and to reduce file size. Image enhancement is most important when using OCR and ICR or to improve the format of faxed images. Taskmaster includes image enhancement capabilities for this purpose. The current generation of document scanners often includes image enhancement capabilities in the hardware or scan driver that can be configured in the scanning user interface. Use the capabilities of your scanning hardware for image enhancement, and supplement those capabilities with the Taskmaster enhancement features.
Page ID and document assembly
Page ID and document assembly are often referred to collectively as classification. Page identification is the process of identifying the type of each page in a batch. Document assembly is the process of determining where each document begins and ends, creating documents from the stream of pages.
Orchestrated classification
Taskmaster performs automated classification by using the orchestrated classification technique. Orchestrated classification uses page identification rules, document integrity rules, and document creation rules to automate the classification process. Classification can also be done manually in a scanning or verification user interface.
Orchestrated classification uses a set of rules that takes a stream of pages. Then it optionally enhances images, identifies each type of page using one of many methods, creates documents from the pages, and validates the resulting structure. All the classification processing can occur in a single module in one workflow step. If necessary, you can have multiple types of classification modules. Classification can use any of the processing actions in Taskmaster.
Page identification
Documents are created and separated based on the page types and a set of document integrity rules. Page types can be determined by one of the following methods:
Barcode
Pattern match using image anchors
Pattern match using text anchors
Match image-based fingerprint
Match text-based fingerprint
Match regular expressions to recognized text
Document structure using rules
Consider using barcodes as the primary method of page identification for forms that you control. When you do not control the layout of the form, you can use the other page identification methods depending on the characteristics of the pages.
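The layered approach described above, barcode first with other methods as fallbacks, can be sketched as a simple dispatcher. The page dictionary, barcode values, and text pattern below are hypothetical stand-ins for whatever the real system provides.

```python
import re

# Illustrative barcode-to-page-type mapping for forms that you control.
BARCODE_MAP = {"CLAIM01": "Auto Claim Page", "INV01": "Invoice Page"}

def identify_page(page):
    """Return a page type for a page dict, or None if unrecognized."""
    # 1. Barcode: the most reliable method for forms that you control.
    if page.get("barcode") in BARCODE_MAP:
        return BARCODE_MAP[page["barcode"]]
    # 2. Fallback: regular expression against recognized text.
    text = page.get("text", "")
    if re.search(r"\bACCIDENT REPORT\b", text, re.IGNORECASE):
        return "Accident Report Front Page"
    return None
```

A page that returns `None` would be flagged for manual identification in a verification user interface.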
Document assembly
In Taskmaster, the system determines document separation and document type by matching the document hierarchy to the identified pages. After pages are identified, Taskmaster uses the information in the document hierarchy to determine the correct document type. For example, a page of type “Accident Diagram Page” is part of an “Accident Report” document, whereas a page of type “Auto Claim Page” is part of an “Auto Claim Form” document.
Each page has the following variables that define the structure of the parent document:
Maximum number of pages of this type for each document (0 means no maximum)
Minimum number of pages of this type for each document (0 means no minimum)
Order, which is the position of this page relative to other pages in the same document (0 means any position)
Taskmaster uses the information in the document hierarchy to assemble individual pages into multipage documents.
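A minimal sketch of how these per-page-type variables enforce document integrity follows. The page-type names and constraint values are illustrative; the zero conventions match the list above (0 means no maximum, no minimum, or any position).

```python
# page type -> (min, max, order); 0 means unconstrained.
CONSTRAINTS = {
    "Claim Front": (1, 1, 1),     # exactly one, must be the first page
    "Claim Trailing": (0, 0, 0),  # optional, unlimited, any position
}

def document_is_valid(pages):
    """pages: ordered list of page-type names for one assembled document."""
    for ptype, (pmin, pmax, order) in CONSTRAINTS.items():
        count = pages.count(ptype)
        if pmin and count < pmin:
            return False          # required page type is missing
        if pmax and count > pmax:
            return False          # too many pages of this type
        if order and (len(pages) < order or pages[order - 1] != ptype):
            return False          # page is not in its required position
    return True
```

With these constraints, a document that starts with a trailing page, or that lacks a front page entirely, is flagged as invalid, which matches the behavior described for the auto claim example earlier in this chapter.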
A common approach to document separation is to insert barcode sheets between documents or to print barcodes on the first page of each document. During scanning, several documents of varying lengths can be scanned in a batch, with barcoded sheets separating each individual document. The system automatically saves the documents as separate documents. Barcoding can also be used to identify the type of document or pages.
Position barcodes vertically. Keep in mind that scanners and fax machines can produce vertical lines on a page when dirt is on the scanning sensor. If such a line runs parallel to the bars of the barcode, it can make the barcode unreadable. If the line runs perpendicular across the bars, the barcode remains readable.
Recognition, fingerprinting, and locating data
Recognition is used to read data from images by using OCR, ICR, OMR, or barcode technology. Recognition is used in two primary use cases: to automate document indexing or to reduce data entry typing.
Indexing is the process of identifying the documents stored in FileNet Content Manager. Documents are identified with properties that are stored in a content engine catalog. The process of entering these properties is called indexing. Users search for documents by using these properties. As a result, these properties must clearly identify each document with information, such as claim number, invoice number, customer ID, and vendor name. Usually only a few properties are used to index a document.
Data entry is the process of typing data into a database or application system. Documents can contain dozens or hundreds of fields of data on many pages. In a manual process, users type data by looking at the paper document or at an image of the document in a window. Typing from a window is called “type from image”.
When we design the capture system in the scenario in this book, we use recognition features to read the data from images so that we can reduce the amount of manual typing.
Data recognition and extraction can be highly accurate when certain conditions are met. An understanding of the document characteristics is vital because you need to choose the technique, or combination of techniques, that is most effective for the types of documents that you have.
Documents come in two classes: documents that you control and documents that you do not control. Documents that you control are often internally generated documents that can be redesigned to make recognition more effective. A document that is not designed for recognition can be more difficult to process and can yield lower confidence from the recognition process. Documents that are outside of your control generally cannot be redesigned, so you must accommodate the existing format.
When you control the layout of your forms, a good practice is to redesign forms to improve recognition. In some cases, this redesign might be necessary to achieve high confidence recognition results. Some forms might need minor changes to improve results, but others might need extensive redesign. Engaging a form design expert is one way to achieve the best design for your documents.
The following factors can improve results:
Use barcodes to identify document types and prepopulate a document with data. For example, internally generated documents can be printed with barcodes on them to identify the document type and indexing data. When they are returned, you automatically recognize the document type and indexing data rather than manually type the data. The business process capabilities have features that match the document to a pending task that is waiting for the document to arrive. Using barcodes in this way is included in our use-case application scenario.
Use machine print whenever possible. Forms that are completed and printed online generally contain printing that is easy to extract. Prefilled data on printed forms can also be easy to extract.
Use hand printing only in controlled circumstances. Handprinted information must be printed in boxes or other guides that show the user where to enter each individual character. Other types of handprint require specialized software beyond the scope of this publication.
Clearly identify data locations by using unambiguous prompts or by using specific zones on the page.
Include multiple fields that cross-check the data, or use data that is designed to self-verify by using check digits.
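As an example of the self-verifying data mentioned above, the widely used mod-10 (Luhn) check digit can be validated in a few lines; this is a generic sketch, not a Taskmaster-specific rule.

```python
def luhn_valid(number: str) -> bool:
    """Validate a numeric field whose last digit is a Luhn check digit."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) != len(number) or not digits:
        return False  # reject empty or non-numeric input
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("79927398713"))  # True: standard Luhn test number
print(luhn_valid("79927398710"))  # False: wrong check digit
```

A field that fails such a check can be flagged for verification without any database lookup, which is why check digits are worth designing into controlled forms.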
Fingerprinting
Within a specific document type, many variations can exist in the format and layout of the printed information on the page. These variations are not minor shifts in position on a page. Rather, they are the wider variations that come from a different version of a form or from documents created by outside parties where you cannot control the layout. In the design for our scenario, we must determine which method to use to handle these variations.
Fingerprinting is a technique that Taskmaster uses to differentiate among multiple formats of the same page type. Fingerprinting matches a page to the best-fitting variation of the page type and captures the offset needed to adjust the image so that data can be located accurately.
For highly structured documents, you can also use image or text-based anchor fields. These marks are on the page, in specific positions. This approach is effective for fixed-format forms where you have control of the forms design.
If you do not use fingerprinting or anchors, you can still deal with the format variation by using keyword searches and regular expressions to find data within the full text of a page.
In the design, we must examine all of the different types of documents and decide which approach is most effective. The rule of thumb is that, if you can hold two different pages 10 feet away and see which page is different, you can use fingerprinting to identify the layout.
Location techniques
Data in text on a page can be located in zones or by using keyword searches and regular expressions. These techniques can be used in combination to handle forms that have both fixed and variable data locations.
If you use fingerprints or anchors, then you can accurately register the location of zoned fields. Zones can be prepared in advance or dynamically. The most flexible option is Intellocate, which allows Taskmaster to learn page layouts from users. It uses a hybrid approach that combines both zones and text searching.
When we design a system, we must examine the individual documents to determine which location technique can be used for the data on our documents.
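A keyword-plus-regular-expression location, as described above, can be sketched in a few lines of Python; the labels and value patterns here are illustrative assumptions, not Taskmaster syntax.

```python
import re

# Locate a field in full-page OCR text by searching for a keyword label
# followed by a value that matches an expected pattern.
def locate_field(page_text, label, value_pattern):
    match = re.search(re.escape(label) + r"[:\s]*(" + value_pattern + ")",
                      page_text, re.IGNORECASE)
    return match.group(1) if match else None

page = "ACME Corp\nInvoice No: INV-004521\nDate: 03/14/2011\nTotal: 1,250.00"
print(locate_field(page, "Invoice No", r"INV-\d{6}"))    # INV-004521
print(locate_field(page, "Date", r"\d{2}/\d{2}/\d{4}"))  # 03/14/2011
```

Because the search keys on the label rather than on a fixed position, this technique tolerates layout variation that would defeat a purely zonal approach.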
Intellocate
Intellocate is technology that allows Taskmaster applications to learn. Location rules automatically locate some of the data on a document by using keyword searches or regular expressions. Information that Intellocate cannot find automatically can be identified and captured quickly and easily by a verify operator using the Click’n’Key capability.
With Click’n’Key, the operator clicks the words on the image, and the data is entered into the data field. Behind the scenes, the system remembers the locations where the user clicked. When this task is complete, Intellocate saves the zones for the fingerprint. Then, the next time a similar document is encountered, the fingerprint is matched, and all of the data is read by the zones.
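Conceptually, the learning step can be sketched as storing the clicked zone per fingerprint; this illustrates the idea only and is not the Intellocate implementation. The fingerprint IDs, field names, and zone coordinates are invented for the example.

```python
# Remember the zone a verify operator clicked for each field, keyed by
# fingerprint ID, so that a later document matching the same fingerprint
# can be read zonally without operator help.
learned_zones = {}  # fingerprint_id -> {field_name: (x, y, width, height)}

def record_click(fingerprint_id, field_name, zone):
    learned_zones.setdefault(fingerprint_id, {})[field_name] = zone

def zones_for(fingerprint_id):
    return learned_zones.get(fingerprint_id, {})

record_click(1042, "InvoiceNumber", (310, 120, 180, 28))
record_click(1042, "InvoiceTotal", (540, 910, 140, 28))

# The next document matching fingerprint 1042 is read from the saved zones.
print(zones_for(1042)["InvoiceNumber"])  # (310, 120, 180, 28)
```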
For more information about Intellocate techniques, see Chapter 10, “Dynamic technologies” on page 305.
Data validation
The purpose of validation is to determine whether captured data conforms to specified business rules, as in the following examples:
Does an expense lie within permitted limits?
Are dates valid and within a permitted range?
Is the total cost calculated correctly?
Does the vendor information match the information stored in a database of approved vendors?
Does a field value match one of a set of permitted values?
Taskmaster performs validation by using rules that you create and attach to specific items in the document hierarchy. For example, to check whether an expense lies within permitted limits, you might first create a rule that performs the following tasks:
Ensures that the expense field contains numeric data in a valid currency format
Determines if the value is less than or equal to the maximum permitted limit
Performs exception handling if the value is invalid or higher than the permitted limit
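A minimal Python sketch of such a rule follows; Taskmaster implements rules in its own rule framework, and the limit and currency format here are illustrative assumptions.

```python
import re

MAX_EXPENSE = 500.00  # illustrative permitted limit

def validate_expense(raw_value):
    """Return (ok, message), mirroring the three rule tasks above:
    format check, limit check, and exception handling."""
    if not re.fullmatch(r"\$?\d{1,3}(,\d{3})*(\.\d{2})?", raw_value.strip()):
        return False, "not a valid currency format"
    amount = float(raw_value.strip().lstrip("$").replace(",", ""))
    if amount > MAX_EXPENSE:
        return False, f"exceeds permitted limit of {MAX_EXPENSE:.2f}"
    return True, "ok"

print(validate_expense("$1,250.00"))  # (False, 'exceeds permitted limit of 500.00')
print(validate_expense("249.99"))     # (True, 'ok')
print(validate_expense("12,5"))       # (False, 'not a valid currency format')
```

A failed check typically flags the field so that verification workflow routes the document to an operator rather than rejecting it outright.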
OCR and ICR read only what was entered on the page; a user can still write or print invalid data on the document. The scope of the capture process usually includes validating both that the data was correctly read from the page and that the content of the data is valid.
Validation can include simple field-level checks, field cross-checking, and lookups to external data sources. At the field level, it checks the ranges of values, valid format (for example, dates), check digits, choice lists, and so on. Cross-checking can include totaling columns and checking against total amounts, such as checking the line item detail total against an invoice total field. Taskmaster can query database tables to look up valid values. Lookups are used to check account numbers, product codes, and other sorts of master data.
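For example, the line-item cross-check mentioned above can be sketched as follows; the tolerance and data layout are illustrative assumptions.

```python
# Cross-check sketch: total the line item amounts and compare against the
# invoice total field, tolerating floating-point rounding to the cent.
def crosscheck_total(line_items, invoice_total):
    computed = round(sum(qty * price for qty, price in line_items), 2)
    return abs(computed - invoice_total) < 0.005, computed

items = [(2, 19.95), (1, 100.00), (3, 6.70)]
print(crosscheck_total(items, 160.00))  # (True, 160.0)
print(crosscheck_total(items, 165.00))  # (False, 160.0): totals disagree
```

A disagreement does not say which side is wrong; it flags the document so that an operator can compare the line items and the total field on the image.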
Depending on the business use of the data, the application might need absolute data accuracy, but sometimes applications might want to accept lower confidence data.
For example, if you are processing financial transactions, you likely want the data to be accurate. However, if you are processing survey cards, you might accept incomplete or lower confidence responses because you are more interested in receiving as many responses as possible.
For each type of page and for each field, you must determine the level of confidence that is flagged by the system and displayed to an operator for verification.
Validation is run in the background task after recognition and at the Verify task. Data that does not conform to business rules is flagged. Documents that do not conform to business rules are routed for verification by using workflow. Data can be flagged down to the character level. Validation is also executed from the verification user interface when a user types corrections.
Routing
Workflow routes exceptions for manual verification. You can route an entire batch, or you can split batches so that problem documents are handled in a separate batch from the good documents.
In the insurance company scenario, we must determine the types of exceptions that we will handle and who will handle them. Common types of exceptions include rescanning, page identification, data verification, and data exceptions.
Verification
During verification, an operator views data entry panels and document pages for manual checking, for possible correction, and to type data. You want to display pages to an operator when one of the following primary conditions exists:
The batch failed document integrity checking during document assembly.
A page contains one or more characters or OMR fields that were marked “low confidence” by the recognition engine.
A validation rule failed, indicating that the data does not conform to business rules.
The application does not use recognition, and the verification screens are used to type data from the image or the document, whether for indexing or for data entry purposes.
When a batch fails document integrity checking during document assembly, a user can manually identify pages by using a special verification task called Flex ID. Flex ID displays thumbnail images of the pages; the user rearranges the thumbnails and selects the page type for unidentified pages.
The other conditions require the user to enter or correct data. Several thin- and thick-client verification user interface options are available. All of these options display the image, the data fields, and snippets of images where data is on the image page.
Take a single-pass approach to verification. Some other systems promote a two-pass approach where individual character-level corrections are handled in the first pass, and in a second pass, field-level corrections are made. Our experience is that a single-pass approach is more efficient. The user interface has keyboard shortcuts that navigate efficiently at the character level, making a separate first pass unnecessary.
You can control what is displayed to the user in the design and configuration. As part of the design in this scenario, we decide what level of verification is needed. Many options are available. For example, we can display every document and page, only pages where we have data, the first page of a document, only documents and pages with exceptions, and pages that do not conform to business rules. This setting is a business decision and varies depending on factors such as the types of documents, business controls, or the comfort level of the user with automating the process.
Multipass verification
To satisfy business requirements, you can consider whether you want more than one person to verify the data. Multipass verification can display the same page to multiple operators to ensure accurate data entry and verification. In some cases, the business financial controls require a separation of duties that requires more than one user to enter or validate specific data fields.
Taskmaster supports two main implementations of multipass verification: two pass and double blind. Other implementations are possible, but this book focuses on these two that are supported by the standard user interfaces.
In two-pass verification, the following process occurs:
1. An operator (or a recognition engine) enters the initial value for each field.
2. Taskmaster displays the page to a second operator but hides the initial values. The operator enters a new value for each field. If using a recognition engine to implement the first pass, you might choose to show only low confidence fields to the operator.
3. For each field, Taskmaster compares the new value to the initial value. If the values match, Taskmaster accepts the value. Otherwise, the operator must re-enter the value. Taskmaster accepts the value only after the operator enters the same value two times consecutively.
In double-blind verification, the following process occurs:
1. An operator (or a recognition engine) enters the initial data values.
2. Taskmaster displays the page to a second operator but hides the initial values. The operator enters a new value for each field, and Taskmaster saves all the values (no comparing).
3. Taskmaster displays the page to a third operator. The operator can see both the initial value and the second value.
4. For fields where the initial value and the second value are different, the operator must select which value is correct or enter a new value. If entering a new value, the operator must enter the same value two times consecutively.
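The comparison logic at the heart of both schemes can be sketched as follows; the field names and data structures are illustrative assumptions.

```python
# Given the initial and second-pass values for each field, return the
# set of fields that a third operator must arbitrate (double blind) or
# that a two-pass operator must re-enter.
def fields_needing_arbitration(initial, second):
    return {f for f in initial if initial[f] != second.get(f)}

pass1 = {"ClaimNumber": "CL-10443", "PolicyNumber": "P-889", "Amount": "125.00"}
pass2 = {"ClaimNumber": "CL-10443", "PolicyNumber": "P-899", "Amount": "125.00"}

print(sorted(fields_needing_arbitration(pass1, pass2)))  # ['PolicyNumber']
```

Fields where both passes agree are accepted without further handling, so the cost of multipass verification is concentrated on genuinely ambiguous data.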
Web-based verification options
Taskmaster includes several different user interfaces that have different design features. In the insurance company scenario, we choose the user interface or combination of user interfaces that meet our requirements. For information about the detailed features of each interface, see the Taskmaster Application Development Guide, SC19-3251.
Table 5-1 summarizes the key features.
Table 5-1 Features of various user interfaces (Task ID: design features)
prelayout.aspx: Verify recognition, custom panels, batch restructuring, and two pass
averify.aspx: Verify recognition, custom panels, and line item details
imgEnter.aspx: Verify recognition and overlay data entry fields on the image
aindex.aspx: Key-from-image, manual page identification, manual registration, two pass, and double blind
ProtoID.aspx: Manual page identification and fixup with thumbnails
Fixup tasks
Fixup activities adjust and enhance pages, move pages within the batch, reconstitute documents, and reorganize the batch.
If a workstation has a scanner attached, the fixup can also rescan. Rescan is a physical process that scans one or more pages and replaces existing image files with the new files. One constraint for rescanning is that the workstation where rescanning occurs must have access to the physical documents.
In centralized operations, scanned documents are often stored in boxes on-site for a short time when rescanning occurs. In decentralized environments, batches must be routed to the person who scanned the document.
Many customers find it more efficient to rescan an entire batch rather than to pull out an individual document and rescan the individual page. This preference depends on the application.
Sometimes batches are sent to a fixup task to delete documents from a batch. For example, you might need to send the original document back to the sender if it is not a valid document. In this case, an operator must pull the original paper document from the physical batch and send it (perhaps by mail) to the originator. The images can be flagged for deletion from the batch.
Export
Taskmaster supports data and document export.
Data export
Taskmaster can export data to a text file, an XML file, or a database. The choice depends on the interfaces that are available in the target application. All three methods are equally easy to configure, and you can export multiple formats for the same batch.
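As a sketch of the XML option, a document's fields can be serialized with the Python standard library; the element and field names are illustrative assumptions, not a Taskmaster export schema.

```python
import xml.etree.ElementTree as ET

# Serialize one document's captured fields to an XML string.
def export_document_xml(doc_type, fields):
    doc = ET.Element("document", attrib={"type": doc_type})
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", attrib={"name": name})
        field.text = value
    return ET.tostring(doc, encoding="unicode")

xml_out = export_document_xml(
    "Invoice", {"InvoiceNumber": "INV-004521", "InvoiceTotal": "160.00"})
print(xml_out)
```

The same field dictionary could just as easily feed a delimited text file or a database insert, which is why the export format is a late, low-cost design decision.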
Document export
Document export creates documents from the scanned pages and stores the documents in one or more systems. The Production Imaging Edition component for storing documents is FileNet Content Manager.
In addition, the Taskmaster software can export to a file system, other IBM Enterprise Content Management (ECM) systems, third-party ECM systems, and collaborative systems, including Microsoft SharePoint. When a batch is exported, the destination for each document is determined at the document level. Therefore, documents in the same batch can be stored in multiple systems.
When designing the system, you must determine the output system, the file format, and the document properties of the exported documents.
Considerations for exporting file formats
The primary output options are TIFF, PDF, and PDF/A. Documents are processed as individual TIFF image pages, but they can be converted to different formats for export. For example, scanned or faxed images in TIFF format can be converted to PDF/A for export.
In addition, you might consider the following export options:
Documents can be exported in their original format. For example, when a customer sends a Word document by email, the document is processed by using the image functions as TIFF pages. When the export is done, the original Word document can be exported.
Documents can be exported in multiple formats. In the same example, the Word document can also be rendered in PDF/A format and stored for archival purposes. It can also be stored in the original Word format.
Images can be redacted, where specific areas of the image are erased or obscured. Redaction can be used to cover ID numbers, credit card numbers, or other protected information.
Additional capabilities are available for conversion to other formats, including Rich Text Format (RTF), HTML, Excel, Word, and various text formats. These formats are used for specialized applications and are secondary options for imaging applications.
Considerations for exporting document properties
When documents are stored in the repository, you store the document and catalog the document by using a small set of document properties, such as the claim number, policy number, customer name, or invoice number. Now you use the data collected throughout the capture process on the batches, documents, and pages. You store selected data fields with the documents into the FileNet Content Manager repository.
In FileNet Content Manager, each document belongs to a document class where the document class specifies a list of document properties. These classes are mapped to corresponding Taskmaster document types, in a one-to-one or one-to-many relationship. The properties of the document class map to Taskmaster fields and variables on the batches, documents, or pages.
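The class-to-field mapping can be pictured as a simple lookup table; the Taskmaster field names and FileNet property names below are illustrative assumptions, not product defaults.

```python
# Map Taskmaster field names (left) to FileNet Content Manager document
# class properties (right). All names here are invented for illustration.
CLASS_MAPPINGS = {
    "Invoice": {
        "InvoiceNumber": "InvNumber",
        "InvoiceDate": "InvDate",
        "VendorName": "Vendor",
    },
}

def map_properties(doc_type, taskmaster_fields):
    mapping = CLASS_MAPPINGS.get(doc_type, {})
    return {p8_name: taskmaster_fields[tm_name]
            for tm_name, p8_name in mapping.items()
            if tm_name in taskmaster_fields}

props = map_properties("Invoice",
                       {"InvoiceNumber": "INV-004521", "VendorName": "ACME"})
print(props)  # {'InvNumber': 'INV-004521', 'Vendor': 'ACME'}
```

Fields that were not captured are simply omitted, so optional properties in the document class do not block export.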
A unique feature of Taskmaster is the ability to update the properties of an existing document in the FileNet system, not just the document that you are exporting. For example, suppose the document that you are exporting is a change-of-address request, and a field contains an updated postal code. In this case, you can update the postal code on other documents that are already stored in the FileNet system.
You can also use this update feature to implement an early export scenario. In this case, documents are exported before the entire recognition and verification workflow job is completed. With this scenario, documents are exported and stored in the document management system even though all the steps of the workflow have not yet completed. In a later workflow task, additional data can update the document properties.
Exporting to the FileNet Business Process Manager process
After a document is stored in the FileNet Content Manager repository, both the document and the data are available to a FileNet Business Process Manager (BPM) process. The act of adding the document triggers an event to FileNet BPM, initiating a new BPM process or updating a currently running BPM process. Any data in the document properties, and the document itself, can be included in the BPM process.
For existing BPM processes whose steps include data entry, consider shifting the data entry function to the capture system. With the data recognition and extraction capabilities of Taskmaster, these functions can be implemented so that they require less manual typing of the data.
5.2.5 Discovering the capture process
Every organization has a different starting point with different requirements. Some organizations already have implemented a capture process and are updating their system to take advantage of advanced capabilities. Others are implementing capture for the first time or are using Taskmaster for the first time on a new application. Some want to scan documents and store them in an ECM repository. However, others want to extract data from documents for updating business systems. Yet others want to focus on classifying documents and reducing manual paper document preparation. In any case, understanding the capture process is a key step in the design process.
As a starting point, in the insurance scenario, we must define the business requirements through collaboration with the various stakeholders. The process of gathering required information about a business problem is called a walkthrough. In a walkthrough, you learn about the current sources and methods of handling documents, and examine the documents. You learn the characteristics of the documents, how they need to be processed and stored, which fields need to be captured, and what to do with the data after you capture it.
In addition to the mechanics of configuring the capture application, you must consider the physical aspects of the paper handling tasks, such as whether document types are mixed together or presorted. If they are presorted, you might be able to simplify implementation by processing each type independently. If you process mixed batches, you can automate and reduce the amount of manual sorting using the orchestrated classification techniques.
If data or index values are manually typed, you can examine the characteristics of the data printed or written on the documents and determine whether the data can be extracted by the software. Alternatively, you can propose a redesign of the document layout or forms so that the data can be captured more effectively.
Although the goal is to create a fully automated system, at certain points, manual intervention is required. The business requirements must specify how to determine if the information is accurate and what you will do if a problem arises.
Because this Redbooks publication is an introductory guide, it does not provide a detailed methodology for determining business requirements. Instead, it provides guidance about the key information that you need to gather and review.
5.3 Requirements gathering
This section lists the questions that help to identify the application requirements and the relevant details to design the capture system. The information that you discover includes the following categories:
Current capture or document processing environment
Physical locations that receive and process documents
Types of documents, their characteristics, and the data they contain
Business rules that validate whether the data is valid
Document volumes and time constraints
Business requirements for dealing with exceptions
Output requirements for data and documents
Scanner requirements
Hardware and software requirements
5.3.1 Requirements for current capture or document processing environment
With this category, you discover the characteristics and details of the business processes and systems that are currently in place. You look for the specific tasks that are performed, the sequence of those tasks, and the overall time it takes from document arrival through completion.
Scanning
If the current process involves scanning documents, you must identify the current systems and methods. Consider redesigning the existing processes in light of the capabilities of the Production Imaging Edition system, and evaluate the potential reuse of existing processes, equipment, and systems.
Identify the scanning requirements:
Are paper documents currently being scanned?
At what point in the business process are they scanned: upon arrival, in the middle of the process, at the end of the process, or a mixture?
What equipment and software are being used to scan the documents?
Will the current equipment be replaced, or will it be used with the new system?
Can the scanners handle the projected peak volumes based on comparing the scanner specifications to the scan volume?
Will the scanner handle de-skewing and noise removal?
For each location, will scanning be done by using thick-client ISIS or thin-client TWAIN scanner drivers?
The preferred practice is to test a specific scanner interface, driver, and scan hardware in a test environment.
What happens to the paper documents after they are processed? Are they stored on-site, returned, stored off-site, or destroyed?
Will the new system change the way paper documents are handled after they are scanned?
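As a rough way to answer the peak-volume question above, compare the scanner's rated speed against the projected daily volume. The figures below are illustrative assumptions, not vendor specifications; in practice, sustained throughput is well below the rated speed because of document preparation, jams, and batch handling.

```python
# Rough capacity check: can one scanner handle the projected peak volume?
rated_ppm = 90           # scanner rated pages per minute (assumed)
shift_hours = 7          # productive hours per shift (assumed)
utilization = 0.6        # realistic fraction of rated throughput (assumed)
peak_pages_per_day = 20000

capacity = rated_ppm * 60 * shift_hours * utilization
print(int(capacity), capacity >= peak_pages_per_day)  # 22680 True
```

If the check fails, the options are more scanners, more shifts, or distributing scanning across locations, which feeds directly into the location requirements in 5.3.2.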
Processing
Processing paper documents is a labor-intensive operation. By explicitly documenting the current processes, you can identify the specific areas of process improvement.
Identify the processing requirements:
How many people “touch” the document from arrival to completion, and in which departments or locations do these people work?
What is the current document handling process?
Are documents processed centrally or at remote locations?
How many people are involved in processing documents?
Which processing is currently being performed?
 – Receiving documents, logging, counting, batching, and date-stamping
 – Sorting documents for filing and distribution
 – Preparing file folders
 – Filing documents
 – Distributing files or documents for processing
 – Photocopying for distribution
 – Manual typing of data
 – Retrieving files from file cabinets
 – Searching through files to find documents
 – Matching documents against exceptions reports
 – Refiling documents and files
 – Pending or suspense file management
 – Keeping calendars or diaries to track follow-up documents
 – Searching for misplaced or lost files
 – Reconstructing lost files
 – Purging files and removing selected documents for disposition
 – Transporting documents to and from storage rooms or off-site storage
 – Filing internal forms or copies of correspondence
Policies and systems currently in place
Identify the policies and systems that are currently in place:
Has our organization approved the destruction of original paper documents following scanning?
What systems are used for tracking and inventory of paper documents and files?
What ECM or other systems are currently involved in the current scanning or capture operation?
Time frames
Identify the requirements regarding time constraints:
How long does it take for a document to be processed from arrival to completion?
Are there significant differences in time depending on document type? If yes, identify the differences.
What steps in the process take longer than desired?
5.3.2 Processing location requirements
This section identifies where documents originate and how to gather them for processing.
Physical documents
Identify the requirements for physical documents:
How many physical locations create or receive physical documents?
Are the physical documents processed in the location where they are received, or are they moved to a central location for processing?
 – How are they moved: by mail, internal courier, or external courier?
 – Are photo or scanned copies made before they are moved?
Electronic documents
Identify the requirements for electronic documents:
How many physical locations create or receive electronic documents?
Are the electronic documents processed in the location where they are received, or are they moved to a central location for processing?
 – How are they moved: by email, electronic media, file copying, or file transfer?
 – Are copies made before they are moved?
5.3.3 Document type requirements
The questions in this section help to identify the document types, how they are created, and their characteristics. You must identify and gather single-page and multiple-page samples of all document types.
Identify the requirements for document type:
What are the document types, and any subtypes, that we process? Consider the following examples:
 – Packing slips for complete, partial, back ordered shipments
 – Invoices, including purchase order invoices, non-purchase order invoices, preapproved invoices, trade invoices, non-trade invoices, and credit memos
 – Attachments, including shipping confirmation notices and acknowledgement of receipt forms
 – Loan applications, including the application form type by form number
 – Insurance claims, such as the claim form by form number
 – Tax forms, including the form number and year
Who creates the documents?
Can the design of the documents be changed if necessary to increase recognition accuracy?
If documents are created by external parties, approximately how many sources are involved?
What is the input source for each type of document: scanner, fax, email, or other systems?
For each type of document, does it have a fixed number of pages or a variable number of pages?
What is the number of pages per document?
For images, what is the image resolution and format (black and white, color, gray scale)?
What is the input file format for electronic documents?
Do documents contain more than one business transaction?
Do people stamp, mark up, or write on documents as they are processed?
5.3.4 Captured data requirements
Whether you use recognition technology or manually type the data, you must identify the characteristics and processing details of the data on your documents. With this information, you can determine the data recognition requirements and other aspects of handling the data, including validations, lookups, verification, indexing, and data entry.
Identify the requirements for captured data:
What fields should be manually entered at the batch level (for example, Scan Date, Expected Number of Documents, or Expected Number of Pages)?
What fields should be captured at the document level (for example, Invoice Number, Invoice Date, or Invoice Total)?
What fields should be captured at the line item detail level (for example, Item ID, Item Quantity, or Item Price)?
For each document type, is data primarily machine printed or hand printed?
For hand printed documents, is the print constrained or unconstrained?
Are there pages that do not have data that must be recognized, such as attachment pages? It is common for forms to have instruction pages that are scanned but that do not have data on them.
How is data located on the pages where you need to use recognition to read the data?
 – Fixed form layout. Fields are on specific zones where the location can be used to find the data.
 – Variable form layout. Fields have text labels where a search for the text label can locate the field.
 – Data is contained in a barcode.
Is data validated by using an external database?
What are the business rules for validating the values of the fields?
Do fields have lists of valid values?
Is the data optional or required?
Does the data printed on the page conform to a repeatable pattern? (For example, Credit Memo Number starts with the letters CR followed by six numerics, a hyphen, and three numerics.)
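The Credit Memo Number example above maps directly to a regular expression; this sketch assumes Python's `re` module rather than any Taskmaster-specific syntax.

```python
import re

# Pattern for the Credit Memo Number example: the letters CR followed by
# six digits, a hyphen, and three digits.
CREDIT_MEMO = re.compile(r"^CR\d{6}-\d{3}$")

print(bool(CREDIT_MEMO.match("CR104428-017")))  # True
print(bool(CREDIT_MEMO.match("CM104428-017")))  # False: wrong prefix
print(bool(CREDIT_MEMO.match("CR1044-017")))    # False: too few digits
```

Documenting each repeatable pattern this precisely during requirements gathering pays off later, because the same pattern drives both recognition filtering and field validation.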
5.3.5 Verification requirements
Verification is where users interact with the documents. You must understand where these users are located and what tasks they are authorized to perform on each type of document. Business rules need to be applied that might mirror existing practices for handling paper-based data entry. Verification might also be desired as a quality control step to ensure that every image is readable.
Identify the requirements for verification:
Will verification be handled in a central location or from remote locations?
Are there business rules or policies that will require multiple verification steps?
Who will perform verification?
Does verification need to restrict access to specific document types by different groups of users?
Do we need to display every document or page, or can we display only documents or pages where we have exceptions?
Will some documents require manual page identification by an operator?
Based on the information gathered on input documents, captured data, and export requirements, how should low confidence data, invalid data, unidentified documents, and incorrectly identified pages be handled?
When recognition results are high confidence, do you want an operator to view the document anyway?
Do operators need to visit all fields with low confidence characters?
Under which circumstances can the operator split out a document from the batch to finish processing the other valid documents in the batch? How should the split-out documents be handled?
Should operators be able to mark documents for deletion (deleted documents are not exported)?
Should deletion trigger a follow-up process or automatic notification?
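The answers to these questions usually reduce to a per-document routing decision: does the document need to be seen by a verify operator at all? The following is an illustrative sketch only; the threshold value and the field-status structure are assumptions, not product behavior:

```python
def needs_verification(fields, confidence_threshold=0.90):
    """Route to an operator if any field is low confidence or failed validation."""
    return any(
        f["confidence"] < confidence_threshold or not f["valid"]
        for f in fields
    )

# Hypothetical recognition results for one document.
doc_fields = [
    {"name": "IncidentNumber", "confidence": 0.99, "valid": True},
    {"name": "SignedDate", "confidence": 0.72, "valid": True},  # low confidence
]
print(needs_verification(doc_fields))  # True: SignedDate falls below the threshold
```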
5.3.6 Export requirements
Identify the format, content, and target system or systems of the data and images for export:
What is the document format for the exported documents for each type of document: TIFF, PDF, PDF with text, PDF/A, original input format, or other?
Is the original image or the enhanced image used for export?
Are color and gray scale images to be exported as color and gray scale?
Is a specific file naming convention needed for the exported document?
What are the document properties of the exported documents?
Do the images have areas that need to be redacted?
What are the target application systems for the exported data?
What are the interfaces that are available in the target systems for ingesting the data?
What data fields are exported to target application systems?
Does the data need to be reformatted to accommodate the needs of the target application?
5.3.7 Volume and timing requirements
You must appropriately size, install, and configure the various Taskmaster components (remote and local clients, background processing, the Fingerprint Service, and the Fingerprint Maintenance Tool). To assist with this task, create a matrix based on the following information:
Sources of input
Input volumes for each source
Approximate number of unique documents (number of fingerprints)
Peak periods
Timing (processing windows) requirements
Identify the volume and time requirements:
What are the input sources: scanner, fax, email, or other systems?
How many document or image files are processed per day from each source?
How many documents per document type are processed per day?
For highly variable documents, such as invoices, how many different document formats are processed?
What is the peak volume of documents and image files? Are there peak processing cycles daily, weekly, monthly, or annually?
Are there peak volume requirements per day, or are there specific service level agreements about how quickly documents will be processed?
Do existing paper files need to be scanned (backfile)? Quantify the volume and time frame for digitizing. Are the processing requirements different for historical documents compared to new documents?
Is the ability to prioritize batches or to change the sequence in which the batches are processed required?
What are the availability requirements for the system?
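To make the volume questions concrete, the answers feed a simple throughput calculation for sizing the capture components. The numbers below are hypothetical placeholders, not figures from the scenario:

```python
# Hypothetical inputs gathered from the volume and timing questions above.
pages_per_day = 12_000          # total pages from all input sources
peak_factor = 1.5               # peak-day volume relative to an average day
processing_window_hours = 8     # window in which all pages must be processed

# Required sustained throughput during the processing window on a peak day.
required_pages_per_hour = pages_per_day * peak_factor / processing_window_hours
print(f"Required throughput: {required_pages_per_hour:.0f} pages/hour")
```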
5.3.8 Administration requirements
Identify the administration-related requirements:
What are the production reporting requirements? Compare Taskmaster standard reports to determine if custom report formats are needed.
What information needs to display for job monitoring? Compare Taskmaster monitoring views to determine if additional fields are required.
What is the organization model for administering the system?
What are the security requirements for authentication?
How does the organizational security model map to the security requirements for the documents and capture functions?
5.4 Designing the capture for the auto claims scenario
In this section, we apply the design elements to the scenario that was introduced in Chapter 4, “Solution example” on page 123. We produce a high-level design that we implement in the following chapters. We describe the document hierarchy and the capture processing tasks that we implement.
5.4.1 Document hierarchy
The design document lists the expected document types and page types that we process in our application. We need to note any special document structure requirements. For our application, we considered three potential document types: claim, estimate, and invoice documents.
Claim document
A claim document is a form that is generated based on a conversation with the customer over the phone or through an online form. It documents information about the customer and the incident. The customer must sign the form and return it to the main office, by fax, scan, or email.
The format of the claim document is fixed. We control the layout of the document and have designed a single page for the customer to sign and date. We print a barcode on the claim page so that we can recognize the type of page and record the claim number.
Claim pages and claim attachment page
The claim document contains a claim page. It is the first page of a claim document, and each document has only one claim page.
Although we produce only a single-page claim form, we must be prepared to handle more pages. Customers often attach pages with additional explanations to the claim form. To accommodate this situation, we added a second page type: the claim attachment page. When we process a claim document, we expect either a single claim page or a claim page followed by one or more attachment pages. The claim page is always the first page of the document.
Estimate and invoice document
When we examined the estimate and invoice documents, we discovered that these two documents were identical. They were both produced by the same system. The only significant difference is that the estimate document has a title indicating that it is an estimate and the invoice document has a title indicating that it is an invoice.
As a result, we decide to define two types of documents within the Taskmaster system: claim and invoice-estimate. When we export, we store estimates and invoices in separate document classes in FileNet Content Manager.
Estimates
When a repair shop evaluates the damage to a vehicle, it prepares an estimate of the repair costs. The repair shop creates this document, and its format is subject to change without notice. We have no control over the layout or contents. When a repair shop prepares an estimate, it is printed from their system in the format that they choose, which can be a multipage document. However, each document contains only one estimate.
We can also have our own adjusters produce repair estimates. However, for this scenario, we want a more difficult situation where the document format is beyond our control so that we can show how to handle the highly variable format.
Invoices
After a repair shop completes the repairs to a vehicle, it prepares an invoice that details the actual cost of the repair. As with the estimate, the repair shop generates this document, and its format is subject to change without notice. We have no control over the form layout or contents. This document can be a multipage document. However, each document contains only one invoice.
Estimate and invoice pages: Main page and trailing pages
The estimate and invoice documents have two types of pages: a main page and trailing pages. The main page is the first page, and each document has only one main page. Each estimate or invoice can have many trailing pages. All of the pages contain data that we could extract if we wanted the detail line items. In this example, we extract only the header data.
Fields
Data fields are defined for each page. A data field is an object that is attached to a page. A particular data field can be used on several pages, but only once per page. When you attach rules to a field-level object and then use that field on an additional page, the field on the new page immediately inherits all pre-existing rules.
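The hierarchy described so far, document types containing page types and page types containing fields, can be pictured as nested data. The following sketch uses our own Python notation; the page-type names are illustrative, and this is not a Taskmaster setup file:

```python
# Illustrative representation of the document hierarchy (not a Taskmaster artifact).
document_hierarchy = {
    "Claim": {
        "Claim_Page": ["IncidentNumber", "PolicyNumber", "InsuredName", "SignedDate"],
        "Claim_Attachment": [],  # attachment pages carry no data fields
    },
    "Invoice_Estimate": {
        "Main_Page": ["Doc_Title", "Pol_Number", "Claim_Number",
                      "Ven_Number", "Ref_ID", "Ref_Date", "Ref_Total"],
        "Trailing_Page": [],  # detail line items could be extracted here if needed
    },
}

# A field object can be attached to several pages, but only once per page.
for doc_type, pages in document_hierarchy.items():
    for page_type, fields in pages.items():
        assert len(fields) == len(set(fields)), f"duplicate field on {page_type}"
```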
Table 5-2 lists the data fields that are expected on the claim pages.
Table 5-2 Claim page fields
Field: Incident Number
Taskmaster field name: IncidentNumber
Description: A number assigned by our company to identify this specific damage claim
Page source: We encode this number in a barcode and print it on the page.
Populated from: Barcode

Field: Policy Number
Taskmaster field name: PolicyNumber
Description: The insurance policy or contract number of the customer
Page source: We preprint this information on the page.
Populated from: Database lookup

Field: Policy Effective Date
Taskmaster field name: PolicyEffectiveDate
Description: The date when the insurance coverage begins for this policy
Page source: We preprint this information on the page.
Populated from: Database lookup

Field: Policy Expiration Date
Taskmaster field name: PolicyExpiryDate
Description: The date when the insurance coverage ends for this policy
Page source: We preprint this information on the page.
Populated from: Database lookup

Field: Insured Name
Taskmaster field name: InsuredName
Description: The name of the person who has insurance coverage from this policy
Page source: We preprint this information on the page.
Populated from: Database lookup

Field: Chassis Number VIN
Taskmaster field name: ChasisNumber_VIN
Description: A number that uniquely identifies the vehicle and is assigned by the vehicle manufacturer
Page source: We preprint this information on the page.
Populated from: Database lookup

Field: Injury
Taskmaster field name: Injury
Description: Indicates that an injury was associated with the incident
Page source: If an injury has occurred, the insured selects the appropriate check box on the form.
Populated from: OMR

Field: Vehicle Make
Taskmaster field name: Veh_Make
Description: The make (manufacturer) of the vehicle
Page source: We preprint this information on the page.
Populated from: Database lookup

Field: Vehicle Model
Taskmaster field name: Veh_Model
Description: The model name of the vehicle
Page source: We preprint this information on the page.
Populated from: Database lookup

Field: Vehicle Color
Taskmaster field name: Veh_Color
Description: The color of the vehicle
Page source: We preprint this information on the page.
Populated from: Database lookup

Field: Insured Signature
Taskmaster field name: InsuredSignature
Description: The handwritten signature from the insured
Page source: The insured writes their signature on the page.
Populated from: OMR

Field: Signed Date
Taskmaster field name: SignedDate
Description: The handwritten date from the insured
Page source: The insured writes the date on the page.
Populated from: ICR
Table 5-3 lists the fields that are expected on the estimate and invoice pages.
Table 5-3 Estimate and invoice page fields
Field: Document Title
Taskmaster field name: Doc_Title
Description: Estimate or Invoice
Page source: Printed on the page
Populated from: OCR page text, searched with a regular expression to determine Estimate or Invoice

Field: Policy Number
Taskmaster field name: Pol_Number
Description: The customer’s insurance policy or contract number
Page source: Not printed on the page
Populated from: Database lookup

Field: Claim Number
Taskmaster field name: Claim_Number
Description: A number assigned by our company to identify this specific damage claim
Page source: Can be printed on the page
Populated from: OCR, or typed if not printed on the page

Field: Vendor Number
Taskmaster field name: Ven_Number
Description: An identification number for the repair shop from our company vendor table
Page source: Not printed on the page
Populated from: Database lookup, matched from the Fingerprint ID

Field: Reference ID
Taskmaster field name: Ref_ID
Description: The estimate or invoice number from the repair shop
Page source: Printed on the page
Populated from: OCR

Field: Reference Date
Taskmaster field name: Ref_Date
Description: The estimate or invoice date
Page source: Printed on the page
Populated from: OCR

Field: Reference Total
Taskmaster field name: Ref_Total
Description: The total amount of the estimate or invoice
Page source: Printed on the page
Populated from: OCR
5.4.2 Capture processing tasks
The application has the following capture processing tasks:
1. Document acquisition
For our project, we implement centralized scanning, centralized fax import, and distributed web-based scanning. Claim forms are mailed or faxed to our company. Faxes are imported from a central file system location in the mailroom, and agents scan by using a web-based scanner.
Email capture is out of scope for our project but might be implemented by using the capabilities of the product.
2. Image enhancement
Images are enhanced by using the image enhancement features of Taskmaster to deskew the images and to drop out horizontal and vertical lines.
3. Page identification
Claim pages are identified with a barcode printed on the form that contains a form identification code.
Claim attachment pages are identified by using rules. A claim attachment page is any page that follows a claim page.
Estimate and invoice pages are identified with a separator page containing a barcode. Alternatively, they are received to a specific fax number that is dedicated to receiving invoices and estimates from vendors.
4. Document assembly
Documents are assembled by using document integrity rules that are expressed in the document hierarchy.
5. Recognition
Claim data is recognized by using barcode recognition, OCR/S to read machine-printed text, and ICR/C to read handwriting.
Estimate and invoice data is recognized by using OCR/S.
6. Fingerprinting
We use fingerprints to differentiate between multiple formats of the same page type.
The Intellocate actions are used to automatically create new fingerprints when needed. In our use case, estimates and invoices come in from repair shops in many layouts. Each time we see a new layout for an estimate or invoice, we capture it and create a new fingerprint.
7. Locating data
Claim data is located by using zones. Estimate and invoice data is located by using Intellocate.
8. Validation
We perform database lookups against an IBM DB2 database to validate the incident number and policy data from the fields on the claim page.
We validate the vendor information with database lookups to a DB2 database.
We set a threshold for recognition confidence so that low confidence characters and fields are flagged.
9. Routing
Exceptions are routed to a verify operator. Items that require follow-up by customer service or an agent are routed by using the business process workflow for resolution.
10. Verification
Verification occurs at a central location. Low confidence results or validation errors are presented to operators.
11. Export
For our use case, we use a combination of PDF/A and TIFF file export. The claims are stored as TIFF files, and the estimates and invoices are stored as PDF/A documents. We export the original image as it was scanned, before enhancement.
We export the documents to FileNet Content Manager, where corresponding document classes are defined with document properties for the claim, estimate, and invoice documents. There, the documents are stored and processed by the business.
Extracted data is exported to an XML file that is stored in a designated folder on our file server.
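As an illustration of this last step, extracted field data can be serialized to XML with a few lines of code. This is a minimal sketch under our own assumptions: the element names and field values are hypothetical, and the real export format is defined by the Taskmaster export configuration:

```python
import xml.etree.ElementTree as ET

# Hypothetical extracted data for one claim document.
extracted_fields = {
    "IncidentNumber": "IN-2010-004512",
    "PolicyNumber": "POL-778812",
    "InsuredName": "Jane Doe",
    "SignedDate": "2010-06-15",
}

document = ET.Element("document", attrib={"type": "Claim"})
for name, value in extracted_fields.items():
    field = ET.SubElement(document, "field", attrib={"name": name})
    field.text = value

# In production, this would be written to the designated export folder.
xml_bytes = ET.tostring(document, encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```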
To implement the solution, follow the step-by-step instructions in the chapters that follow.
 