Adding a document type to an existing application
This chapter explains how to add a document to an existing Datacap application. It also explains how to validate the fields within the document, route pages, create a Verify panel, and export the results.
This chapter includes the following sections, which outline the process for adding documents, specifically the Estimate and Invoice documents, to an existing application:
7.1 Adding VScan to a rule set
To set up the VScan task, complete these steps:
1. On the Document hierarchy tab, expand the application. In this example, we expand PIE_RBooks.
2. In the Rulesets tab, click the VScan rule set.
3. Click the padlock icon.
4. Expand the VScan rule set.
5. Set the directory and the number of image files to process:
a. Set the maximum number of image files to 1.
The SetMaxImageFiles() action is used here. Setting the value to 1 gives you one image for each batch, which is the setting to use when documents come from a fax. If the documents are scanned from a physical scanner, batches of 50 or 100 image files are preferred. The right setting depends on the source of the documents.
b. Add the VScan SetMultiPageTiff from the Actions library. The scan action must be last in the rule set.
c. Click Save.
6. Click Publish to save the rule set.
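The effect of the maximum image files setting can be pictured as simple chunking of the incoming files. The following Python sketch is only an illustration of that idea, not Datacap code; the function name split_into_batches and the file names are hypothetical.

```python
from typing import List


def split_into_batches(image_files: List[str], max_image_files: int) -> List[List[str]]:
    """Group image files into batches of at most max_image_files each."""
    if max_image_files < 1:
        raise ValueError("max_image_files must be at least 1")
    return [
        image_files[i:i + max_image_files]
        for i in range(0, len(image_files), max_image_files)
    ]


# Fax input: a limit of 1 yields one image per batch.
faxes = ["fax1.tif", "fax2.tif", "fax3.tif"]
print(split_into_batches(faxes, 1))
```

With a limit of 50, the same function would group 120 scanned images into three batches of 50, 50, and 20 files.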
7.2 Adding a document with pages and fields
Now you can add a document to an existing project. This task includes adding the potential page types that this document might have and the fields that you might want to capture for each page type if applicable.
For the example in this book, we create a document, called Estimate_Invoice, with the following page type and field information:
Document name: Estimate_Invoice
Potential page types:
 – Main_Page
 – Trailing_Page
Potential fields to be captured in the Main_Page
 – Doc_Title
 – Pol_Number
 – Client_Number
 – Van_Number
 – Van_Name
 – Ref_ID
 – Ref_Date
 – Ref_Total
Potential fields to be captured in the Trailing_Page: None
To add the new document to the existing project, complete these steps:
1. Open Datacap Studio.
2. Click the Lock DCO for Editing icon.
3. Right-click Batch Level, and select Add → Document.
A new document, labeled Document 1, is added.
4. Rename the document to Estimate_Invoice.
5. Add the potential pages to this document, which contains two potential types of pages:
a. Right-click the document, and then select Add Multiple.
b. Select Pages, and then enter 2.
Two new pages, labeled Page1 and Page2, are added.
c. Rename Page1 to Main_Page, and then rename Page2 to Trailing_Page.
6. Add fields that can be captured by the page. Complete the following steps for Main_Page:
a. Right-click Main_Page, and then select Add Multiple.
b. Select Fields, and enter 8, which specifies that potentially eight fields must be captured for this page type.
c. Rename the fields to Doc_Title, Pol_Number, Client_Number, Van_Number, Van_Name, Ref_ID, Ref_Date, and Ref_Total (Figure 7-1).
7. Click Save.
8. Click the Unlock DCO for Editing icon.
Figure 7-1 Document hierarchy tab
7.3 Setting the PageID logic
In Datacap Studio, set the PageID logic:
1. On the Rulesets tab, expand the PageID rule set for PIE_RBooks.
2. Add the rrCompare() action to Page from Fax Job in the PageID rule set.
3. Configure the rrCompare() action to check if the value FAX is found at the variable level job type:
a. Highlight the rrCompare() action that you just added.
b. On the Properties tab, for string object1, enter FAX. Then for string object2, enter @B.JobType.
4. Add an action, SetDCOType(), that will be triggered if the rrCompare() action is true:
a. Add the SetDCOType() action to the Page from Fax Job.
b. On the Properties tab, for the StrParam string, enter Main_Page.
5. Add a function, Function 1, to the PageID rule set:
a. Right-click PageID, and then select Add New Function. Function 1 is added to the bottom of the list.
b. Click the up arrow to move the new Function 1 to the top.
c. Add the ChkLastDCOType() action from the Actions library tab to Function1.
d. On the Properties tab, for the StrParam string, enter Main_Page.
e. Add the SetDCOType() action from Actions library tab to Function1. This action runs after the ChkLastDCOType() action.
f. On the Properties tab, for the StrParam string, enter Trailing_Page, which sets the current page type to Trailing_Page.
6. Add another function, Function 2, to the PageID rule set:
a. Right-click PageID, and select Add New Function. Function 2 is added to the bottom of the list.
b. Click the up arrow to move the new Function 2 under Function1.
c. Copy and paste the actions from Function1 to Function 2.
d. Change the parameter for the newly copied ChkLastDCOType() action in Function 2 to Trailing_Page.
7. Rename Function 1 to Looks for Fax Trailing.
8. Rename Function 2 to Looks for additional Fax Trailing.
9. Save and publish this rule set.
The functions added to the PageID rule set are tried in sequence. For example, suppose that a page comes in from a fax machine. Initially, the page type is Other, which causes the first two functions, Looks for Fax Trailing and Looks for additional Fax Trailing, to fail. The Page from Fax Job function then runs because the page came from the fax, and it sets the page type to Main_Page. If additional pages from the fax immediately follow the first page, the Looks for Fax Trailing function checks whether the previous page was the main fax page (with the page type of Main_Page).
If the previous page is of the Main_Page page type, the function sets the current page type to Trailing_Page. If more pages follow, the Looks for additional Fax Trailing function is triggered because the previous page had the Trailing_Page type. Every additional page that comes in from this fax is therefore labeled as a Trailing_Page.
If a page comes in from a scan instead of a fax, the Looks for Fax Trailing, Looks for additional Fax Trailing, and Page from Fax Job functions all fail. Processing then falls through to any other functions that are defined in this rule set for other sources.
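The sequential evaluation described above can be sketched in Python. This is a simplified model under stated assumptions, not Datacap code: the function names identify_page and identify_batch are hypothetical, the job type is passed in as a plain string, and a page that no function claims keeps the type Other.

```python
from typing import List, Optional


def identify_page(job_type: str, last_page_type: Optional[str]) -> str:
    """Try the PageID functions in order; the first one that succeeds wins."""
    # Looks for Fax Trailing: the previous page was the main page.
    if last_page_type == "Main_Page":
        return "Trailing_Page"
    # Looks for additional Fax Trailing: the previous page was a trailing page.
    if last_page_type == "Trailing_Page":
        return "Trailing_Page"
    # Page from Fax Job: compare the batch job type with FAX.
    if job_type == "FAX":
        return "Main_Page"
    # No function matched; the page keeps its initial type.
    return "Other"


def identify_batch(job_type: str, page_count: int) -> List[str]:
    """Assign a page type to each page of a batch, in page order."""
    page_types: List[str] = []
    last: Optional[str] = None
    for _ in range(page_count):
        last = identify_page(job_type, last)
        page_types.append(last)
    return page_types


print(identify_batch("FAX", 3))
```

For a three-page fax, the first page becomes Main_Page and the two pages after it become Trailing_Page. For a scan job, every page stays Other in this sketch.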
7.4 Adding the Image Enhance feature to the pages
After creating page types for your document, you can configure Datacap to clean up the images of these page types by adding the Image Enhance feature to them.
On the Rulesets tab, look at the Enhance_Image settings in ImageFix. These settings are set at the batch level. We must add this rule to the newly created page types, which are the Main_Page and Trailing_Page pages in this example.
To add the Image Enhance feature to the page types, complete these steps:
1. On the Document hierarchy tab, click the Lock DCO for Editing icon.
2. Navigate to the object that you are changing, which is Main_Page in this example.
3. Verify that the Enhance_Image rule set is highlighted, and then click Add to DCO.
4. Repeat steps 2 and 3 for the Trailing_Page page type.
Now when a page of either the Main_Page or Trailing_Page type is processed, the image enhancement runs on it.
7.5 Configuring the CreateDocs rule set for the pages
Next we configure the CreateDocs rule set. For this rule set, we do not need to change any of its components. The CreateDocs rule set contains two rules: the Create Docs rule and the Create Fields rule.
The Create Docs rule is part of the batch-level object. This rule contains a create document action, which examines all the pages in the batch. When it detects a Claim_Pg page, for example, it sets the start of a new document.
The Create Fields rule triggers at each page that has fields attached. The rule creates the fields in the Claim_Pg page.
We add the Create Field rule to the Main_Page under the Estimate_Invoice document by using the following steps:
1. On the Document hierarchy tab, highlight the Main_Page object.
2. On the Rulesets tab, highlight the rule Create Field.
3. Click Add to DCO to add the rule to the page.
4. Right-click Main_Page, and then select Manage Variables.
5. Under Object General information, complete these steps:
a. Change Max value to 1.
b. Change Min value to 1.
c. Change Order to 1.
d. Click Done.
These settings dictate that every Estimate_Invoice document has exactly one main page (Main_Page page type): a minimum of one and a maximum of one.
For the trailing pages (pages of type Trailing_Page), we keep the default settings of 0 for Min, Max, and Order because a document can have multiple trailing pages or no trailing pages at all.
6. Save and unlock the Document Hierarchy (DCO).
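The Min and Max settings can be read as a simple count check on each document. The Python sketch below illustrates that reading only; it is not Datacap code. The name check_document is hypothetical, and treating a Max value of 0 as "no upper limit" is our assumption, based on the statement above that trailing pages can occur any number of times.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (Min, Max) per page type. Assumption: a Max of 0 means no upper limit.
RULES: Dict[str, Tuple[int, int]] = {
    "Main_Page": (1, 1),
    "Trailing_Page": (0, 0),
}


def check_document(page_types: List[str]) -> List[str]:
    """Return a list of problems found in the page counts of one document."""
    counts = Counter(page_types)
    problems: List[str] = []
    for page_type, (minimum, maximum) in RULES.items():
        found = counts[page_type]
        if found < minimum:
            problems.append(f"{page_type}: expected at least {minimum}, found {found}")
        if maximum > 0 and found > maximum:
            problems.append(f"{page_type}: expected at most {maximum}, found {found}")
    return problems


print(check_document(["Main_Page", "Trailing_Page", "Trailing_Page"]))
```

A document with one main page and any number of trailing pages passes. A document with no main page, or with two main pages, is reported as a problem.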
7.6 Adding a full page OCR rule set to the pages
We have now completed the setup for the new Estimate_Invoice document with its fields and page structures. We have added the PageID, ImageFix, and CreateDocs rule sets. With the pages assembled into a document structure, we can now start the recognition process (recognize action).
For the recognize action, we have the Recognize Page rule set, which has the RecognizePageOCR() action. This rule can be used for the Main_Page that we created.
To add a full page Optical Character Recognition (OCR) rule set to the pages, complete these steps:
1. Lock the Document hierarchy tab for editing.
2. Select the Main_Page page.
3. On the Rulesets tab, select the RecognizePageOCR() rule, and then click Add to DCO.
4. Select Trailing_Page.
5. Select the RecognizePageOCR() rule, and then click Add to DCO.
Now when this rule set runs, it performs a full page OCR on all the pages and creates a CCO for each page. We can then use the CCO to extract data from the page.
7.7 Setting the fingerprint for the Estimate_Invoice document
The next rule set is the fingerprint rule set, which was previously set. This rule includes Batch Level Fingerprint Settings, which are tied to the batch level. This rule set contains the following basic actions:
SetFingerprintsDir(), which identifies where the fingerprints are stored
SetProblemValue(), which indicates the level of match that we need to have a valid match
SetFingerprintSearchArea(), which indicates the percentage of the page we want to look at
The second rule, Claim Page Fingerprint Matching, is attached to the claim page (Claim_Pg). In this rule, the FindFingerprint() action is set to true, which we switch to false. Because this rule is on the claim page, we do not want the system to create fingerprints by default.
This rule was originally used as part of the claim page, which is a fixed form that we used while setting the main system up for the first time. Now that the system is set up, we change the FindFingerprint() action to false.
For the Estimate_Invoice document, which is a form coming from an outside source, we do not have control of the form. This form is also a live form, in the sense that it might be constantly changing. For the Estimate_Invoice document, we therefore set FindFingerprint() to true. As a result, if a form is new, the system can learn it immediately.
To add the FindFingerprint() rule set, complete these steps:
1. Verify that the Document hierarchy tab is locked for editing.
2. On the Rulesets tab, lock the Find Fingerprint rule set.
3. Right-click the Claim Page Fingerprint Matching rule, and then select Copy.
4. Right-click the Find Fingerprint rule set, and then select Paste to add a copy of the rule.
5. Rename the new rule to Main Page Fingerprinting. This rule is attached to the main page.
6. Highlight the Main_Page, and click Add to DCO.
7. Select the FindFingerprint() action from the Claim Page Fingerprint Matching rule.
8. On the Properties tab, in the parameter section, change the value to False.
9. Save changes and publish the rule set to update the rule set with the new information.
7.8 Obtaining field values in the Claim_Pg document by using the Locate() rule set
The Locate() rule set is used to find the data on the page. On the Claim_Pg, we used a generic action that worked with known, fixed-field locations. The Locate() method that we use here searches for data that is in a fixed location. It also searches for data that might be in different places on the page.
We create a separate rule for every data field on the main page:
7.8.1 Doc Title field
The Document Title field is a drop-down list box that contains the choices of Invoice or Estimate. The document title is used to identify the type of forms we process. To populate the field, use a regular expression search for the word “Estimate.” If the word is found, we set the field value to Estimate. Otherwise, the default is set to Invoice.
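The Doc Title logic amounts to a regular expression search with a default value. A minimal Python sketch, assuming that the page text is available as a plain string and that the match ignores case (both are our assumptions; the function name locate_doc_title is hypothetical):

```python
import re


def locate_doc_title(page_text: str) -> str:
    """Return Estimate if the word appears on the page; otherwise Invoice."""
    if re.search(r"\bestimate\b", page_text, flags=re.IGNORECASE):
        return "Estimate"
    return "Invoice"


print(locate_doc_title("REPAIR ESTIMATE #12"))
print(locate_doc_title("Invoice 42"))
```

Because Invoice is the default, a page on which recognition finds no usable text is also classified as Invoice in this sketch.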
7.8.2 Pol_Number field
The Pol_Number field contains the policy number. This field is special in that it is not populated through OCR or Intelligent Character Recognition (ICR). This field is populated during the Verification phase.
7.8.3 Claim_Number field
The zonal recognition method tries to match the Claim_Number field with a known location from a known fingerprint by using the PopulateZNField() action. If the current image matches with a known fingerprint, it goes to that zone and then extracts the value.
7.8.4 Vendor Number field
To populate the Vendor Number field, we match against a known fingerprint. That value is then used to look for a corresponding value in a database. If this is the first time that the form is seen, or the form does not match any known fingerprint, this field is blank and requires manual interaction during the Verify process. The Vendor Number field requires a valid Vendor Name field value.
7.8.5 Vendor Name field
To populate the Vendor Name field, we match against a known fingerprint. That value is then used to look for a corresponding value in a database. If this is the first time that the form is seen, or the form does not match any known fingerprint, this field is blank and requires manual interaction during the Verify process.
7.8.6 Ref ID field
The zonal recognition method tries to match this field with a known location from a known fingerprint by using the PopulateZNField() action. If the current image matches a known fingerprint, it goes to that zone and then extracts the value.
7.8.7 Ref Date field
The zonal recognition method tries to match this field with a known location from a known fingerprint by using the PopulateZNField() action. If the current image matches a known fingerprint, it goes to that zone and then extracts the value.
7.8.8 Ref Total field
The Ref Total field has a unique aspect to it in that, although it is always part of the document, its position on the document can vary from form to form. On some forms, the total might not be displayed on the first page. To handle this issue, we incorporate two new features into our project, the MergeCCOs_byType() and FindLastKeyList() actions.
We add the MergeCCOs_byType() action at the document level. This action takes the CCO data from the main page (Main_Page) with all of the trailing pages (Trailing_Page) within the document, making one multipage CCO.
We add the FindLastKeyList() action at the field level. This action reads a text file that ends with the .key extension and contains a list of words. Starting at the bottom of the multipage CCO, it looks for a match for the first word in the list. If the first word does not match, it uses the second word, then the third word, and so on.
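The combination of merging pages and searching from the bottom can be sketched as follows. This Python example models the recognized text as a list of lines for each page, which is a simplification of the CCO; the function names merge_pages and find_last_key, the sample text, and the case-insensitive substring match are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def merge_pages(pages: List[List[str]]) -> List[str]:
    """Concatenate the lines of the main page and all trailing pages."""
    return [line for page in pages for line in page]


def find_last_key(lines: List[str], keywords: List[str]) -> Optional[Tuple[int, str]]:
    """Try each keyword in list order, scanning the merged text bottom-up."""
    for keyword in keywords:
        for index in range(len(lines) - 1, -1, -1):
            if keyword.lower() in lines[index].lower():
                return index, keyword
    return None


main_page = ["Invoice 1001", "Subtotal 90.00"]
trailing_page = ["Page 2", "Parts 10.00", "Total Due 100.00"]
merged = merge_pages([main_page, trailing_page])
print(find_last_key(merged, ["Grand Total", "Total Due", "Total"]))
```

Here the first keyword is not found anywhere, so the search moves to the second keyword and finds it on the last line of the trailing page.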
7.9 Looking up vendor information from a database by using the Lookup() rule set
The Lookup rule set retrieves data values from the database. For this example, we want to look up the vendor name and vendor number.
7.9.1 Looking up a vendor name
After a successful fingerprint match, we now know the template ID for the matching fingerprint. To look up the vendor name, we create a rule set that uses the following rules:
1. At the batch open level, the first rule establishes the connection to the fingerprint database.
2. Set on the field, the second rule uses Main_Page.TemplateID to get the matching fingerprint class:
"SELECT Host.hs_RefName FROM Host INNER JOIN Template ON Host.hs_HostID = Template.tp_HostID WHERE (((Template.tp_TemplateID) LIKE '%s'));"
This rule looks up the template ID from the fingerprint database. Then, based on the template ID, it returns the matching hs_RefName (fingerprint class) that the template is stored under. If a new fingerprint was created during the Find Fingerprint rule set, this value is <New>.
3. At the batch close level, the third rule closes the connection.
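Placing the rules this way opens the connection once at batch open and closes it once at batch close, instead of once for each page, which makes more efficient use of the processing sequence.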
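The open, query, and close pattern can be tried outside Datacap with a small Python sketch. It uses an in-memory SQLite database as a stand-in for the fingerprint database. The table and column names come from the SQL statement above, but the column types, the sample rows, and the function names are assumptions made for this illustration.

```python
import sqlite3
from typing import Optional

QUERY = (
    "SELECT Host.hs_RefName FROM Host "
    "INNER JOIN Template ON Host.hs_HostID = Template.tp_HostID "
    "WHERE Template.tp_TemplateID LIKE ?"
)


def open_connection() -> sqlite3.Connection:
    """Batch open: connect once and load made-up sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Host (hs_HostID INTEGER PRIMARY KEY, hs_RefName TEXT);
        CREATE TABLE Template (tp_TemplateID TEXT, tp_HostID INTEGER);
        INSERT INTO Host VALUES (1, 'Acme Auto Body');
        INSERT INTO Template VALUES ('1001', 1);
        """
    )
    return conn


def lookup_vendor_name(conn: sqlite3.Connection, template_id: str) -> Optional[str]:
    """Field level: return the fingerprint class for a template ID, if any."""
    row = conn.execute(QUERY, (template_id,)).fetchone()
    return row[0] if row else None


conn = open_connection()
try:
    # One connection serves every page in the batch.
    names = [lookup_vendor_name(conn, tid) for tid in ["1001", "9999"]]
finally:
    conn.close()  # Batch close: release the connection once.
print(names)
```

The sketch binds the template ID as a query parameter instead of formatting it into the SQL string, which is the usual way to avoid quoting problems in application code.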
7.9.2 Looking up a vendor number
The vendor number lookup is similar to the vendor name lookup. The main differences are the database to which it connects and the SQL statement that it executes. With this rule set, we want to achieve the following goals:
Validate the vendor name with our known acceptable vendors
Retrieve our internal number that we assigned to this particular vendor
Like the vendor name lookup, this rule set has the following rules:
1. A rule at the batch open level that establishes the connection.
2. A rule at the Vendor Number field that executes the SQL statement to retrieve the number based on the Vendor Name field.
3. A rule at the batch close level that closes the connection.
The actual SQL statement might vary based upon the database to which you connect.
7.10 Validating data
Data validation is when you determine whether the data you have captured meets the rules for data integrity as defined in the business requirements. For example, when we established the business rules for the application, we decided to test whether the cost fields are in a valid currency format. A validation failure does not mean that the original page contains invalid data. It might mean that the recognition engine failed to recognize one or more characters correctly. Whatever the reason for the error, you can set the page status to make sure that the page is displayed to an operator for verification.
This section highlights the validation techniques that we use in our use case. We validate against the following objects:
7.10.1 Page Level field
At the page level, we add a rule that contains the StatusPreserveOFF() action. With this action, the page status can be controlled by the fields. When a field fails its validation rules, the field status is set to 1. When this failure happens, it causes the page status to also be set to 1.
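The way field status drives page status can be summarized in a few lines of Python. This is a conceptual sketch only; the function name page_status and the dictionary of field statuses are hypothetical.

```python
from typing import Dict


def page_status(field_statuses: Dict[str, int]) -> int:
    """Return 1 if any field failed validation (status 1); otherwise 0."""
    return 1 if any(status == 1 for status in field_statuses.values()) else 0


print(page_status({"Pol_Number": 0, "Ref_Date": 1, "Ref_Total": 0}))
```

In this example a single failed field, Ref_Date, is enough to mark the whole page with status 1.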
7.10.2 Doc Title field
By default, the Doc Title field is set to Invoice, unless the document is determined to be an Estimate during the Locate phase. No validation is available for this field.
7.10.3 Pol_Number field
The Pol_Number field has a few basic validation rules that we need to follow. First, we use the allowOnlyChars() action with the 1234567890CL- parameter. This action removes any character that is not in the parameter list. Next we check the minimum length and the maximum length by using two actions for our use case: IsFieldLengthMin() with a parameter of 4 and IsFieldLengthMax() with a parameter of 10.
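The three checks can be mirrored in Python to see how they interact. This is a sketch of the logic as described above, not the Datacap actions themselves; the function names allow_only_chars and validate_pol_number are hypothetical.

```python
from typing import Tuple

ALLOWED_CHARS = set("1234567890CL-")


def allow_only_chars(value: str) -> str:
    """Drop every character that is not in the allowed list."""
    return "".join(ch for ch in value if ch in ALLOWED_CHARS)


def validate_pol_number(raw: str, min_length: int = 4, max_length: int = 10) -> Tuple[str, bool]:
    """Clean the value, then check the minimum and maximum length."""
    cleaned = allow_only_chars(raw)
    return cleaned, min_length <= len(cleaned) <= max_length


print(validate_pol_number("CL-12 34x"))
```

The cleaning step runs first, so the length checks apply to the cleaned value: the stray space and the letter x are removed before the length is measured.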
7.10.4 Vendor Number field
The validation for the Vendor Number field involves checking to see if it is populated. If it is populated, the validation does a database lookup to see if this value and the value from the Vendor Name field match in the database.
7.10.5 Vendor Name field
The validation for the Vendor Name field involves checking to see if it is populated. It also entails doing a database lookup to see if this value and the value from the Vendor Number field match in the database.
7.10.6 Ref ID field
Because the value of this field is controlled by the business that is creating the estimate and invoice forms, our validations here are limited. We can use an IsFieldLengthMin() action with a parameter of 3 to ensure that the field is populated. We can also use the IsFieldPercentNumeric() action and set the parameter to 60. This setting helps to ensure that the field contains at least 60% numeric characters.
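A minimal Python sketch of the two Ref ID checks follows. How the real IsFieldPercentNumeric() action counts characters such as spaces or punctuation is not stated here, so this sketch assumes that every character in the value counts toward the total; the function names are hypothetical.

```python
def percent_numeric(value: str) -> float:
    """Percentage of characters in the value that are digits."""
    if not value:
        return 0.0
    digits = sum(ch.isdigit() for ch in value)
    return 100.0 * digits / len(value)


def validate_ref_id(value: str, min_length: int = 3, min_percent: float = 60.0) -> bool:
    """Require a minimum length and a minimum share of numeric characters."""
    return len(value) >= min_length and percent_numeric(value) >= min_percent


print(validate_ref_id("INV-123456"))
print(validate_ref_id("AB12"))
```

The first value has 6 digits out of 10 characters, which is exactly 60% and passes. The second has only 50% digits and fails.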
7.10.7 Ref Date field
The Ref Date field is the date on the estimate or the invoice form. This date must be in a MM/DD/YYYY format. We can enforce this format by using the IsDateWithReformat() action and the MM/DD/YYYY parameter. If the date is in another format, the action normalizes the date into this format. We can also check to ensure that the date is recent, but not a date in the future, by using the IsFieldDateWithinXRange() action and setting the parameter to 90.
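The two date checks can be approximated in Python. This sketch makes several assumptions that the text does not confirm: the list of accepted input formats is our own, the range parameter is interpreted as a number of days before today, and the function names reformat_date and within_range are hypothetical.

```python
from datetime import date, datetime, timedelta
from typing import Optional

# Assumed input formats; the real action may accept a different set.
KNOWN_FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%m-%d-%Y")


def reformat_date(raw: str) -> Optional[str]:
    """Normalize a recognized date to MM/DD/YYYY, or return None if invalid."""
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
        return parsed.strftime("%m/%d/%Y")
    return None


def within_range(value: str, today: date, max_age_days: int = 90) -> bool:
    """True if the MM/DD/YYYY date is not in the future and not too old."""
    parsed = datetime.strptime(value, "%m/%d/%Y").date()
    return today - timedelta(days=max_age_days) <= parsed <= today


normalized = reformat_date("2024-03-05")
print(normalized, within_range(normalized, today=date(2024, 4, 1)))
```

Passing today as an argument, instead of reading the system clock inside the function, keeps the range check easy to test.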
7.11 Routing scanned pages
After the unattended validation runs, the Routing rule set helps to determine which pages will be presented to the Verify operator for manual validation. Within the Validate rule set, every page had a status of 0 or 1 assigned to it, where a 0 means it passes all field level validations, and a 1 means it failed on one or more field validations.
The Routing rule set checks the page status value. If the page status is set to 1, no additional checks are done because this page is already marked for operator intervention. When the page status is set to 0, the Routing rule applies. The Routing rule then runs an action that examines the confidence of every character on the page. If it finds any confidence value that is less than 8, it sets the page status to 1, marking this page for verification.
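The routing decision can be expressed as a short Python function. This is a sketch of the logic as described above, with the character confidences modeled as a list of integers; the function name route_page is hypothetical.

```python
from typing import List


def route_page(page_status: int, char_confidences: List[int], threshold: int = 8) -> int:
    """Return the page status after routing: 1 means send to the Verify operator."""
    if page_status == 1:
        # Already failed validation; no further checks are needed.
        return 1
    if any(confidence < threshold for confidence in char_confidences):
        # At least one low-confidence character on the page.
        return 1
    return 0


print(route_page(0, [9, 7, 9]))
```

A page that passed validation is still routed to the operator when a single character falls below the threshold.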
7.12 Verify task with Batch Pilot
During verification, you display pages to an operator for manual checking and possibly correction. You display pages to an operator for two reasons:
A page contains one or more characters or OMR fields that were marked “low confidence” by the recognition engine.
A validation rule failed, indicating a problem with the integrity of the data.
This section covers the verification user interface, Batch Pilot. The rules that are used here are described in 7.10, “Validating data”. We reuse the same rules to ensure that the reason that the page went to the Verify operator has been addressed. We also want to ensure that the operator does not introduce a new error while working on the page.
7.12.1 Creating a Verify panel
To create a Verify panel for the Estimate and Invoice forms, complete these steps:
1. Select Start → All Programs → Datacap → Batch Pilot → Batch Pilot to launch the Batch Pilot program.
2. Select File → Open Project.
3. Select the rrs_verify.bpp project, and then click Open.
4. In the bottom section of the Batch Pilot window, expand UPDATE → UPDATE WITH DOCUMENT.
5. Right-click the UPDATE WITH PAGE page, and select AutoForm. AutoForm reads the document hierarchy (setup DCO), and an image snippet control and an edit or list box control are displayed for each of the defined fields.
6. Select File → Save Form.
7. Open the UPDATEverify folder, and then save the form as Estimate.dcf.
8. In the bottom section of the Batch Pilot window, right-click the UPDATE page, and select Pick form.
9. From the UPDATEverify folder, select Estimate.dcf, and click Open to link the form to the page type.
10. Click File → Save Project.
7.12.2 Determining which pages to display
By default, Batch Pilot displays all pages to the operator, regardless of whether a given page has a problem. In this section, we configure Batch Pilot to display only pages with status of 1, which indicates a problem.
To display only pages with a status of 1, complete these steps:
1. In the Taskmaster Client window, click Administrator.
2. On the Workflow tab, expand Main Job → Verify and then select Setup.
3. In the Batch Pilot window, click File → Task Settings.
4. Click the Filters tab.
5. Under Type, select Main_Page. Then select Level: PAGE, Property: STATUS, Problem Value: 1 and click Add. Now if we are in the Verify panel and the page has a status of 1, indicating a problem, then the page is displayed to an operator. Click OK.
6. Click File → Quit.
7. Click Yes to save the project.
8. Click Apply, and then click Done to close the Administrator window.
7.13 Export task
Taskmaster can export data to a text file, an XML file, a database, a document management system, or a custom business process. The default output format is a text file. This section looks briefly at the actions that you can use to export data to a document management system, IBM FileNet P8, and a database. For the document management system, we have a use-case requirement that all of the estimates and invoices are sent to FileNet P8 as text-searchable PDF/A documents.
Making a PDF/A per Estimate and Invoice document
Now that we have extracted the data from the form and verified that it is correct, we can move to the next task, Export. The use-case scenario requires that we create a single PDF/A file for each estimate or invoice document regardless of how many pages are in the document. Because the pages are now grouped into documents, we can run a rule at the document level to create the PDF:
1. In Datacap Studio, on the Rulemanager tab, on the Rulesets subtab, right-click Auto_Claim, and then select Add Ruleset. This step adds a rule set, called Ruleset1, to the bottom of the list.
2. Rename Ruleset1 to Estimate Invoice PDF creation.
3. Select Function1. From the OCR_S actions library, add the action RecognizeDocToPDF() to it.
The RecognizeDocToPDF() action takes a numeric parameter that indicates the Document Format type. Five values are available:
1 A PDF with an image on text
2 A PDF where Graphics, KeepBold, KeepItalic, and KeepUnderline have an effect
3 A PDF with image substitutes where Graphics, KeepBold, KeepItalic, and KeepUnderline have an effect only
4 A PDF with image on text where Graphics, KeepBold, KeepItalic, and KeepUnderline have an effect only
5 A PDF image only
We choose the first option. These options come directly from the help file of the action.