Chapter IV.3. String Searching

Searching for data is one of the most common functions in writing a computer program. Most searching algorithms focus on searching a list of values, such as numbers or names. However, there's another specialized type of searching, which involves searching text.

Searching text poses unique problems. Although you can treat text as one long list of characters, you aren't necessarily searching for a discrete value, like the number 21 or the last name Smith. Instead, you may need to search a long list of text for a specific word or phrase, such as ant or cat food. Not only do you need to find a specific word or phrase, but you also may need to find that same word or phrase multiple times. Because of these differences, computer scientists have created a variety of searching algorithms specifically tailored for searching text.

Warning

Computers only recognize and manipulate numbers, so every computer represents characters as a universally recognized numeric code. Two common numeric codes include the American Standard Code for Information Interchange (ASCII) and Unicode. ASCII contains 256 codes that represent mostly Western characters whereas Unicode contains thousands of codes that represent languages as diverse as Arabic, Chinese, and Cyrillic. When searching for text, computers actually search for numeric codes that represent specific text, so text searching is ultimately about number searching.

Note

One of the most popular uses for text searching algorithms involves a field called bioinformatics, which combines molecular biology with computer programming. The basic idea is to use long text strings, such as gcacgtaag, to represent a DNA structure and then search for a specific string within that DNA structure (such as cgt) to look for matches that could indicate how a particular drug could interact with the DNA of a virus to neutralize it.

Sequential Text Search

The simplest text searching algorithm is the brute force sequential search, which simply examines every character. To look for the string gag in text, the brute force sequential search examines the text character by character. The moment it finds the letter g, it checks to see whether the next letter is a and so on. To find anything, this search algorithm must exhaustively examine every character, as shown in Figure 3-1.

In searching for the string GAG, a brute force search starts with the first character and finds a matching G character. Next, it checks whether the next two characters are an A and a G.

Sequential search examines every character.

Figure IV.3-1. Sequential search examines every character.

In this example, the third character (T) doesn't match, so the brute force algorithm starts all over again by examining the second character, even though it had previously examined that character. Because the brute force method examines every character individually, this method is the slowest and least efficient method for finding text.

Although the brute force method works, it can take too much time, especially when searching through large amounts of text. To make text searching faster and more efficient, computer scientists have developed a variety of alternative algorithms.

The Boyer-Moore algorithm

To speed up text searching, the computer should skip any previously examined text and jump straight to unexamined text. That's the basis for a text searching algorithm developed by two computer scientists (Bob Boyer and J. Strother Moore) called the Boyer-Moore algorithm.

Like the brute force algorithm, the Boyer-Moore algorithm examines text character by character. After the Boyer-Moore algorithm finds a partial match, it's smart enough to skip over previously examined characters and start searching from the end of all examined characters, speeding up the entire search process, as shown in Figure 3-2.

The Boyer-Moore algorithm skips over partially matched characters.

Figure IV.3-2. The Boyer-Moore algorithm skips over partially matched characters.

The Rabin-Karp algorithm

Although much faster than a brute force search, the Boyer-Moore algorithm still searches one character at a time. If you're searching for a text string, you can speed up the search by examining blocks of text rather than individual characters.

For example, if you're searching for the string GAG, you could examine three characters at a time rather than examining a single character three times. To make searching blocks of characters faster, two computer scientists (Michael O. Rabin and Richard M. Karp) created the Rabin-Karp algorithm.

This algorithm uses a hash function to convert a block of characters into a numeric value. Instead of examining individual characters, the Rabin-Karp algorithm uses its hash function to convert the original search string into a numeric value. So a hash function might convert the three-character string to search (GAG) into a numeric value of 3957.

After converting the search string into a numeric value, the Rabin-Karp algorithm repetitively searches for blocks of characters that are the same length of the search string (such as three-characters) and uses its hash function to convert those blocks of text into a numeric value. Now instead of searching for matching characters, the Rabin-Karp algorithm searches just for matching hash values, as shown in Figure 3-3.

The Rabin-Karp algorithm searches for hash values.

Figure IV.3-3. The Rabin-Karp algorithm searches for hash values.

The key to the Rabin-Karp algorithm is the speed and method of its hash function. If the hash function can create values quickly and insure that different strings never create the same hash value, this algorithm can run quickly. If the hash function calculates hash values slower than the computer can examine characters individually, this algorithm may run slower than another algorithm, such as the Boyer-Moore algorithm. Also if the hash function calculates identical hash values for two different strings, this algorithm won't be accurate enough because it finds the wrong data.

The Shift Or algorithm

The Shift Or algorithm takes advantage of the fact that computers are much faster manipulating 1s and 0s than they are in manipulating and comparing characters. First, the algorithm creates an empty array the same length as the text that you want to search. Then, it compares the first character of the target string (what you're trying to find) with each character in the search string. Every time it finds a match, it stores a 0 in the array element. Every time it doesn't find a match, it stores a 1 in the array element, as shown in Figure 3-4.

The Shift Or algorithm creates an array of matching characters.

Figure IV.3-4. The Shift Or algorithm creates an array of matching characters.

After creating an array by comparing the first character with each character in the search string, the algorithm next looks only for a zero (0), which identifies where the first character of the target string was found. Now it compares the character to the right. If a match is found, it stores a 0 in a second array that represents matching the second character of the target string. If a match isn't found, it stores a 1 in this second array.

The algorithm repeats this step for each character in the target string, eventually creating a two-dimensional array of 1s and 0s, as shown in Figure 3-5.

When searching for the first character (G), in Figure 3-5, the algorithm must check every character in the entire string. However, when searching for the second character (A), the algorithm only has to look for the 0s in the previous row of the two-dimensional array, which identifies where the G character appears. In Figure 3-5, this means only searching three characters out of the entire string to look for a possible match of the GA string.

When searching for the third character (G), the algorithm now only checks for 0s in the second row of the two-dimensional array, which means it only checks three characters out of the entire string.

The Shift Or algorithm creates a two-dimensional array.

Figure IV.3-5. The Shift Or algorithm creates a two-dimensional array.

As soon as the algorithm finds three 0s that form a diagonal line in the two-dimensional array, it can identify the exact location of the GAG string in the much larger string. The Shift Or algorithm gets its name because the matching string patterns look like binary numbers where the 0 constantly gets shifted one place to the right, like this:

G →011

A →101

G →110

Warning

Although the shift or algorithm takes more steps than a simple brute force search, it's much faster. Sometimes the simplest algorithms aren't always the fastest.

The finite automaton string search algorithm

First, the algorithm creates a finite state machine, which is a directed graph (see Book III, Chapter 5) where each node represents a single character in the target string to find. So if you wanted to find the string GAG, this algorithm creates a finite state machine consisting of three nodes with one node representing a starting state where the string hasn't been found yet. The first node represents finding the first letter G; the second node represents finding the second letter A; and the third node represents finding the final letter G, as shown in Figure 3-6.

After this algorithm creates a finite state machine for the target string, it next examines each character in the search string. In this example, the search string is GAAGGAGAA.

A finite state machine consists of nodes and arrows.

Figure IV.3-6. A finite state machine consists of nodes and arrows.

Initially, the algorithm starts at node 0. The first character it finds is the letter G, so it moves to node 1. The second character it finds is the letter A, so it moves to node 2. However, the third character it finds is the letter A, so it starts back at node 0 again.

The fourth character it finds is the letter G, so it moves back to node 1. The fifth character it finds is also the letter G, so it stays at node 1. The sixth character that it finds is the letter A, so now it moves to node 2. The seventh character that it finds is the letter G, so it moves to the last node that signals a match has been found.

Note

This algorithm is commonly used in Internet search engines, such as Google and Yahoo!

Searching with Regular Expressions

The finite automaton string search algorithm is the basis for a special text searching technique known as regular expressions (sometimes abbreviated as RegEx). Rather than write your own code to implement a finite automaton search algorithm, you can use regular expressions that are built-in to many languages (such as Perl and PHP) or added as libraries (such as Java and .NET languages like C# and Visual Basic).

The basic idea behind regular expressions is to search not just for specific strings but also for patterns. This provides greater flexibility because you may not always know exactly what you want to find. For example, a normal search algorithm doesn't find a string unless you know the entire string you want to find, such as the last name of Smith. If you only know the first three or four characters of a last name, you can use regular expressions instead.

Searching for single character patterns

The simplest regular expression pattern searches for a single character. Some single character patterns are shown in Table 3-1.

Table IV.3-1. Single Pattern Regular Expressions

Pattern Character

What It Finds

Example

Find

. (period)

Any single character

s.m

sum sam

w

Any letter or number

wats

cats 8cats

W

Any character except a letter or number

213W

213-213@

d

Any number from 0-9

ddd-1234

597-1234 409-1234

D

Any character except a number from 0-9

WD9-1234

WP9-1234 W$9-1234

Suppose you want to search for a string that begins with the letters c or f. To specify specific characters, you can define your own set, like this:

[cf]at

This regular expression finds strings, such as cat and fat but not rat. Sometimes it may be easier to define which characters you don't want rather than the ones you do want. In that case, you can use the ^ character in a set to define which characters you don't want, like this:

[^rht]oss

This finds any four-character strings that end with oss except for ross, hoss, and toss. The ^rht expression tells the computer to match any characters except for r, h, and t.

Searching for multiple character patterns

If you want to search for multiple characters, you could use a single character pattern several times. For example, to find a three-number string, you could use this regular expression pattern:

ddd

However, if you want to find a string consisting of one hundred numbers, typing d one hundred times is impractical. As an alternative, you can specify multiple patterns with either the * or + symbol.

Both symbols appear directly after a single-character pattern, such as d* or d+. The * symbol looks for zero or more instances of the pattern whereas the + symbol looks for one or more instances of the pattern. So d*123 finds the strings 9123, 899123, and 123 but d+123 finds only the 9123 and 899123 strings.

The * and + symbols can also work with single character sets, like this:

[rht]*oss

This searches for any string that ends with oss and contains zero or more of the r, h, or t characters, such as oss, thtrtoss, and tthrrhhhhhtoss.

Searching for alternate patterns

If you want to find the names John Smith and Mary Smith, you could search twice. However, a simple solution is to search for both patterns at the same time with the alternation operator (|), like this:

John Smith | Mary Smith

This regular expression tells the computer to find either the string John Smith or Mary Smith. You can combine the alternation operator with any pattern, like this:

[rht]*oss | d+-1234

This regular expression finds strings, such as rthrross and 5-1234.

Searching Phonetically

Regular expressions can make it easy to find strings when you only know part of the characters to find. However, sometimes you may know the pronunciation of a string you want to find, but you aren't sure of the exact spelling. Trying to find the word elephant with a regular expression of elefaw* doesn't work if you don't realize that the ph sound in elephant makes an f sound. To search strings phonetically, use a phonetic algorithm, such as the Soundex algorithm.

Note

The Soundex algorithm was actually patented in 1918 by Margaret O'Dell and Robert C. Russell. This algorithm is based on dividing spoken speech into six phonetic classifications based on where you put your lips and tongue to make sounds.

Basically, the Soundex algorithm converts each string into a numeric code that begins with the first letter of the string. So if you had the name Harry, the Soundex algorithm might convert that name into the code H600 by following these steps:

  1. Capitalize all letters in the string.

  2. Retain the first letter of the word.

  3. Change all occurrences of the following letters to 0 (zero):

    A, E, I, O, U, H, W, Y

  4. Replace any letters with the following numbers:

    1 = B, F, P, V

    2 = C, G, J, K, Q, S, X, Z

    3 = D, T

    4 = L

    5 = M, N

    6 = R

  5. Replace all pairs of identical digits with a single digit, such as replacing 66 with just 6.

  6. Remove all zeros from the string.

  7. Pad the string with trailing zeros so the entire Soundex code consists of the following format:

    <uppercase letter> <digit> <digit> <digit>

Table 3-2 shows how the Soundex algorithm calculates identical Soundex codes for the strings Harry and Hairy.

Table IV.3-2. Calculating a Soundex Code

Soundex Algorithm Step

String #1 (Harry)

String #2 (Hairy)

1

HARRY

HAIRY

2

H

H

3

H0RR0

H00R0

4

H0660

H0060

5

H060

H060

6

H6

H6

7

H600

H600

If you had the strings Harry stored in a text file, the Soundex algorithm converts that string into the H600 code. Now if you searched for the string hairy with the Soundex algorithm, the computer converts hairy into the Soundex code H600 and then finds the same H600 code stored at the string Harry, thus finding a matching string phonetically.

Note

Phonetic algorithms are used most often in spell checkers. The algorithm calculates a code for each misspelled word and then matches that code to correctly spelled words that have the same phonetic code.

String searching algorithms must examine every character, so the only way to speed up an algorithm is to simplify how it examines a string of text. Paradoxically, string searching algorithms often run faster by organizing text in a specific way and then searching that method of organization rather than the actual text itself, which is how the Shift Or and Soundex algorithms work. As a general rule, the faster the string searching algorithm, the harder and more complicated it is to implement.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset