Chapter IV.2. Searching Algorithms

One of the most common functions of a computer program is searching. A database needs to search through names and addresses, a word processor needs to search through text, and even a computer chess game needs to search through a library of moves to find the best one.

Because searching is such an important part of computer programming, computer scientists have developed a variety of algorithms to search for data. When searching for data, the main limitation is time. Given enough time, any search algorithm can find what you want, but there's a big difference between finding data in five seconds or five hours.

The time a search algorithm takes is always related to the amount of data to search, which is the search space. The larger the search space, the slower the search algorithm. If you only need to search a small search space, even a simple and slow search algorithm is fast enough.

The two main categories of search algorithms are

  • Uninformed (or brute-force)

  • Informed (or heuristic)

Uninformed, or brute-force, search algorithms work by simply examining the entire search space, which is like losing your car keys in your apartment and searching every apartment in the entire building. Eventually, you find your keys, but it may take a long time to do it.

Informed, or heuristic, search algorithms work by selectively examining the most likely parts of the search space. This is like losing your car keys in your apartment but only examining the bedroom where you last saw your car keys. By using knowledge of the search space, informed search algorithms can speed up a search algorithm by eliminating obvious parts of the search space that don't contain the data you want to find.

Uninformed (or brute-force) search algorithms are much simpler and faster to write in any programming language, but the price you pay may be slower searching speed. Informed (or heuristic) search algorithms generally take more time to write, but the speed advantage may be worth it, especially if your program needs to search data on a regular basis. One problem with informed (or heuristic) search algorithms is that they often require that the data be sorted first or stored in a data structure, such as a tree, that's more complicated to traverse.

Warning

The perfect search algorithm is one that's easy for you to implement in your favorite programming language while also being fast enough for your program.

Sequential Search

A sequential search is an example of an uninformed search algorithm because it searches data one item at a time starting from the beginning and searching through to the end. In the best-case scenario, a sequential search finds data stored as the first element in a data structure. In the worst-case scenario, a sequential search has to search an entire data structure to find the last item stored, as shown in Figure 2-1.

Figure IV.2-1. The speed of sequential search depends directly on the size of the data to be searched.
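To make this concrete, here's a minimal Python sketch of a sequential search (the function name and sample list are invented for illustration):

```python
def sequential_search(items, target):
    """Examine each item in turn; return its index, or -1 if absent."""
    for i, value in enumerate(items):
        if value == target:
            return i   # best case: the target is the first element
    return -1          # worst case: every element was examined

print(sequential_search([4, 8, 15, 16, 23, 42], 23))  # → 4
```

Notice that the loop never looks at more than one item at a time, which is why the running time grows directly with the size of the list.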

To speed up sequential searching, you can add simple heuristics. Some popular ways to speed up sequential searching include

  • Backward or forward searching

  • Block searching

  • Binary searching

  • Interpolation searching

Backward or forward searching

If the data is sorted, you can make the sequential search start looking through a data structure from either the beginning or the end. So if you need to search an array that contains numbers organized in ascending order from 1 to 100, searching for the number 89 will be faster if you start at the end of the array, as shown in Figure 2-2.

Figure IV.2-2. Sequential search can be made faster by searching from either the front or end of a data structure.

The backward or forward searching algorithm works like this:

  1. Compare the value to find with the range of values stored in the sorted data structure.

  2. If the data to find is in the first half of the data structure, start at the front of the data structure; otherwise, start at the end.

  3. Search sequentially until the data is found or confirmed not to exist in the data structure.
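The three steps above can be sketched in Python, assuming a sorted list of numbers (the midpoint estimate here is just the average of the first and last values):

```python
def forward_backward_search(sorted_nums, target):
    """Scan from whichever end is likely closer to the target."""
    if not sorted_nums:
        return -1
    midpoint = (sorted_nums[0] + sorted_nums[-1]) / 2
    if target <= midpoint:
        indices = range(len(sorted_nums))               # front to back
    else:
        indices = range(len(sorted_nums) - 1, -1, -1)   # back to front
    for i in indices:
        if sorted_nums[i] == target:
            return i
    return -1

# Searching 1..100 for 89 starts at the end, so it finds index 88 quickly.
print(forward_backward_search(list(range(1, 101)), 89))  # → 88
```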

Tip

Searching either backward or forward also has an advantage when searching through data structures that organize data by age. If data is stored in a queue, the oldest data appears at the end, and the newest data appears at the beginning. So if you can identify the age of the data you want to find, you can speed up the search by knowing whether to start at the beginning of the queue or the end.

Block searching

Another technique to speed up sequential searching on sorted data is jump, or block searching. Rather than search one item at a time, this method jumps over a fixed number of items (such as five) and then examines the last item:

  • If this last item is greater than the value the algorithm is trying to find, the algorithm starts searching backward.

  • If this last item is less than the value the algorithm is trying to find, the algorithm jumps another block forward, as shown in Figure 2-3.

Figure IV.2-3. Jump, or block, searching can speed up a sequential search on sorted data.

The block searching algorithm works like this:

  1. Jump ahead a fixed number of items (a block).

  2. Compare the last value of the block:

    • If this value is greater than or equal to the data to find, search sequentially within the block.

    • Otherwise, jump ahead another block and repeat Step 2.

The basic idea behind block searching is to skip ahead through a sorted list of data and then slow down when it gets closer to that data. This is like looking through a telephone book for the name Winston Smith by skipping every ten pages until you reach the S section and then searching sequentially until you find the name Smith and finally the name Winston Smith.
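Here's one way the jump-and-scan idea might look in Python. The block size is an assumption; using roughly the square root of the list length is a common choice, though any fixed size works:

```python
import math

def block_search(sorted_items, target, block=None):
    """Jump a block at a time, then scan the one block that could hold target."""
    n = len(sorted_items)
    if n == 0:
        return -1
    if block is None:
        block = max(1, math.isqrt(n))   # sqrt(n) is a common block size
    start = 0
    # Keep jumping while the last item of the current block is too small.
    while start < n and sorted_items[min(start + block, n) - 1] < target:
        start += block
    # Scan sequentially inside the block that may contain the target.
    for i in range(start, min(start + block, n)):
        if sorted_items[i] == target:
            return i
    return -1

print(block_search([2, 4, 6, 8, 10, 12, 14, 16], 12))  # → 5
```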

Warning

Block searching can work only with sorted data. If data isn't sorted, block searching can't work at all.

Binary searching

A variation of block searching is binary searching, which essentially uses a block half the size of the list. After dividing a list in half, the algorithm compares the value it's trying to find with the last value of the first half of the list. If this last value is smaller than the value it's trying to find, the algorithm knows to search the second half of the list instead. Otherwise, it searches the first half of the list.

The algorithm repeatedly divides the list in half and searches only the list that contains the range of values it's trying to find. Eventually, the binary search finds the data, as shown in Figure 2-4.

Figure IV.2-4. Binary searching divides a list in half until it eventually finds its data.

The binary search algorithm works like this:

  1. Divide a sorted list in half.

  2. Compare the last value of the first half of the list with the desired value:

    If this last value is less than the desired value, search the second half of the list. Otherwise, search the first half of the list.

  3. Repeat Steps 1 and 2 until the desired value is found or confirmed not to exist.
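These steps translate into a short Python sketch. Tracking the still-unsearched range with a pair of low/high indexes is one common way to express the repeated halving:

```python
def binary_search(sorted_items, target):
    """Repeatedly halve the range that could still contain the target."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        elif sorted_items[mid] < target:
            low = mid + 1    # the target must be in the upper half
        else:
            high = mid - 1   # the target must be in the lower half
    return -1                # the range is empty: the target isn't stored

print(binary_search([3, 9, 14, 21, 30, 38, 55], 30))  # → 4
```

Each pass through the loop throws away half of the remaining items, which is why binary searching stays fast even on very large lists.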

Interpolation searching

Instead of jumping a fixed number of items, like block searching, or dividing a list in half, like binary searching, interpolation searching tries to guess the approximate location of data in a sorted list. After it jumps to the approximate location, the algorithm performs a normal sequential search, as shown in Figure 2-5.

Figure IV.2-5. Interpolation searching tries to jump straight to the approximate location of the target data.

Interpolation searching mimics the way a person might look up a name in a telephone book. If you're looking for the name Winston Smith, you jump straight to the S section. Then you slow down to look for the Smith name, and slow down even more to look for all Smith names whose first name begins with W until you find Winston Smith.

Warning

Although potentially faster than other forms of sequential searching, interpolation searching requires enough knowledge to jump as close to the desired data as possible. Because this might not always occur, interpolation searching isn't always faster than other forms of searching, such as binary searching.

Interpolation searching follows these steps:

  1. Jump to the approximate location of the target data in a sorted list.

  2. Start searching sequentially until the desired data is found or confirmed not to exist.
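For numeric data, one common way to make the "approximate jump" in Step 1 is linear interpolation between the first and last values. This Python sketch assumes the list holds numbers sorted in ascending order:

```python
def interpolation_search(sorted_nums, target):
    """Estimate the target's position from its value, then probe there."""
    low, high = 0, len(sorted_nums) - 1
    while low <= high and sorted_nums[low] <= target <= sorted_nums[high]:
        if sorted_nums[high] == sorted_nums[low]:
            pos = low   # all remaining values are equal; avoid dividing by zero
        else:
            # Linear estimate of where target sits between low and high.
            pos = low + (target - sorted_nums[low]) * (high - low) // (
                sorted_nums[high] - sorted_nums[low])
        if sorted_nums[pos] == target:
            return pos
        elif sorted_nums[pos] < target:
            low = pos + 1    # overshot low: move the range up
        else:
            high = pos - 1   # overshot high: move the range down
    return -1

# Evenly spaced data is the best case: the first guess lands on the target.
print(interpolation_search(list(range(0, 200, 2)), 150))  # → 75
```

On evenly distributed data the estimate lands very close to the target, but on lopsided data the guesses can be poor, which is why this method isn't always faster than binary searching.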

The key to interpolation searching is the computer's ability to jump accurately to the position where the data is likely to be stored. One way of guessing the location of data is to use Fibonacci numbers, creating the Fibonacci searching technique.

Note

Fibonacci numbers are a series of numbers that are calculated by adding the last two numbers in a series to determine the next number in the series. So the first Fibonacci number is 0, the second is 1, the third is 1 (0 + 1), the fourth is 2 (1 + 1), the fifth is 3 (1 + 2), the sixth is 5 (2 + 3), and so on like this:

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, and so on . . . .

Fibonacci numbers tend to occur in nature, such as in the branching of trees or the curves of waves. So the idea behind Fibonacci searching is that rather than divide a list in half, like binary searching, Fibonacci searching divides a sorted list into progressively smaller lists, based on Fibonacci numbers, until it finally finds the data or confirms that the data doesn't exist. Surprisingly, this method can sometimes run faster than binary searching.

Fibonacci searching works like this, as shown in Figure 2-6:

Figure IV.2-6. Fibonacci numbers divide and search a list more efficiently than a binary search.

  1. Determine the size of the sorted list (dubbed n).

  2. Find the largest Fibonacci number that's less than the size of the sorted list (dubbed p).

  3. Examine the value stored at the pth location of the sorted list.

    If this value is the one you want to find, stop.

  4. If the value at this pth location is less than the data you're searching for, search the list to the right of the pth location. If the value at this pth location is greater than the data you're searching for, search the list to the left of the pth location.

  5. Repeat Steps 1–4.
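Here's one way the Fibonacci narrowing might look in Python. This follows a common array-based formulation that tracks three consecutive Fibonacci numbers and an offset marking how much of the list has been ruled out:

```python
def fibonacci_search(sorted_items, target):
    """Narrow the search range by Fibonacci steps instead of halving."""
    n = len(sorted_items)
    fib2, fib1 = 0, 1              # two consecutive Fibonacci numbers
    fib = fib1 + fib2
    while fib < n:                 # find the smallest Fibonacci number >= n
        fib2, fib1 = fib1, fib
        fib = fib1 + fib2
    offset = -1                    # everything at or before offset is ruled out
    while fib > 1:
        i = min(offset + fib2, n - 1)      # probe one Fibonacci step ahead
        if sorted_items[i] < target:
            fib, fib1 = fib1, fib2         # drop one Fibonacci step
            fib2 = fib - fib1
            offset = i                     # discard everything up to i
        elif sorted_items[i] > target:
            fib, fib1 = fib2, fib1 - fib2  # drop two Fibonacci steps
            fib2 = fib - fib1
        else:
            return i
    if fib1 == 1 and offset + 1 < n and sorted_items[offset + 1] == target:
        return offset + 1                  # one last candidate to check
    return -1

print(fibonacci_search([10, 22, 35, 40, 45, 50, 80, 82, 85, 90, 100], 85))  # → 8
```

One often-cited practical point is that this version needs only addition and subtraction to pick its probe positions, with no division at all.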

Using Indexes

Imagine yourself trying to find a certain store in a large shopping mall. You could wander up and down the corridors and examine each store, one by one, which is like a sequential search. Even if you use the various sequential search tactics, like block searching, sequential searching can still take a long time.

Instead of searching each store sequentially, here's a faster way: Look at the mall directory, find the store you want, and then walk straight to that store. That's the difference between sequential searching and indexes. An index points you directly toward the item you want to find no matter how many items there may be. Indexes basically act like a shortcut to searching.

Creating an index

Indexes are similar to hash tables (see Book III, Chapter 3). The main difference is that a hash table calculates a unique value based on the total data stored whereas an index typically stores part of the data in a separate table that points to the rest of the data, as shown in Figure 2-7.

Indexes are most often used in databases. If you organize a list of data in rows and columns with each column representing a field (such as Name or Phone Number) and each row representing a record (that contains one person's name and phone number), an index can be as simple as a single column that consists of the data you're most likely to use for searching.

For example, if you have a database of names, addresses, and phone numbers, you probably spend more time looking up someone's phone number by looking up his last name. So you could use the last name field as an index. At the simplest level, an index is nothing more than an organized list of existing data, such as a list of last names organized alphabetically (see Figure 2-7).

If you create an index based on last names and you need to search by last name, an index can find your data faster. However, what if you want to search by phone number or city, but your index consists only of last names? In that case, you can create multiple indexes, one for each type of data.
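As a sketch of the idea, here's a tiny Python example that builds an index on the last-name field of some made-up records (the names and phone numbers are invented for illustration). The index maps each last name to the row numbers where it appears, so a lookup follows the index straight to the matching records instead of scanning every row:

```python
# Hypothetical records: (last name, first name, phone number)
records = [
    ("Smith", "Winston", "555-0101"),
    ("Jones", "Ada",     "555-0102"),
    ("Smith", "Julia",   "555-0103"),
]

# Build the index: each last name points to the rows where it occurs.
last_name_index = {}
for row, (last, first, phone) in enumerate(records):
    last_name_index.setdefault(last, []).append(row)

# Follow the index pointers to the matching rows.
for row in last_name_index.get("Smith", []):
    print(records[row])   # prints both Smith records
```

To search by phone number or city as well, you'd build a second dictionary keyed on that field, which is exactly the multiple-index idea described above.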

Figure IV.2-7. Comparison of hash tables and indexes.

Clustered and unclustered indexes

Here are two types of indexes — clustered and unclustered.

A clustered index sorts the actual data. If you have a list of names and addresses, a clustered index could sort the data by last name, which physically rearranges the data in order. Because a clustered index physically rearranges data, you can have only one clustered index per file. Sorting data by a single field, such as last name, is an example of a clustered index.

An unclustered index doesn't physically rearrange data but creates pointers to that data. Because unclustered indexes don't rearrange the data, you can have as many unclustered indexes as you need. The drawback of unclustered indexes is that they're slower than clustered indexes. A clustered index finds the data right away whereas an unclustered index needs an extra step to search the index and then follow the pointer to the actual data, as shown in Figure 2-8.

Figure IV.2-8. Clustered indexes physically rearrange data whereas unclustered indexes point to data.

Problems with indexes

A single clustered index makes sense because it rearranges data in a specific way. Multiple, unclustered indexes can help search data in different ways. Although indexes make searching faster, they make inserting and deleting slower because every time you add or delete data, you must update and organize the index at the same time.

If you've created multiple indexes, adding or deleting data means having to update every one of those indexes. If you have a small amount of data, creating and using an index may be more trouble than it's worth. Only when you have large amounts of data is an index (or multiple indexes) worth using.

Adversarial Search

One of the most popular uses for searching is an adversarial search. This type of search is often used to create artificial intelligence in video games.

Essentially, the computer analyzes the current game situation, such as a tic-tac-toe game, and calculates its list of possible moves. For each possible move, the computer creates a tree where the root node represents one of the computer's possible moves and each alternating level represents the human opponent's possible counter-moves, as shown in Figure 2-9.

Each possible move is given a specific value:

  • A high value signifies a good move.

  • A negative value signifies a bad move.

Assuming the human opponent chooses one possible counter-move, the next level of the tree displays the computer's possible responses and so on.

The more levels (or plys) the computer can analyze, the more it can anticipate and plan ahead and the smarter the computer can appear.

Depth versus time

Given enough time, the computer can examine every possible move and all possible counter-moves until it finds the best move to make that will lead to its inevitable victory. In simple games, like tic-tac-toe, where the number of choices is finite, this approach of searching all possible moves, called brute-force, works. When applied to more complicated games, like chess, such a brute-force approach takes way too long.

To reduce the amount of time needed to search, computer scientists have come up with a variety of solutions. The simplest method is to reduce the number of plys the computer examines for each move.

Figure IV.2-9. A tree can analyze the best possible move.

This is how many games offer beginner, intermediate, and expert modes. The expert mode may search 24 levels in each tree, the intermediate mode may only search 12 levels, and the beginner mode may only search 4 levels. Because the beginner mode searches far fewer levels than the expert mode, it runs faster but doesn't appear as smart as the intermediate or expert modes.

Note

When a computer doesn't search beyond a fixed number of levels, it can miss potential problems that it might have discovered if it had just searched a little bit deeper. This event is dubbed the horizon effect because the computer doesn't see the consequences beyond a certain move, so the problem appears to lie outside the computer's sight or beyond the horizon.

Alpha-beta pruning

Another way to speed up searching is to use alpha-beta pruning. The idea behind this tactic is that it's relatively pointless to keep searching a tree if a potential move would represent a horrible choice. For example, in a chess game, two possible moves might be moving the king into a position where it'll get checkmated in two moves or moving a pawn to protect the king.

If a computer always searches every possible move down to 12 levels, it wastes time evaluating the bad move that results in the king getting checkmated in two moves. To save time, alpha-beta pruning immediately stops searching a tree the moment it detects a losing move and makes the computer focus on studying good moves instead. As a result, the computer's time can be spent more profitably examining good moves.

For example, consider the tree in Figure 2-10; the boxes represent possible moves for the computer, and the circles represent possible counter-moves by a human opponent. The higher the value in each box or circle, the better the move. So the human opponent will most likely choose moves with high values. In response, the computer must look at the best possible counter-moves based on what the human opponent is likely to choose.

Figure IV.2-10. Assigning values to possible moves helps the computer evaluate the best possible move.

So if the computer considers a move with a value of 6 (the root node), the human opponent might have four possible moves with values ranging from −45 to 68. Assuming the human chooses the best move (68), the computer may have a choice of only two possible moves (−6 and 8). The goal is to choose the best possible move (max) for the computer that leaves the human opponent with a choice of nothing but the worst possible moves (min), which is why such a tree of moves and assigned values is known as a min-max tree.

Assuming the computer chooses this original move (6) and the human opponent responds with the best possible move of 68, the computer now has a choice of evaluating the −6 or 8 move. Because there's no point in evaluating the −6 move, alpha-beta pruning would stop the computer from further evaluating this move and just focus on the 8 move instead.
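To make the pruning concrete, here's a minimal Python sketch of min-max search with alpha-beta pruning over a small hand-built tree. The tree and its scores are invented for illustration; internal nodes are lists of children, and leaves are numeric scores:

```python
import math

def alphabeta(node, depth, alpha, beta, maximizing):
    """Min-max search with alpha-beta pruning over a tree of nested lists."""
    if depth == 0 or not isinstance(node, list):
        return node                     # leaf: return its evaluated score
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, depth - 1, alpha, beta, False))
            alpha = max(alpha, best)
            if alpha >= beta:
                break                   # prune: the opponent avoids this line
        return best
    else:
        best = math.inf
        for child in node:
            best = min(best, alphabeta(child, depth - 1, alpha, beta, True))
            beta = min(beta, best)
            if alpha >= beta:
                break                   # prune: the computer avoids this line
        return best

# A two-ply tree: the computer (max) moves, then the opponent (min) replies.
tree = [[-6, 8], [-45, 12, 68], [3, 5]]
print(alphabeta(tree, 2, -math.inf, math.inf, True))  # → 3
```

When the search reaches the middle branch and sees the −45 reply, alpha-beta cuts off the rest of that branch immediately: the opponent could force at worst −45 there, which is already worse than the −6 the computer is guaranteed elsewhere.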

Looking up a library of good moves

Alpha-beta pruning relies on examining every tree of possible moves and immediately cutting off the least promising ones. Obviously, some moves aren't worth considering, but the computer has no way of knowing that until it evaluates every move.

However, at the beginning of every game, there's always a list of good and bad moves, so many games include a library of these good moves. Now at the start of the game, the computer doesn't have to waste time searching every move but can just pick from its library of best possible moves and examine those moves in depth.

A way to use this technique in the middle of a game is to analyze all possible moves in two steps. In the first step, the computer only examines every possible move through a small number of levels, such as two. The idea is that most bad moves can be identified immediately, like moving a queen in a chess game so it can be captured by the opponent's pawn in the next move.

After the computer examines all possible moves in such shallow depth, it can eliminate the obviously bad moves and then for the second step, examine the remaining moves in more detail. Although this technique takes slightly more time to examine all possible moves through a shallow depth, it ultimately saves time by preventing the computer from examining both bad and good moves at a much deeper level.
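The two-step filtering idea might be sketched in Python like this. The scoring functions here are hypothetical stand-ins: in a real game, the shallow score would be a cheap few-ply evaluation and the deep score an expensive many-ply search:

```python
def two_step_search(moves, shallow_score, deep_score, keep=3):
    """Cheaply score every move, keep the most promising few,
    then spend the expensive deep evaluation only on those."""
    # Step 1: shallow pass over every candidate move.
    ranked = sorted(moves, key=shallow_score, reverse=True)
    survivors = ranked[:keep]          # obviously bad moves are dropped here
    # Step 2: deep pass over the survivors only.
    return max(survivors, key=deep_score)

# Toy example: the shallow score keeps the three largest candidates,
# and the deep score then prefers whichever is closest to 4.
best = two_step_search([1, 2, 3, 4, 5],
                       shallow_score=lambda m: m,
                       deep_score=lambda m: -abs(m - 4))
print(best)  # → 4
```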

Ultimately, searching always involves examining every item in a list, which means the larger the list, the longer the search time. The only way to speed up searching algorithms is to use different techniques to maximize the chances of finding data as soon as possible.

The simplest way to speed up any search algorithm is to sort the data beforehand. After a list has been sorted, the computer can use various techniques, such as block jumping or Fibonacci searching, to speed up the search.

If data isn't sorted, it may be possible to use an index. An index works most effectively when it organizes part of the stored data, such as indexing the last names of a customer list that contains names, addresses, and phone numbers. Although indexes can speed up searching, their need for constant updating makes adding and deleting data slower.

Rather than search through stored data, strategy games, such as chess, must search through continuously changing data based on the position of the game pieces. To search through this ever-changing data, strategy games must rely on techniques to quickly eliminate bad moves so the computer can spend its time focusing only on evaluating the best moves.
