Comparing Blocks of Data

One action you are bound to need from time to time is that of comparing blocks of data. This may happen when you need to sort data, find some specific data element, or traverse a base of data. In the case of the example data file, which consists of textual article descriptions, this comparing of blocks of data basically translates into comparing strings. Note that strings are, in effect, nothing more than arrays of bytes, just like any other data block; the only difference is that strings are generally terminated by a null character. In this chapter, blocks of data will be treated as strings and text files as this makes the examples more readable.

Standard Pattern Matching

Comparing two patterns of data—or strings—in C++ is not that complicated because the C++ programming language contains a standard string implementation which can handle this. This string implementation is part of something called the Standard Template Library (STL) ; consult your compiler documentation for more information on STL. Comparing two standard C++ strings can be done as demonstrated in the following example:

#include <string>
#include <iostream>

using namespace std;

void main()
{
  string s1("This is string one.");
  string s2("This is string two.");

  if (s1 == s2)
  {
    // do something.
  }
}

The reason it is possible to compare two objects of the string class is that the == operator has been overloaded to perform just this action. In C, the piece of code would look like this:

#include <string.h>

void main()
{
  char s1[]= "This is string one.";
  char s2[]= "This is string two.";

  if (strcmp(s1,s2) == 0)
  {
    // do something.
  }
}

In order to test the merits of these two implementations, a program is used that loads and searches through the dbase.txt file, counting the number of Small Table articles in the colors yellow, black, silver, and magenta. The program can be found in the file 10Source01.cpp on the Web site. The article name and color are the first attributes of every article in the dbase.txt file, so a quick compare with the beginning of each line in the file will determine whether the article is to be counted or skipped. The timing mechanism of the program does not time the loading of the database, only the time that is spent in the search algorithms. You can examine the exact implementation of the program at your leisure; for now, it is important only to mention that this straightforward comparison between using the C++ string class and using char pointers proves the char pointer method to be much faster than the string method (two to five times faster, depending on the system used).

It seems that although the standard C++ string is very easy to use and offers a great many standard functions, it is not what you will want to use when you have to search through a large block of data.

Faster Pattern Matching

Without even changing the way in which you actually compare patterns (or strings), you can make your string comparisons faster in a lot of cases by using a macro that can skip the actual call to the strcmp function:

#define macro_strcmp(s1,s2) ((s1)[0]× != (s2)[0] ? (s1)[0] - (s2)[0]:strcmp((s1), (s2)))

if (macro_strcmp(s1, s2) == 0)
{
    // do something
}

This macro is used to determine whether the first characters of both strings match. When the first characters are the same the strcmp function is called as you would normally do; when the first characters are not the same, however, the string will of course be different and strcmp is not called. This will save a lot of time for all strings of which the first character is different. Note that this even works for empty strings.

Only when different strings contain the same first character does this new method introduce extra overhead because of the extra check introduced by the macro. It depends on the type of strings in the database, therefore, whether or not this macro represents an improvement. For our example of finding Small Table articles in the dbase.txt file, this macro implementation is again faster (up to 34%). This is reflected in the results of the 10Source01.cpp program.

Another way to speed up string comparison is by treating the character arrays as integer arrays . On most systems an integer is four times larger than a character, and so four characters can be compared at the same time. When the strings have a length that is a multiple of the length of an integer, doing integer comparisons can greatly speed up your string comparison. Listing 10.1 shows what an integer string compare function can look like.

Code Listing 10.1. Integer String Comparison
inline int int_strcmp(char *s1, char *s2, int len)
{
    if ((len & 0x03) != 0)
        return(macro_strcmp(s1, s2));

    for (int i =0; *(int*)&s1[i] == *(int*)&s2[i];)
    {
        i+=4;
        if (i >= len)
            return(true);  // match
    }
    return(false);
}

The int_strcmp function quickly checks whether the given string length is a multiple of four. If this is not the case, the previously discussed macro_strcmp function is called. For strings with a correct length, characters are compared in groups of four by casting the character pointers to integer pointers and thus reading integers from the string. The longer the compared strings are—and the more they look alike—the more benefits this int_strcmp function reaps compared to the previous implementations given in this chapter. For our example of finding Small Table articles in the dbase.txt file, this integer implementation is again faster (up to 50%). This is reflected in the results of the 10Source01.cpp program.

The reason why the string lengths have to be a multiple of the integer size for this function to work is that any other size will cause this function to compare beyond the length of a string. A string with a length of six, for instance, will be compared in two loop iterations. The first iteration compares the first four bytes, which is no problem. The second iteration compares the second four bytes, of which only two are part of the actual string. The third byte will be a null indicating the end of the string. The fourth byte, however, is not part of the string. Two strings of six characters that are identical could be found to be different just because of the value of these fourth bytes. The int_compare function can of course be altered to check for different string sizes but it is not very likely that it will still be faster in most string compare cases.

This may make you wonder whether the int_compare function is all that useful. It is, in fact, as databases often contain records with fixed field sizes anyway. By simply choosing a field size that is a multiple of the integer size, using the speedy int_strcmp becomes possible for those database records. This can be seen in the results of the 10Source01.cpp program.

Table 10.1 shows the complete results of the 10Source01.cpp program . Note that different systems may yield different absolute results, but the relations between the speeds of the various techniques should be fairly constant.

Table 10.1. Results of 10Source01.cpp
Compare C++ string 241
Compare strcmp 109
Memory compare 106
Compare macro 72
String 'n' compare 41
Compare ints 34

Note that the standard memory compare function memcmp basically yields the same results as the standard string compare strcmp. When you look at the source of 10Source01.cpp, you will also notice that memcmp needs to be told how many characters to compare. For straightforward string comparison, strcmp is, therefore, the better choice as it determines the string length without needing extra information—like a call to the strlen function, for instance. When comparing blocks of data that are not null terminated, you can, of course, use the strncmp variant, which does not necessarily need null terminators. Be advised, though, that string functions stop when they encounter a null byte!

Simple Pattern Search

Another action you are bound to need when handling blocks of data is that of finding a certain pattern of elements in a block of data. This can be compared to finding a substring in a string. Think, for instance, of using the Find function of a word processor to find occurrences of a word in a piece of text. This could, of course, be done by repeatedly calling strncmp to compare a search string against a piece of the text. Each call to strncmp would look at the same number of characters from the text, but each time starting at the next character in the text. Luckily, a standard function that does this already exists and it is called strstr. Usage could be as follows:

char *s1; // pointer to the text.
char s2[] = "substring to look for."
char *result;

if ((result = strstr(s1, s2)) != NULL)
{
    // do something
}

When string s2 can be found in text s1, the function strstr returns a pointer to the first byte of the first occurrence. Null is returned when string s2 cannot be found in text s1.

Faster Pattern Search

You can improve on standard string search mechanisms by not simply testing all possible substring combinations, but making informed decisions that allow you to skip ahead as much as possible. A whole range of algorithms have been derived from such a technique, which was developed by Boyer and Moore. This section presents a possible variation in order to demonstrate the basic principles. The exact algorithm variation that is best for a certain part of an application depends heavily on the data set it uses and the kind of search that is to be performed. The algorithm presented in this section can easily prove to be many times faster than the standard strstr function, but perhaps you can think of specific tweaks that improve it even more for the specific data set you want to use it on.

As with strstr, the basis of finding a search string in a block of text or a file is that of repeatedly comparing the search string against different substrings in the file. This time, however, you will use information gathered when a mismatch occurs to make an informed decision about where exactly the next substring should start. The remainder of this section refers to the string that is to be found as the search string, and the block of text it is to be found in as the text file.

At the heart of the algorithm is a loop that compares two strings: the search string and a substring of the text file. It does this character-for-character starting at the end of the strings and working its way back to the beginning of the strings:

j = len2;
while (textf[j] == searchstr[j] && --j >= 0) ;

When all the characters of both strings match, the algorithm is, of course, finished and it returns a pointer to the first character of the substring in the text file. When, however, a mismatch of characters is found, things become interesting:

if (j != -1)
{
     // Mismatch; align and check for text end.
}
else
{
     // Match; stop, return pointer.
}

A lookup character is taken from the text file. This is the first character following the substring which was just used in the comparison. When this lookup character can also be found in the search string, the search string is aligned with the text file in such a way that these two characters match up. Then comparison starts over. If, however, the lookup character is not part of the search string, the search string is aligned as follows: The first character of the search string is aligned with the character following the lookup character. Here is a textual example of how the search string methin is found in a text file containing the text no thing as vague as something.

  no thing as vague as something.
1 methin
2  methin
3         methin
4                methin
5                       methin
6                        methin

The very first comparison that is made is that of the search string methin being compared with the first six characters of the text file no thi. This is step 1.

As the comparison starts at the last character of each (sub)string, the letter n is compared with the letter i. These characters, of course, do not match, and so a lookup character is taken from the file. This is the first character in the file following the substring, in this case the letter n of the word thing. The letter n is also part of the search string and so the search string is aligned with its rightmost n (which is in fact its only n) corresponding with the lookup character. We arrive at step 2.

Comparison starts again and this time a whole range of characters match, four in total. Not all of the search string matches though, and the first mismatch is the letter e of the search string which does not match up with a space in the text file. Time for another alignment, it seems. The lookup this time is the letter g of the word thing in the text file. The search string does not contain a letter g and so it is aligned with its first character just past the lookup character. This way, we arrive at step 3.

Now things start to speed up. In this comparison, the n of the substring is compared with the a of the word vague in the text file. Again alignment can take place after the lookup character (another g ), sending us to step 4. In step 5 the same happens again, this time with the letter s as a lookup character.

The lookup character that brings us from step 5 to step 6 is the letter n of the word something in the text file. The search string contains a letter n and so it is aligned accordingly. This time the search string matches the substring of the text file completely. We have a winner!

Note that a simple moving strncmp would have needed 23 steps whereas this algorithm needs only six.

The implementation of this concept comes in two parts. One part is the actual search algorithm, the other is an initiating function which creates a table that is consulted by the search algorithm in order to determine where the next substring should start. Listing 10.2 shows an implementation of a search routine; it takes a pointer to the text file and the length of the text file as input (the search string itself is processed by the initiating function as you will see later):

Code Listing 10.2. Fast Pattern Match Search Function
char *search(const char *textf, int len)
{
    // Check if searchstr is longer than textf or zero length.
    if (search_len > len || search_len == 0)
        return NULL;

    unsigned char * end = (unsigned char*) textf + len;
    int len2 = search_len -1;

    int j;

    for(;;)
    {
        // main loop for comparing strings.
        j = len2;
        while (textf[j] == searchstr[j] && --j >= 0) ;

        if (j != -1)
        {
            // Mismatch; align and check for end of textf.
            unsigned char* skipindex = (unsigned char*) textf + search_len;
            if (skipindex >= end)
                return NULL;  // string not in textf.

            textf += skips[*skipindex];
        }
        else
        {
            // Match found.
            return (char*) textf;
        }
    }
    return NULL;
}

The main parts of this function have already been explained. Note, however, that there is an if at the beginning of the function to determine whether any search is needed at all. After this follow two calculations which prepare variables so they do not need to be recalculated during the main loop. The only other new bit consists of the four lines of code to be executed at the mismatch. These lines of code find the lookup character and use it to determine the alignment with a new substring. For this alignment the array skips is used. It contains the exact number of characters to skip for each possible lookup character. The skip value for a given lookup character is as follows: For every character that is not part of the search string, the skip value is equal to the number of characters in the search string plus one. For every character that is part of the search string, the skip value is its position in the search string counted from the end of the string. For our example search string methin, the skips array will therefore have the following values:

   m = 6
   e = 5
   t = 4
   h = 3
   i = 2

   n = 1
   Every other character = 7

This means that for each different search string that you want to find, you will have to make one call to a function that will create this array of skip values for you. Listing 10.3 shows an implementation of such a function: It takes a pointer to the search string as an input parameter.

Code Listing 10.3. Fast Pattern Search Initiation Function
#include <string.h>
unsigned char skips[256];  // possible nr of look-ups.
unsigned char search_len;  // search string length.
char * searchstr;          // actual search string.

void init(const char *str)
{
    search_len = strlen(str);
    int len2 = search_len + 1;

    // For chars not in searchstr.
    for (int i = 0; i < 256;i++)
        skips[i] = len2;  //length + 1.

    // For chars in searchstr, with double chars only
    // right most survives.
    for ( i = 0; i < search_len; i++)
        skips[str[i]] = search_len - i;  // pos back to front.

    searchstr = (char*) str;
}

Note that two loops are utilized to create the skips array. One loop sets the skip values for the characters that are not in the search string (in fact, it simply sets the whole array to this default value). The other loop sets specific values for the characters that are part of the search string. Because this second loop traverses the search string from front to back, something special happens to characters that occur in the search string more than once. Only the rightmost occurrence of such a character is remembered in the skips array. It is only possible to remember a single occurrence of each character because remembering each occurrence of a character, and using it, would simply cause too much overhead. The rightmost occurrence of a character is chosen as a survivor as it allows greater skips. Here now is an example of the use of the search algorithm:

char textfile[] = "no thing as vague as something.";

void main (void)
{

    init("methin");
    char *place = search(textfile, strlen(textfile));
}

Note that a search routine such as this one is not particularly suitable for simply trying to determine whether two strings are equal, as was the subject of previous sections. The reason for this is that a searching algorithm introduces some overhead before getting to the actual string compare work. This Fast Strings search does not, however, stop on encountering a null byte. This means it can be used without problem on any kind of (non-string) data set.

The nature of the text file and the search string determine how effective search string algorithms are. Dependencies are, for instance, the length of the search string, the length of the text file, the number of unique characters in the search string, and how often these characters appear in the text file. Generally it can be said, though, that the longer the search string is, and the less often its characters appear in the text file, the faster the fast search algorithm will be.

The complete source of the Fast String search can be found in the companion file: 10Source02.cpp , which compares searching a string with strstr to the Fast String search and a case-insensitive Fast String search (which is explained in the next section). As was noted at the beginning of this section, you can improve on this basic Fast String search by analyzing the data set you want to use it on. For instance, for text files of which the length is known—because it is fixed or perhaps because memory has just been allocated for it—no length calculation is needed in the call to the search function.

Case-Insensitive Fast String Search

When a pattern search is performed for strings of characters, often a match needs to be found regardless of the case of the characters in the search string or the text file. This is called a case-insensitive search. With a few minor changes the fast search string presented in this section can be transformed into such a case-insensitive search. To be able to do this kind of search quickly, you need two more data structures. The first new data structure will store a copy of the search string, containing only small characters. With this copy you can speed up the search considerably because you have fewer cases to test against: only small search string characters against lowercase or uppercase text file characters. You can of course dynamically allocate the exact amount of memory needed for this copy, but this will take time. If you can spare the footprint you may opt to use a static array. However, remember that the size of this array effectively determines the maximum size of the search string. Therefore it is important to choose the size of this array sensibly. In Listing 10.4, this array is called searchstr. The second new data structure will contain information on where in the search string characters of the alphabet occur. This structure also helps speed up the search but its main function is to guard against an obvious error; trying to match a nonalphabetic search string character against its capital in the text file. As you know, only alphabetic characters have a corresponding capital letter. This data structure is in effect a Boolean array and can be mapped onto anything from a bit array to an integer array. The choice here again is speed versus performance. Note that the chosen structure should be able to hold a value for each character in the search string; this means that its length should correspond to that chosen for the static search string array. In the code sample to follow this second data structure is a simple byte array called alfa.

Now let's look at the changes to be made to the Fast String search functions. First there is the init function, which was in charge of creating the skips array. It will now also initialize the searchstr array and the alfa array. Furthermore, the skips array will contain skip values for lowercase as well as uppercase letters found in the search string. An example:

Search string: methin

skips array: 6 5 4 3 2 1 6 5 4 3 2 1    7
alfa array:  1 1 1 1 1 1 1 1 1 1 1 1    0
             M E T H I N m e t h i n    other chars

Listing 10.4 shows an example of an init function which can be used for a case-insensitive Fast String search.

Code Listing 10.4. Case-Insensitive Pattern Search Initialization Function
unsigned char searchstr[256];     // Max pattern length is 256.
unsigned char alfa[256];          // Remember where alphabet chars are located.
unsigned char skipsInsens[256];   // Skips array.
unsigned char patlenInsens;       // Length of search pattern.


// Init function.
void initInsens(const char *str)
{
    int len1 = 0;

    // Get length and make 'smalls'copy.
    unsigned char c = str[len1];
    while ( c != ' 0')
    {
        alfa[len1] = 0;
        if (c >= 'A'&& c <= 'Z')
        {
            searchstr[len1] = c + 0x20;
            alfa[len1] = 1;
        }
        else
        {

            searchstr[len1] = c;
            if (c >= 'a'&& c <= 'z')
                alfa[len1] = 1;
        }
        c = str[++len1];
    }

    int len2 = len1 + 1;

    // For chars not in pattern.
    for (int i = 0; i < 255; i++)
        skipsInsens[i] = len2;  //length + 1.

    // For chars in pattern.
    //  with double chars only right most survives.
    for ( i = 0; i < len1; i++)
    {
        skipsInsens[searchstr[i]] = len1 - i;                   
        if (alfa[i]) skipsInsens[searchstr[i]-0x20] = len1 - i; 
    }


    patlenInsens = len1;
}

The changes to the search function are minor, apart from having to use the new names for the search string, the search string length, and the skips array (if you did actually decide to use new names as was done in this section), but the check on the character match needs to be changed from

while (textf[j] == searchstr[j] && --j >= 0) ;

to

while (
   ((textf[j] == searchstr[j]) ||
     ((alfa[j]) && (textf[j] == (char) (searchstr[j]-0x20))))
   && --j >= 0) ;

What this new check actually does is first check whether a character of the search string matches a character in the text file. If it does, the function continues as it normally would. If, however, a character does not match and it is an alphabetic character, another check is done with an uppercase version of the search string character. So for a mismatch of a nonalphabetic character—for example '%'—we get the following dialog:

Does '%'match? No, stop comparing.

And for a mismatch of an alphabet character—for example 'a'—we get the following dialog:

Does 'a'match? No. Does 'A'match? No, stop comparing.

The complete implementation of the case-insensitive search string function can be found in the file 10Source02.cpp, which compares searching a string with strstr to the Fast String search and the case-insensitive Fast String search.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset