Efficient Text File IO

File access for text files is done through the same functions (and classes) as file access for binary files. You determine whether to treat a file as a text file or as a binary file through a flag when opening the file. The difference between files that are opened as text files and those which are opened as binary files lies in whether or not the accessing functions interpret Carriage Return (CR) and Line Feed (LF) codes found in the file data. A binary file is seen as one long string of byte values, whereas a text file is seen as a collection of strings separated by CR and/or LF codes. Generally, programmers open and create text files only when they are to contain actual character data (such as that defined by ASCII or Unicode). Giving CR and LF codes a special meaning allows you to specify further functionality, which can be handy when working with files that will contain text. It is very easy, for instance, to write functionality that searches for a certain line number in a text file or counts the number of lines in a file. This section looks into the different uses of text files in programming and supplies ready-to-use text file functionality, which can speed up text file access in programming.

Why Text Files?

You might wonder why it could still be a good idea to use text files in situations when binary files can store the same data more efficiently. Not only is it possible to omit CR/LF codes from binary files, but storing byte values as text strings can take up two or three times as much space—a single byte with the numerical value 255 is represented in text by a string of, for instance, two hexadecimal characters ('F', 'F') or three decimal characters ('2', '5', '5'). Moreover, compression can be used in binary files. (A compressed text file cannot really be called a text file anymore.) However, text files still have their place, even when we are talking about storing data. Proof of this is, for instance, the fact that many configuration files used under UNIX and Windows are still text files. Even Internet pages, described in HTML, are basically text files with predetermined formatting, not to mention the fact that your compiler uses text files (called source files) as input. The section looks at different uses of text files.

Text Files as Configuration Files

The most important reason for choosing text files as configuration files is that they can be read and changed by users without the use of special tools or programs. The program or process that is being configured does not even need to be started; any simple text editor will do the job. This can be a real lifesaver when a program is unable to start properly because of a faulty configuration! It also saves development time because programmers no longer have to write and test functionality to take the user's configuration input from a user interface environment and translate it to (or from) an actual binary configuration file. Another advantage is the fact that human-readable configuration files can perhaps be reconstructed by users when they become corrupted, and different versions of the files can be created through simple copy and paste actions. These configuration files can even be faxed to wherever this kind of support is needed.

Text Files as Data Storage

It is not always necessary to write a complete database management system in order to store data on file. In Chapter 10, "Blocks of Data," you saw how a text file can be used as a database. Again the advantages apply, as given in the section "Text Files as Configuration Files." Another advantage is that this kind of database file can easily be shared across platforms.

When using text files as data storage you effectively have two choices of implementation. You can use either fixed-length records, in which a certain record can be found by multiplying its number by the number of bytes contained in a record, or variable-length records (records containing variable-length fields), in which you appoint special characters as field separators. The following examples show different choices of implementation for a database that contains records with the following fields: item type, item color, item material, and item order number.

// Impl 1:Using fixed length records:
stool          red          wood       125600
chair          blue         iron       128000
rocking chair  aquamarine   cast iron  130000

// Impl 2:Using variable length fields with field separators:
stool#red#wood#125600
chair#blue#iron#128000
rocking chair#aquamarine#cast iron#130000

// Impl 3:Using field and record separators:
stool#red#wood#125600@chair#blue#iron#128000@rocking chair#aquamarine# cast iron#130000

When you are using length records you pay a footprint penalty because each record in the database, regardless of its content, is the same size. This size is determined by the largest possible value of each field. The upside to this implementation is that searching through the database is fast (O(1)) because you can use direct access. The first byte of record 15 is found at position 15 times the record size in bytes. The opposite is true for variable length records. Each record is made to fit exactly its content, and so footprint is less of an issue. However, access to the records needs to be sequential (O(n)), basically by counting the field separators. The implementation that is best for a given situation depends on the footprint requirements and the type of access which will be used most often.

When using text files as data storage, simple text functions can be used to manipulate the database records. Listing 12.8 shows a function that can be used to retrieve the value of a certain field from a text database which has been loaded into memory.

Code Listing 12.8. Getting a Field from a String
char *GetField(char *s, char c, int n, char *fld)
{
    int i=0, j=0, lens = strlen(s);

    if (n < 1)
        return (NULL);

    if (n > 1)  // Not the first field
    {
        i = GetOccurrence(s, c, n-1);
        if (i == -1)
             return(NULL);
        else
             i++;
    }

    j = GetOccurrence(s, c, n);
    if (j == -1)
        j = lens;

    if ( (j <= i) || (i > (int)lens) || (j > (int)lens) )
        return(NULL);

    // the part between i and j _is_ the field
    strncpy(fld, &s[i], j-i);
    fld[j-i] =' 0';
    return(fld);
}

The function in Listing 12.8 takes four arguments: a pointer to the loaded database (s), a character that is used as a field separator (c), the number of the field to be found (n), and a pointer to a buffer into which the field is to be copied (fld). The function assumes that there is no field separator before the first field and after the last, and it can be called as follows:

GetField(s, '#', 4, tmp);

This call will attempt to find the fourth field in the database and copy its value in the tmp (fields are separated by a '#'character).

But GetField() does not do all the work on its own; it makes use of a function called GetOccurrence() to find the next occurrence of a field separator. Listing 12.9 shows an implementation of GetOccurrence() .

Code Listing 12.9. Finding a Character in a String
int GetOccurrence(char *s, char c, int n)
{  // Return the position of the Nth occurrence of character c in s
  // Return -1 if not found
  int i, j=0;

      for (i=0; i < (int)strlen(s); i++)
      {
          if (s[i] == c)
          {
             j++;
             if (j >= n)
                return(i);
          }
      }
      return(-1);
}

Of course Listings 12.8 and 12.9 just show one way of finding fields. You can opt to write a GetRecord() function, for instance, which takes as an input argument the number of fields per record so it can locate a specific record in a database.

The TextFile Class

As you saw in the section "Efficient Binary File IO," in order to achieve fast file access it is important to use intelligent buffering. You also read in the introduction to this section that binary files and text files are basically the same thing, apart from how CR and LF are interpreted by the access functions. This means that the theory introduced in Efficient Binary File IO can also be used for text files, as long as CR and LF are treated correctly. This is good news because when you decide to use text files as data storage or configuration files, you still want fast access, even when the files become large. This section introduces a new class which does just that. It is called TextFile and can be found in the files 12Source03.cpp and 12Source03.h on the Web site.

The TextFile class essentially performs buffered binary file IO. This means that each file access will read a block of binary data from a file to a buffer, or write a block of binary data from a buffer to a file. Meanwhile, the user of the class can access the buffered data without incurring disc access overhead for each access. Basically this is what you already saw in the section Efficient Binary File IO; however, because the TextFile class knows that it is actually treating text files, it gives you some very handy functionality; it allows you to treat the buffer as a collection of lines of text—remember, a line in a text file is terminated by a combination of CR and LF (under UNIX, just LF). This means that you keep reading lines from the TextFile class as you need them, whereas it takes care of disc access and buffering. The same is also true for writing to a file.

Because the theory behind the TextFile class can be found in the section "Efficient Binary File IO," this section only gives a description of the TextFile interface and examples of how to use it. Table 12.11 contains the interface description of the TextFile class .

Table 12.11. TextFile Interface
Function: TextFile()
Type: Constructor
Remarks: The constructor initiates member variables.
Function: ~TextFile()
Type: Destructor
Remarks: The destructor calls the close function.
Function: Open()
Type: Public member function.
Argument 1: char*, pointer to a user-managed filename string. This string can contain path elements such as drive and directory names.
Argument 2: char (optional), either 'r' for an input file to read from, or 'w' for an output file to write to. The default value for argument 2 = 'r'.
Return value: int, 0 when the file could not be opened.
Remarks: Opens the specified file for the specified action.
Function: close()
Type: Public member function.
Remarks: Closes the current file if it is still open. When the file is opened for writing, the buffer is flushed to file if it is not yet empty.
Function: getline()
Type: Public member function.
Argument 1: char*, pointer to a user-managed buffer which receives the next line from the file.
Return value: int, 0 when the end of the file is reached.-1 when no file is opened or the file is opened for writing.n when successful, n denotes number of characters in the buffer including terminating zero.
Remarks: Reads the next linefrom the file into the user-managed buffer, unless the file is not opened for reading or the end of the file is reached.
Function: putline()
Type: Public member function.
Argument 1: char*, pointer to a user-managed buffer which contains the next line to be written to file.
Argument 2: int (optional), format to be used for line termination; options are CR, LF, CRLF, LFCR; by default this argument is set to CRLF.
Return value: int
Remarks: int, 0 when there is an error while writing to the file.-1 when no file is opened or the file is opened for reading.1 when successful.

You can use the TextFile class via normal instantiation, as the following example demonstrates:

char tmp[255];               // dependent on how long you make your file lines.
TextFile file;

if(file.open("test.txt") == 0)  // open default = 'for reading'.
    return;

while (file.getline(tmp) > 0)    // read all the lines.
{

And of course it is also possible to use it via inheritance, as the following example demonstrates:

class MyFileClass : public TextFile
{
    MyFileClass():TextFile(){ }
    MyOpen(char *s)
    {
        // do something;
        open(s);    // call to TextFile::open();

Searching Text Files

How efficiently you search through text files depends on two things: first, how efficiently you load data from the file, and, second, how efficiently you search through the loaded data. Both subjects have already been covered in this book. You have a very good basis for efficient text file searching when you combine the Fast Pattern Search introduced in Chapter 10 with the enhanced file IO made available through the TextFile class introduced in the previous section. The following example shows how you can combine these two concepts:

char tmp[255], *c;
int n;

init(SearchString);                   // init function for the search.
while ((n = file.getline(tmp)) > 0)    // read all the lines.
{
    if ((c = search(tmp, n)) != NULL) // search a line.
        break;

Doing a case-insensitive search for the SearchString() is accomplished by calling the case-insensitive versions of the fast pattern match functions (see Chapter 10). A more complicated and generic way of searching through text files is the topic of the following section.

Searching Using Wildcards

Searching with the aid of wildcards is used when a match is needed for a generic pattern instead of an exact string. This mechanism is often used in command-line interfaces such as DOS. A wildcard search to reveal all text files in a certain directory can be initiated, for instance, with the following command:

dir *.txt

The '*'indicates that any number of characters (even 0) and any kind of character can appear in that position. In this case it means that any string ending in '.txt'will match. Another special character that can be used in wildcards is '?', indicating that any (single) character can be placed in that position—sort of like a joker in a card game. The command

dir picture?.bmp

matches all files with the name 'picture'followed by one more character and ending with '.bmp'. Possible matches are picture1.bmp, picture2.bmp, pictureA.bmp, but not picture10.bmp or picture.bmp.

It is likely that you might want to perform such a wildcard match on a text file or a text string in some of your programs at some point. Think, for instance, of finding function names in a source file using the search pattern "*(*)". Listings 12.10 and 12.11 show a function that can be used for wildcard matching. Table 12.12 explains its interface.

Table 12.12. The CompareStr() Arguments
Variable Explanation
Argument 1 char*, a pointer to the string to compare with.
Argument 2 char*, a pointer to a string specifying the search string-possibly containing wildcard characters.
Argument 3 bool, Boolean indicating whether a case-sensitive search (0) or a case-insensitive search (1) is wanted.
Argument 4 bool, Boolean indicating whether argument 1 might contain additional characters after the pattern.
Return value bool, indicating whether argument 1 and argument 2 match according to wildcard rules. (0 = no match. 1 = match.)

Code Listing 12.10. Pattern Matching Using Wildcards, Part 1
#define anychars    '*'
#define anysinglechar  '?'

int CompareStr(char *s, char *pattern, bool ignoreCase, bool allowTrailing)
{
    long l, m, n = 0L;
    long lens = strlen(s), lenpattern = strlen(pattern);

    for (l=0, m=0; (l < lenpattern) && (m < lens); l++, m++)
    {  // walk the pattern and compare it to the string
        if (pattern[l] != anychars)   // normal character or '?' ?
        {
            if (pattern[l] != anysinglechar)
            {
                if (ignoreCase == true)
                {
                    if (toupper(pattern[l]) != toupper(s[m]))
                        return(false);
                }
                else
                    if (pattern[l] != s[m])
                         return(false); // Character MISMATCH
         }  // else NO prove that we don't match
        }
        else // We've got an expandable WILDCARD (*), Lets have a closer look!!
        {
        l++;
            for (n = m; (n < lens) ; n++)  // try to find a working combination
                if (CompareStr(&s[n], &pattern[l], ignoreCase, allowTrailing))
                     return(true);       // found == full match -> communicate
                                         // to callers
            return(false);      // couldn't find a working combination
        }
    }

The function CompareStr() walks through the string and the pattern, trying to match up the characters. As long as no wildcard characters are detected in the pattern, it does a normal character compare. However, as soon as a wildcard is detected something special has to happen. The '?'character (anysinglechar) is easy enough to deal with; it matches with any character in the string so no compare is actually needed. The '*'character (anychars) is more complicated. When a '*'is detected, a match must be found between the rest of the pattern after the '*'character and some part of the string. In order to do this, CompareStr() calls itself recursively, but each time starting with a different part of the string.

But this is not the end of the function, for at this point there are three possible situations:

  • The end of both the string and the pattern has been reached. There is nothing left to compare and no mismatch occurred, which means the pattern matches the string.

  • The end of the string has been reached but there are still characters left in the pattern. In this case, the string and the pattern match only if the remaining characters in the pattern are all '*'.

  • The end of the pattern has been reached but there are still characters left in the string. In this case the string and the pattern match only if the last character of the pattern is a '*'.

Listing 12.11 shows the code which takes care of these three cases.

Code Listing 12.11. Pattern Matching Using Wildcards, Part 2
    if (l < lenpattern) // Trailing *'s are OK too!
    {
        for (n = l; n < lenpattern; n++)
        if (pattern[n] != anychars)
                return(false);
    }
    else
    {
         if (pattern[l-1] != anychars)
        {
            if ((m < lens) && (allowTrailing == false))
                return(false);
        }
    }
    return(true);  // If we're still here, we've found a match!!
}

The CompareStr() function can also be found in the file 12Source04.cpp on the Web site.

A very simple way to do a pattern search on a larger string (or a text file loaded into memory), is shown in Listing 12.12.

Code Listing 12.12. Using CompareStr() to Do a Pattern Match on a Text File
char *resultArray[MAXRESULTS];
int  index = 0, n = 0;

while (n < lengthOfFile && index < MAXRESULTS)
{
    if (CompareStr(&file[n], pattern, 1, 1))    // Case Insensitive,
                                                // allow trailing.
        resultArray[index++] = &file[n];
    n++;
}

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset