This section looks at the basic techniques that can be used for manipulating binary files. Each technique is discussed in its own subsection. A Test Results subsection at the end of this section compares the execution speed of the techniques.
The FILE functions can be used to read from and write to files through standard IO. These functions were already part of C but can still be used in C++. Table 12.3 shows the various FILE functions.
Function | Description |
---|---|
fopen | Open a file for reading/writing of text/binary content |
fclose | Close an opened file |
fread | Read a block of data from an opened file |
fwrite | Write a block of data to an opened file |
fprintf | Print to an opened file (text) |
fscanf | Read variable values from an opened file (text) |
Listing 12.2 shows how FILE functions can be used to read data from one binary file and write it to another.
As was noted in Table 12.3, the fread() and fwrite() functions can be used to transfer blocks of bytes. These functions use two arguments to determine the block size. The first is the item size, denoting the size in bytes of the items in the file. The second is the number of items to be read or written. Item size multiplied by number of items equals the number of bytes transferred. In Listing 12.2, an item size of 1 was chosen because bytes are to be written and read. The number of items, therefore, effectively determines the block size. Choosing a small block size means extra overhead because more fread() and fwrite() calls are needed to transfer a file. Choosing a large block size means less overhead but a larger footprint in memory. The Test Results subsection compares the execution speed of the different binary file techniques for various block sizes.
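The block-size mechanics described above can be sketched as follows. This is a minimal illustration, not the book's listing: the function name copyFileStdio and its parameters are assumptions made for this example.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: copies inputName to outputName in chunks of
// blockSize bytes. Returns the number of bytes copied, or -1 on error.
long copyFileStdio(const char* inputName, const char* outputName,
                   std::size_t blockSize)
{
    if (blockSize == 0) return -1;

    std::FILE* in = std::fopen(inputName, "rb");
    if (in == nullptr) return -1;
    std::FILE* out = std::fopen(outputName, "wb");
    if (out == nullptr) {
        std::fclose(in);
        return -1;
    }

    std::vector<char> block(blockSize);
    long total = 0;
    bool ok = true;
    std::size_t got = 0;

    // Item size 1 and item count blockSize: fread() then returns the
    // number of bytes actually read, which is smaller for the last block.
    while ((got = std::fread(block.data(), 1, block.size(), in)) > 0) {
        if (std::fwrite(block.data(), 1, got, out) != got) {
            ok = false;
            break;
        }
        total += static_cast<long>(got);
    }

    if (std::ferror(in)) ok = false;
    std::fclose(in);
    if (std::fclose(out) != 0) ok = false;
    return ok ? total : -1;
}
```

Calling copyFileStdio("test.bin", "test.out", 4096) would transfer the file in 4,096-byte blocks; passing 1 instead reproduces the byte-at-a-time case with one fread() and one fwrite() call per byte.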
Of course, you will not always access files sequentially as was done in the previous listings. When you know how a file is built up (because it contains database records, for instance), you will want to access specific records directly without having to load the file into memory from its very beginning. What is needed is random access. You want to tell the file system exactly which block of bytes to load. As you have seen in the previous listings, a file is identified and manipulated through a pointer to a FILE object. This FILE object contains information on the file. It keeps, among other attributes, a pointer to the current position in the file. It is because of this attribute that it is possible to call, for instance, fread() several times, with each call giving you the next block from the file: the current position in the file is advanced after each call by the number of bytes read. Luckily, you can influence this current position directly through the fseek() function. fseek() places the new current position of a file at an offset from the start, the end, or the current position of a file:
int fseek(FILE *filepointer, long offset, int base);
fseek() returns 0 when it executes successfully. Table 12.4 shows the different definitions that can be used for the third argument, base.
Definition | Base of the offset |
---|---|
SEEK_CUR | Current position in the file |
SEEK_END | End of the file |
SEEK_SET | Beginning of the file |
Listing 12.3 demonstrates random file access using fseek() to load every tenth record of a file containing 100,000 records.
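As a rough sketch of this kind of access, the code below writes fixed-size records and then reads back every Nth one with fseek(). The Record layout and the function names writeRecords and sumEveryNth are hypothetical and chosen only for this illustration; the book's own listing may be organized differently.

```cpp
#include <cstdio>

// Hypothetical fixed-size record; the layout is an assumption.
struct Record {
    long id;
    char name[28];
};

// Writes count records with ids 0..count-1. Returns true on success.
bool writeRecords(const char* fileName, long count)
{
    std::FILE* db = std::fopen(fileName, "wb");
    if (db == nullptr) return false;
    bool ok = true;
    for (long i = 0; i < count && ok; ++i) {
        Record r = {};
        r.id = i;
        std::snprintf(r.name, sizeof r.name, "record %ld", i);
        ok = std::fwrite(&r, sizeof r, 1, db) == 1;
    }
    return std::fclose(db) == 0 && ok;
}

// Reads every step-th record directly and returns the sum of the ids
// that were read, or -1 on error.
long sumEveryNth(const char* fileName, long count, long step)
{
    if (step <= 0) return -1;
    std::FILE* db = std::fopen(fileName, "rb");
    if (db == nullptr) return -1;

    long sum = 0;
    for (long i = 0; i < count; i += step) {
        Record r;
        // Offset of record i, measured from the beginning of the file.
        const long offset = i * static_cast<long>(sizeof r);
        if (std::fseek(db, offset, SEEK_SET) != 0 ||
            std::fread(&r, sizeof r, 1, db) != 1) {
            std::fclose(db);
            return -1;
        }
        sum += r.id;
    }
    std::fclose(db);
    return sum;
}
```

Because every record has the same size, the position of record i is simply i multiplied by the record size, so no earlier record has to be read to reach it.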
In the Test Results section, FILE random access is compared to stream random access.
The classes ifstream and ofstream can be used to read from and write to files in C++. And of course C++ would not be C++ if it did not try to make life easier by deriving different classes from these two. Table 12.5 shows the different classes that can be used to perform file input as well as output.
Class | Description |
---|---|
fstream | File stream class for input and output |
stdiostream | File stream for standard IO files |
Table 12.6 shows the functions available on the C++ IO streams.
Table 12.7 shows the functions associated with IO streams which can be used to assess the status of a stream.
Function | Description |
---|---|
rdstate | Returns current IO status (possible bits: goodbit, eofbit, failbit, badbit; some older libraries also define hardfail) |
good | Returns != 0 when no error bit is set, including the end-of-file bit |
eof | Returns != 0 when the end of the file is reached |
fail | Returns a value != 0 when failbit or badbit is set, use rdstate to determine error |
bad | Returns a value != 0 when badbit is set, use rdstate to determine error |
clear | Resets the error bits |
Now look at how a C++ programmer can use a stream class to perform the same read/write behavior as was done in the section "Using FILE Functions." Listing 12.4 shows how stream functions can be used to read data from one binary file and write it to another.
Note that Listing 12.4 also contains some useful flags that can be passed when opening a stream. The flags are defined in C++ and can be used to force a certain behavior; ios::binary, for instance, opens a stream in binary mode. For a full list of flags consult language or compiler documentation.
Listing 12.4 actually reads and writes single characters from and to files. This is equivalent to using a block size of 1 for the functions discussed in the section "Using FILE Functions." Reading and writing can be done faster by choosing larger blocks of data to transfer, as you will see in Listing 12.5. Compare it with Listing 12.4 to see the difference in speed between the two techniques. The results of this test can be found in the later section "Test Results."
Apart from using a certain block size, it is also possible to define a buffer for a stream. Writing to a buffer can be faster because output is flushed only when the buffer becomes full. In effect, a buffer is used to combine several smaller write or read actions into a single large action. Listing 12.6 shows how buffers can be added to streams to collect read and write actions.
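One way to attach a buffer in standard C++ is through pubsetbuf() on the stream's file buffer, as in the sketch below. This is an illustration under assumptions: copyFileBuffered is a hypothetical name, and the effect of pubsetbuf() on a file buffer is implementation-defined, so some libraries may ignore the request or require the call at a different moment relative to open().

```cpp
#include <cstddef>
#include <fstream>
#include <vector>

// Hypothetical helper: copies a file one character at a time, but asks
// both streams to use caller-supplied buffers of bufferSize bytes.
// Returns the number of bytes copied, or -1 on error.
long copyFileBuffered(const char* inputName, const char* outputName,
                      std::size_t bufferSize)
{
    if (bufferSize == 0) return -1;

    // Declared before the streams so that the buffers outlive them.
    std::vector<char> inBuffer(bufferSize);
    std::vector<char> outBuffer(bufferSize);

    std::ifstream in;
    std::ofstream out;
    // Request the buffers before opening; whether this has any effect
    // is implementation-defined.
    in.rdbuf()->pubsetbuf(inBuffer.data(),
                          static_cast<std::streamsize>(inBuffer.size()));
    out.rdbuf()->pubsetbuf(outBuffer.data(),
                           static_cast<std::streamsize>(outBuffer.size()));
    in.open(inputName, std::ios::in | std::ios::binary);
    out.open(outputName,
             std::ios::out | std::ios::binary | std::ios::trunc);
    if (!in || !out) return -1;

    long total = 0;
    char c = 0;
    while (in.get(c)) {   // small reads are served from the buffer
        out.put(c);       // small writes are collected in the buffer
        ++total;
    }

    out.close();          // flushes whatever is still buffered
    if (in.bad() || !out) return -1;
    return total;
}
```

The copy loop still moves single characters, but most get() and put() calls only touch memory; the file itself is accessed when a buffer runs empty or becomes full.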
You will find a speed comparison between different stream accesses and traditional FILE functions in the Test Results section later on in this chapter.
As was the case with the FILE functions, streams also allow you to access files randomly. In the introduction to this section, you saw the two stream methods that allow you to do this: seekg() and seekp(). seekg() is used to manipulate the current pointer for input from the file (g = get), and seekp() is used to manipulate the current pointer for output to the file (p = put). Both seekg() and seekp() can be called with either one or two arguments. When one argument is given, this is seen as the offset from the beginning of the file; when two arguments are given, the second argument denotes the base of the offset. Values for this base can be found in Table 12.8.
Definition | Base of the offset |
---|---|
ios::cur | Current position in the file |
ios::end | End of the file |
ios::beg | Beginning of the file |
Listing 12.7 demonstrates random file access using seekp() and seekg() to load every tenth record of a file containing 100,000 records.
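The sketch below shows the same idea with a single fstream: records are positioned with seekp() when written and with seekg() when read back. The Record layout and the name sumEveryNthStream are hypothetical, introduced only for this example.

```cpp
#include <fstream>

// Hypothetical fixed-size record; the layout is an assumption.
struct Record {
    long id;
    char name[28];
};

// Creates fileName with count records (ids 0..count-1), then reads
// every step-th record and returns the sum of the ids read, or -1 on
// error.
long sumEveryNthStream(const char* fileName, long count, long step)
{
    if (step <= 0) return -1;

    std::fstream db(fileName, std::ios::in | std::ios::out |
                              std::ios::binary | std::ios::trunc);
    if (!db) return -1;

    const std::streamoff recordSize =
        static_cast<std::streamoff>(sizeof(Record));

    for (long i = 0; i < count; ++i) {
        Record r = {};
        r.id = i;
        // Two-argument form: offset relative to the given base.
        db.seekp(i * recordSize, std::ios::beg);
        db.write(reinterpret_cast<const char*>(&r), sizeof r);
    }
    if (!db) return -1;

    long sum = 0;
    for (long i = 0; i < count; i += step) {
        Record r;
        db.seekg(i * recordSize, std::ios::beg);
        db.read(reinterpret_cast<char*>(&r), sizeof r);
        if (!db) return -1;
        sum += r.id;
    }
    return sum;
}
```

For 100 records and a step of 10, the function reads the records with ids 0, 10, 20, and so on up to 90, without touching the records in between.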
In the Test Results section, stream random access is compared to FILE random access.
This section compares the speed of the different techniques for transferring data to and from binary files. The program in the file 12Source02.cpp that can be found on the Web site can be used to perform these timing tests.
The program reads its input from a file test.bin, which you should place in the directory c:\tmp. The test works best when you use a file of at least 64KB, because a block size of 64KB (65,536 bytes) is used in one of the tests. The program writes its output to the file c:\tmp\test.out, which it creates itself. You can also use different paths and filenames; just be sure to adjust the inputFilename string in main() accordingly.
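If you want to time your own variations, a small helper like the following can wrap any of the transfer techniques. It is a sketch that assumes a C++11 compiler and std::chrono; the name elapsedMilliseconds is hypothetical, and the timing method used by the program in 12Source02.cpp may well be different.

```cpp
#include <chrono>

// Hypothetical helper: runs work() once and returns the elapsed
// wall-clock time in whole milliseconds.
template <typename Work>
long long elapsedMilliseconds(Work work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               stop - start).count();
}
```

The work argument can be any callable, for instance a lambda that copies a file with a given block size. Results of a single run vary with disk caching, so repeating each measurement several times gives a more reliable picture.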
Table 12.9 presents the timing results generated by the program in the file 12Source02.cpp for reading and writing binary files.
Technique | Block Size | Time |
---|---|---|
StdIO | 1 | 4120 |
StdIO | 4096 | 260 |
StdIO | 65536 | 110 |
StreamIO | 1 | 1540 |
StreamIO | 4096 | 380 |
StreamIO | BUFFERED | 220 |
The first three rows of Table 12.9 show the results for the FILE functions using different block sizes (1, 4,096, and 65,536 bytes). Rows 4 and 5 show the results for the stream class with different block sizes (1 and 4,096) and row 6 shows the results for the stream class using buffered IO.
For the random-access test, the program in the file 12Source02.cpp creates a file called test.db in the directory c:\tmp. You can also use a different path and/or filename; just be sure to adjust the dbFilename string in main() accordingly.
Table 12.10 presents the timing results generated by this program.
Records | Stdio | Streams |
---|---|---|
100000 | 970 | 770 |
500000 | 6771 | 5310 |
1000000 | 22080 | 10990 |
In Table 12.10, you see not only that streams seem to be faster in random access than standard IO, but also that their advantage increases as more records are read. The reason for this is that the fstream class (which is used in this test) gets a buffer attached to it by default. This means that some IO requests will read from this buffer instead of from the file when the required data happens to be buffered. The standard IO functions cannot easily use a larger block size to retrieve more than a single block from the file because the blocks are no longer found sequentially. Once again this shows that it is crucial to think carefully about what kind of buffer and block size to use.