Hash Tables

The previous paragraphs showed that arrays and linked lists each have their own strengths and weaknesses. The array provides fast access to its elements but is difficult to extend; the linked list is easy to extend but does not provide very fast access to its elements. For databases with a large number of elements, you of course want the best of both worlds. The hash table provides a means to bring this about. By combining the implementations of arrays and lists, a hash table that is set up well is both fast to access and easy to extend.

Hash tables come in as many shapes and sizes as implementers and designers can imagine, but the basic premise is graphically explained in Figure 11.4.

Figure 11.4. Hash table example.


The hash table depicted in Figure 11.4 is basically an array of linked lists. As with the skip list from the previous section, the hash table helps you by ensuring that you no longer have to traverse a complete list to find an element. This is because the elements are stored in several smaller lists. A hash table is therefore a sort of skip array. Each list maintained by the array of the hash table is called a bucket. A bucket is thus identified by a single (hash) array element. In Figure 11.4 the hash table contains five buckets numbered 0–4. In order to determine which bucket could contain the element you are looking for, you need to have specific information on how the array orders its lists. As with sorting, the place where elements are stored in a hash table is determined by one of the fields of that element. This field is called the key. Elements containing employee data could, for instance, be stored according to Social Security number or name. The function that maps key values of elements to buckets (or indexes of the hash table array) is called the hash function. Here is an example of a simple hash function:

unsigned int hash(unsigned int keyvalue)
{
   return (keyvalue % HASH_ARRAY_SIZE);
}

This hash function divides the key value by the size of the hashing array and uses the remainder as an index. Note that when the size of the hash table is a power of two, the costly modulo operator (%) can be replaced by a simple bitwise mask. For example, when the hash table size is 128, the expression key % tableSize can be replaced by key & 0x7F. This way only bucket values ranging from 0–127 are in fact returned. Table 11.1 gives an overview of the most important hash table terminology.

Table 11.1. Hash Table Terminology
Term Definition
Key The part of an element by which the elements of the database will be sorted, searched, and so on (also called hash key).
Bucket A single position in the hash array, behind which another structure may lie to contain the elements of that bucket.
Hash Function A function that maps a key value to a bucket.
Perfect Hash Function A hash function that maps each unique key value to a unique bucket.
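The power-of-two mask mentioned above can be sketched as follows; the function names are illustrative and not part of the book's source:

```cpp
// Two equivalent bucket-index computations for a table of 128 buckets
// (a power of two); for any power-of-two size N the mask is N - 1.
const unsigned int TABLE_SIZE = 128;

unsigned int bucketByModulo(unsigned int key)
{
   return key % TABLE_SIZE;   // works for any table size
}

unsigned int bucketByMask(unsigned int key)
{
   return key & 0x7F;         // same result, avoids a division; 0x7F == 127
}
```

Whether the mask is actually faster depends on the compiler and target processor; many compilers perform this replacement themselves when the table size is a compile-time constant.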

It is not hard to imagine that the quality of the hash function directly determines the quality of the hash table. When a hash function maps too few or too many key values to a single bucket, the hash table will become very inefficient. The following two figures present extreme examples of what could happen. Figure 11.5 shows what happens when too few elements are mapped to a single bucket.

Figure 11.5. All buckets are used but contain only a single element.


When a hash function maps each key value to a separate bucket (perfect hash function), all buckets will eventually be in use but will contain only a single database element. What you have now is in effect an inefficient array. Each array slot contains a pointer to a single element. This means that memory (footprint) is used unwisely and element access is done through an extra indirection. But the good thing is that it's resizable! Figure 11.6 shows what happens when too many elements are mapped to a single bucket.

Figure 11.6. One bucket is used, containing all elements.


When a hash function maps each key value to the same bucket, only one bucket is used and it contains a list of all the database elements. What you have now is in effect a linked list with the added overhead created by an (almost empty) array.

It is clear that an ideal hash table would look far more balanced. In order to create a balanced hash table, the hash function needs to be able to divide the data elements evenly over the available buckets. Remember, however, that the more complicated this function is, the slower it will be. Hash functions are discussed in more detail in a following section.

Of course, hash table characteristics themselves also play an important part in what a hash table will look like when it is in use. Think, for instance, of the number of buckets used. With too few buckets, the lists that contain the bucket elements will become so large that element access slows down once more. Analysis of the data you will store should help you determine the desired number of buckets. Another thing to consider is how you want to store elements within a bucket. The simplest way is to just append new elements to the end of the list of a bucket. With sufficient buckets (and thus short lists per bucket) this may prove to be very practical. However, as list sizes increase, it may be a good idea to think about storing the elements inside a bucket in an ordered fashion.

At the beginning of this section it was noted that hash tables can come in many shapes and sizes. Basically, the data set you want to store will determine the optimal hash table implementation. An array of lists is a much-used mechanism, but there is no reason why a bucket cannot contain a simpler or even more complex data structure. Think of a bucket containing a data tree or yet another hash table. This second level hash table might even use a different hash function.

The file 11source01.cpp that can be found on the Web site compares the performance of hash tables against that of other data structures. There is a section at the end of this chapter that presents the timing results of these tests.

Inserting

Inserting elements into a hash table that is set up according to the example in this section can be a very fast operation (O(1)). This operation consists of determining the correct bucket (via the hash function) and adding the element to the linked list of that bucket. When the linked list of the bucket is ordered, the operation can take longer because the correct place for insertion needs to be determined. In the worst case, this is an O(L) operation, in which L is the number of elements already in the bucket.
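A minimal sketch of this O(1) insertion for the unordered case, prepending to the bucket's list (the five buckets follow Figure 11.4; all names here are illustrative, not the book's actual source):

```cpp
#include <cstddef>

const int HASH_ARRAY_SIZE = 5;          // five buckets, as in Figure 11.4

struct Element
{
   int      key;
   Element *next;
};

Element *table[HASH_ARRAY_SIZE];        // all buckets start out empty (NULL)

unsigned int hash(int key)
{
   return (unsigned int)key % HASH_ARRAY_SIZE;
}

// O(1): compute the bucket, then prepend to its (unordered) list.
void insertElement(Element *e)
{
   unsigned int bucket = hash(e->key);
   e->next       = table[bucket];
   table[bucket] = e;
}
```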

Deleting

Deleting an element from a hash table that is set up according to the example in this section is also an O(L) operation (worst case), in which L is the number of elements in the bucket from which an element will be deleted. This is because, whether or not the list is ordered, the element to be deleted has to be found.

Searching

Searching for an element in a hash table that is set up according to the example in this section can be very fast. Searching ranges from an O(1) operation (best case) to an O(L) operation (worst case), in which L is the maximum number of elements in a bucket.
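A sketch of this search, again with illustrative names; note that only the list of the selected bucket is walked:

```cpp
#include <cstddef>

const int HASH_ARRAY_SIZE = 5;

struct Element
{
   int      key;
   Element *next;
};

// O(1) best case (first element of the bucket), O(L) worst case, where L
// is the number of elements in that bucket. Returns NULL when not found.
Element *findElement(Element *table[], int key)
{
   Element *e = table[(unsigned int)key % HASH_ARRAY_SIZE];
   while (e != NULL && e->key != key)
      e = e->next;
   return e;
}
```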

Traversing

Traversing a hash table that is set up according to the example in this section can be quite easy when all the possible key values are known. In that case, the hash function can be called for ascending key values, thus finding the buckets. Within the buckets, the elements can be traversed in the same manner as with any linked list. However, when all possible key values are not known, there is little left to do but traverse the available buckets in some order—ascending, for instance.
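The bucket-order traversal described above can be sketched as follows; for brevity the visit action simply counts the stored elements (names are illustrative):

```cpp
#include <cstddef>

const int HASH_ARRAY_SIZE = 5;

struct Element
{
   int      key;
   Element *next;
};

// Visit every stored element by walking the buckets in ascending order;
// within each bucket the list is traversed like any linked list.
int countElements(Element *table[])
{
   int count = 0;
   for (int bucket = 0; bucket < HASH_ARRAY_SIZE; bucket++)
      for (Element *e = table[bucket]; e != NULL; e = e->next)
         count++;
   return count;
}
```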

Sorting

Sorting hash tables is not generally done. The hash function is used in order to make a mapping from key value to bucket. Doing any external sorting on this mapping means that elements can no longer be found via the hash function. Note, however, that a hash function says nothing about the order of elements within a bucket. As stated before, it can prove to be a good idea to keep the elements in the buckets in some kind of predetermined order. The sorting mechanism that can be used depends on the structure that is used for the buckets. For a hash table that is set up according to the example in this section, the sorting implications for buckets are the same as those specified in the section on sorting linked lists.

Hash Functions

In the previous section, you saw that finding elements in a hash table is fastest when you use a perfect hash function. This function guarantees that each possible key value will be mapped to a unique bucket number. This implies that each bucket contains at most a single element, and your hash table is in effect an array in which you can use the key value as an index. However, perfect hash functions are not always easy to find. When a hash function maps more than one key value to the same bucket number, this is called a collision. When the hash function cannot satisfactorily be rewritten into a perfect hash function, you have two options for dealing with collisions:

  1. Allow more than one element to be placed in a bucket.

    This option was demonstrated in the previous section, where each bucket basically consisted of a linked list. The number of elements that can be placed in such a bucket is limited only by memory. Using linked lists to represent buckets is called chaining, because the elements in a bucket are chained together. Other structures are of course also possible, such as trees, arrays, or even further hash tables.

  2. Do a fix on the collision.

    It is also possible to try to fix collisions. In doing this you repeatedly look for a new bucket until an empty one is found. The important thing to realize here is that the steps to find an empty bucket must be repeatable. You must be able to find an element in the hash table using the same steps that were taken to store it initially. Collisions can of course be fixed in numerous ways. Table 11.2 contains popular collision fixes.

Table 11.2. Popular Hash Table Collision Fixes
Fix What You Do
Linear probe Do a linear search starting at the collision bucket until you find an empty bucket.
Rehashing Perform another hashing, with a different hash function, perhaps even on a different key of the element.
Overflow area Reserve a separate area for colliding keys. This could be a second hash table, perhaps with its own hash function.
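A minimal open-addressing sketch of the linear-probe fix from Table 11.2 (keys are assumed non-negative; all names are illustrative):

```cpp
const int TABLE_SIZE = 8;
const int EMPTY      = -1;     // sentinel; keys are assumed non-negative

int slots[TABLE_SIZE] = { EMPTY, EMPTY, EMPTY, EMPTY,
                          EMPTY, EMPTY, EMPTY, EMPTY };

// Store key in its home bucket or, on a collision, in the next empty
// bucket found by a linear (wrapping) search. Returns the bucket used,
// or -1 when the table is full. A lookup must repeat these same steps.
int probeInsert(int key)
{
   int bucket = key % TABLE_SIZE;
   for (int i = 0; i < TABLE_SIZE; i++)
   {
      if (slots[bucket] == EMPTY)
      {
         slots[bucket] = key;
         return bucket;
      }
      bucket = (bucket + 1) % TABLE_SIZE;   // collision: try the next bucket
   }
   return -1;                               // no empty bucket left
}
```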

Requirements for hash functions are clearly pretty strict. If a hash function is not a perfect hash function, it at least has to be perfect in combination with other strategies (option 2) or it should allow a good average distribution of elements over available buckets (option 1). In both cases though, the hash function you use for your hash table is dependent on the kind and amount of data you expect, and the nature of the key that is used. This section shows some sample hash functions for different data types. Listing 11.3 shows a simple hash function for string keys.

Code Listing 11.3. Simple Hash Function for String Keys
// Hashing Strings
int Hash(char* str, int tableSize)
{
   int h = 0;
   while (*str != '\0')
      h += *str++;

   return h % tableSize;
}

The hash function in Listing 11.3 simply adds the character values of a string and uses the modulo operator to ensure that the range of the returned bucket index is the same as the range of the available buckets. This hash function does not take the position of characters into account; therefore, anagrams will be sent to the same bucket. If this is undesirable, a distinction can easily be made by, for instance, multiplying each character by its position or adding a value to a character that depends on its position. Of course, you can also decide to write a hash function that uses two or more fields of your own data types. Listing 11.4 shows a hash function for a user-defined date key.

Code Listing 11.4. Hash Function for User-Defined Date Keys
// Hashing Dates
struct date
{
   int   year;
   char  day;
   char  month;
};

int Hash(date *d, int tableSize)
{
   if (tableSize <= 12)
      return (int) (d->month % tableSize);
   if (tableSize <= 31)
      return (int) (d->day % tableSize);

   return (((d->year << 9) +
            (d->month << 5) +
            (d->day))
            % tableSize);
}

The hash function in Listing 11.4 uses a number of the fields of a data type, depending on the number of available buckets. When this date type is used with a hash table of 12 buckets or fewer, the month is used as a bucket index, again taking the number of available buckets into account. If this type is used with a hash table of 31 buckets or fewer, the day is used as a bucket index. For larger numbers of buckets a combination of year, month, and day is used as a bucket index. With this example date type it is not hard to imagine that a hash function can be refined using analytical information on the expected data. If, for instance, a hash table is used to store insurance information using claim dates, you might expect a large number of elements to be added around the month of December. Similarly, the months June and July might generally have fewer than their fair share of claims. A good hash function could take these kinds of known characteristics into account by spreading the load of December over several buckets and/or combining the load of the June and July buckets.
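Returning to string keys for a moment: the position-dependent refinement suggested after Listing 11.3 could look like the following hypothetical sketch, which weights each character by its position so that anagrams no longer share a bucket:

```cpp
// Position-weighted variant of the string hash in Listing 11.3
// (a hypothetical refinement, not one of the book's listings).
int HashPos(const char *str, int tableSize)
{
   unsigned int h   = 0;
   unsigned int pos = 1;
   while (*str != '\0')
      h += (unsigned char)*str++ * pos++;   // weight character by position

   return (int)(h % tableSize);
}
```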

Another way to map a numeric key value to a bucket is the multiplication method. Instead of using division, as was done in the first sample hash function, multiplication is used. In the multiplication method, the hash function determines the bucket by multiplying the key by a constant and then taking the most significant bits of that outcome. Again, this is simplest when the hash table size—and therefore the range of the buckets—is a power of two. This is because you can take a fixed number of bits from the multiplication outcome. You do this by shifting the outcome to the right over the number of least significant bits that you want to drop:

hash = (key * c) >> nr_of_bits_to_skip;

For instance, when the key is an 8-bit value, and the constant c has the value of 128, the maximum value of the expression (key * c) will be 255 * 128 = 32640. This value can be represented by a 15-bit number (2^15 = 32768). When the hash table size is 256 buckets, you want a bucket index that is 8 bits in size; therefore, the nr_of_bits_to_skip is then 15 – 8 = 7. This means that by doing seven bitwise shifts to the right, you end up with the eight most significant bits of the multiplication outcome as your bucket index. Similarly, when the hash table size is only 128 buckets, the nr_of_bits_to_skip equals 15 – 7 = 8. This is because the bucket index must range from 0–127, which can be captured in a 7-bit answer. There is one problem left, however, which is determining the optimal value of the constant c. You can run tests on a representative data set to determine the value of c that gives you a desirable spread of elements over buckets. Knuth (in his book The Art of Computer Programming) suggests using the following value for c:

c = range_of_the_key * R

in which range_of_the_key is 256 for an 8-bit key, 65536 for a 16-bit key, and so on, and R is (sqrt(5) - 1) / 2, or approximately 0.6180.

Here are some example calculations for multiplication hash functions using Knuth's suggested constants:

Example 1:
Key size       : 8 bits
Hash table size: 256 buckets

hash = (key * 158) >> (16 - 8);


Example 2:
Key size       : 16 bits
Hash table size: 128 buckets

hash = (key * 40503) >> (32 - 7);
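Example 2 can be written out as a small sketch; unsigned long arithmetic is used so the 32-bit intermediate product cannot overflow (the function name is illustrative):

```cpp
// Multiplication hash from Example 2: 16-bit key, 128 buckets,
// c = 40503 (65536 * 0.6180, rounded). The product fits in 32 bits,
// so shifting right by 32 - 7 = 25 keeps the 7 most significant bits.
unsigned long multHash(unsigned long key)   // key assumed to fit in 16 bits
{
   return (key * 40503ul) >> (32 - 7);
}
```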
