Hash tables

A hash table is a completely different kind of searchable structure. The idea starts from what is called a hash function: a function that returns an integer for any value of the desired type. For example, a hash function for strings must return an integer for every string. Java requires every class to have a hashCode() method. The Object class provides a default implementation, but we must override it whenever we override the equals method; a small sketch illustrating this follows the list of properties below. A hash function must hold the following properties:

  • Same values must always return the same hash value. This is called the consistency of the hash. In Java, this means that if x and y are two objects and x.equals(y) is true, then x.hashCode() == y.hashCode().
  • Different values may return the same hash, but it is preferred that they don't.
  • The hash function is computable in constant time.

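To make the consistency requirement concrete, here is a minimal sketch of a hypothetical Point class (purely illustrative, not part of our hash table code) that overrides hashCode together with equals, so that equal points always produce equal hash values:

public class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Point)) {
            return false;
        }
        Point p = (Point) other;
        return x == p.x && y == p.y;
    }

    @Override
    public int hashCode() {
        // Combine exactly the fields compared in equals, so that
        // x.equals(y) implies x.hashCode() == y.hashCode().
        return 31 * x + y;
    }
}
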
A perfect hash function will always provide a different hash value for different values. However, such a hash function cannot be computed in constant time in general. So, we normally resort to generating hash values that look seemingly random but are really complicated functions of the value itself. For example, hashcode of the String class looks like this:

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

Notice that it is a fairly complicated function computed from the constituent characters of the string.

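The second property listed earlier, that different values may share a hash, is easy to observe with this function. The following small check (illustrative only) prints two distinct strings whose hash codes collide:

public class HashCollisionDemo {
    public static void main(String[] args) {
        // "Aa" and "BB" are different strings, but 31 * 'A' + 'a' and
        // 31 * 'B' + 'B' both evaluate to 2112, so their hash codes collide.
        System.out.println("Aa".hashCode()); // 2112
        System.out.println("BB".hashCode()); // 2112
    }
}
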
A hash table keeps an array of buckets indexed by the hash code. A bucket can be implemented with many kinds of data structures, but here we will use a linked list. This makes it possible to jump to the right bucket in constant time, and each bucket is kept small enough that a search within it, even a linear one, does not cost much.

Let's create a skeleton class for our hash table:

public class HashTable<E> {
    protected LinkedList<E>[] buckets;      // array of buckets, indexed by hash
    protected double maximumLoadFactor;     // rehash when the load factor exceeds this
    protected int totalValues;              // number of values currently stored

    public HashTable(int initialSize, double maximumLoadFactor) {
        buckets = new LinkedList[initialSize];
        this.maximumLoadFactor = maximumLoadFactor;
    }
    …
}

We accept two parameters: initialSize is the initial number of buckets we want to start with, and maximumLoadFactor is the maximum load factor we will allow.

What is the load factor? The load factor is the average number of values per bucket. If the number of buckets is k and the total number of values stored is n, then the load factor is n/k. For example, 20 values spread across 8 buckets give a load factor of 20/8 = 2.5.

Insertion

Insertion is done by first computing the hash and picking the bucket at that index. The bucket is then searched linearly for the value. If the value is found, the insertion is not carried out; otherwise, the new value is appended to the end of the bucket.

First, we create a function that inserts into a given array of buckets; we will then use it to perform the actual insertion. Keeping it separate will be useful when we dynamically grow the hash table:

    protected boolean insert(E value, int arrayLength,
                             LinkedList<E>[] array) {
        int hashCode = value.hashCode();
        // Map the hash code to a valid bucket index. Math.floorMod is used
        // instead of % so that negative hash codes also give a valid index.
        int arrayIndex = Math.floorMod(hashCode, arrayLength);
        LinkedList<E> bucket = array[arrayIndex];
        if (bucket == null) {
            // The bucket is created lazily on first use.
            bucket = new LinkedList<>();
            array[arrayIndex] = bucket;
        }
        // Linear search within the bucket; duplicates are not inserted.
        for (E element : bucket) {
            if (element.equals(value)) {
                return false;
            }
        }
        bucket.appendLast(value);
        totalValues++;
        return true;
    }

Note that the effective hash code is computed by taking the remainder of the actual hash code divided by the number of buckets. This is done to limit the hash code to the range of valid bucket indexes; Math.floorMod is used rather than the % operator so that the index stays non-negative even when the hash code itself is negative.

There is one more thing to be done here, and that is rehashing. Rehashing is the process of dynamically growing the hash table as soon as it exceeds a predefined load factor (or, in some cases, due to other conditions, but we will use the load factor in this text). Rehashing is done by creating a second array of buckets of a bigger size and copying each element to the new set of buckets. The old array of buckets is then discarded. We create this function as follows:

    protected void rehash() {
        double loadFactor = ((double) totalValues) / buckets.length;
        if (loadFactor > maximumLoadFactor) {
            // Double the number of buckets and re-insert every element.
            LinkedList<E>[] newBuckets = new LinkedList[buckets.length * 2];
            totalValues = 0;
            for (LinkedList<E> bucket : buckets) {
                if (bucket != null) {
                    for (E element : bucket) {
                        insert(element, newBuckets.length, newBuckets);
                    }
                }
            }
            this.buckets = newBuckets;
        }
    }

Now we can have our completed insert function for a value:

    public boolean insert(E value){
        int arrayLength = buckets.length;
        LinkedList<E>[] array = buckets;
        boolean inserted = insert(value, arrayLength, array);
        if(inserted)
            rehash();
        return inserted;
    }

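As a quick illustration of how the class might be used, here is a small sketch (it assumes our HashTable and the LinkedList class from earlier in the book are compiled together):

public class HashTableDemo {
    public static void main(String[] args) {
        // Start with 8 buckets and allow an average of at most 2 values per bucket.
        HashTable<String> table = new HashTable<>(8, 2.0);
        System.out.println(table.insert("apple"));   // true, newly inserted
        System.out.println(table.insert("banana"));  // true, newly inserted
        System.out.println(table.insert("apple"));   // false, already present
    }
}
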
The complexity of insertion

It is easy to see that the insert operation takes almost constant time unless a rehash is required; in that case, it is O(n). So how many times do we have to rehash? Suppose the load factor is l and the number of buckets is b, and say we start from an initial size B. Since we double the number of buckets every time we rehash, after R rehashes the number of buckets will be b = B·2^R. Hence, the total number of elements can be represented as n = bl = Bl·2^R. Taking logarithms:

lg n = R + lg(Bl)
=> R = lg n – lg(Bl) = O(lg n)

There are about lg n rehashing operations, each with complexity at most O(n), so inserting n elements costs at most O(n lg n), or O(lg n) per element on average. In fact, the bound is even better: each rehash only copies the elements present at that time, and these counts double from one rehash to the next, so the total rehashing work forms a geometric series that sums to O(n). With a well-distributed hash function, the amortized cost of an insertion is therefore constant. This, of course, would not hold if the values were all clustered together in the single bucket that we are inserting into; then each insert would be O(n), which is the worst-case complexity of an insertion.

Deletion is very similar to insertion: the value is first searched for in its bucket and, if found, removed from that bucket.

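The following is a minimal sketch of a delete method for our HashTable class. It assumes that the LinkedList class used for the buckets offers a removal operation; the name removeFirstOccurrence used here is an assumption, so substitute whatever removal method your list implementation provides:

    public boolean delete(E value) {
        int hashCode = value.hashCode();
        // Same index computation as in insert and search.
        int index = Math.floorMod(hashCode, buckets.length);
        LinkedList<E> bucket = buckets[index];
        if (bucket == null) {
            return false;                  // no bucket, so the value is absent
        }
        // removeFirstOccurrence is assumed to remove the first element equal
        // to its argument and return true if something was actually removed.
        boolean removed = bucket.removeFirstOccurrence(value);
        if (removed) {
            totalValues--;
        }
        return removed;
    }
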
Search

Search is simple. We compute the hash code, go to the appropriate bucket, and do a linear search in the bucket:

public E search(E value) {
    int hash = value.hashCode();
    // Use the same index computation as insert, so we look in the right bucket.
    int index = Math.floorMod(hash, buckets.length);
    LinkedList<E> bucket = buckets[index];
    if (bucket == null) {
        return null;
    } else {
        for (E element : bucket) {
            if (element.equals(value)) {
                return element;
            }
        }
        return null;
    }
}

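As a small usage sketch of search (again assuming the HashTable class is compiled together with the LinkedList implementation from earlier in the book):

public class SearchDemo {
    public static void main(String[] args) {
        HashTable<String> table = new HashTable<>(8, 2.0);
        table.insert("apple");
        table.insert("banana");
        System.out.println(table.search("apple"));   // prints apple
        System.out.println(table.search("cherry"));  // prints null, never inserted
    }
}
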
Complexity of the search

The complexity of the search operation is constant time if the values are evenly distributed across the buckets, because then the number of elements per bucket is roughly the load factor, which is bounded by the maximum load factor. However, if all the values are in the same bucket, the search reduces to a linear search and is O(n), so the worst case is linear. The average case of search is constant time in most cases, which is better than that of binary search trees.

Choice of load factor

If the load factor is too big, each bucket will hold a lot of values, resulting in slow linear searches. However, if the load factor is too small, there will be a huge number of unused buckets, causing a waste of space. It is really a compromise between search time and space. It can be shown that, for a uniformly distributed hash code, the fraction of buckets that are empty is approximately e^(-l), where l is the load factor and e is the base of the natural logarithm. If we use a load factor of, say, 3, the fraction of empty buckets would be approximately e^(-3) ≈ 0.05, or 5 percent, which is not bad. In the case of a non-uniformly distributed hash code (that is, with unequal probabilities for different ranges of values of the same width), the fraction of empty buckets would always be greater. Empty buckets take up space in the array, but they do not improve the search time, so they are undesirable.
