Chapter III.3. Collections and Dictionaries

An array can be handy when you need to store the same type of information, such as a group of integers. However, if you need to store different information, such as both integers and strings, and you aren't sure how many items you need to store, you probably can't use an array. Instead, you can use a collection or a dictionary.

A collection acts like a resizable array that can hold different data types at the same time while identifying each chunk of data with a number. A dictionary acts like a collection that identifies each chunk of data with a unique key.

The purpose of both collections and dictionaries is to make it easier to store different types of data and retrieve them again with the size and single data type restrictions of an array.

Using a Collection

A collection acts like a super array that can grow and expand without requiring any special commands. In addition, a collection can store different data types (such as integers or strings) or even other data structures, such as an array.

Note

Not all programming languages offer the collection data structure:

  • In some programming languages (like Python and Smalltalk), collections are a built-in feature of the language.

  • In other languages (like C or Pascal), you have to use more primitive data structures (like arrays) to mimic the features of a collection.

  • In many newer languages (like C# and Visual Basic.NET), someone else has already created a collection out of more primitive data structures, so you can use them without knowing how they were created.

Because a collection is nothing more than a data structure, like an array, the first step to creating a collection is to declare a variable as a collection, such as the following Visual Basic.NET example shows:

Dim MyStuff as New Collection

This command simply identifies a MyStuff variable as a collection data structure. The New command tells the computer to create a new collection.

Adding data to a collection

When you first create a collection, it contains zero items. Each time you add a new chunk of data to a collection, it expands automatically so you never have to specify a size beforehand (like an array) or deliberately resize it later (like a dynamic array).

To add data to a collection in Visual Basic.NET, you must use an Add command like this:

Dim MyStuff as New Collection
MyStuff.Add ("Dirty socks")

Each time you add another element to a collection, the computer tacks that new data at the end of the collection. So if you added the string Dirty socks, the number 7348, and the number 4.39, the collection would look like Figure 3-1.

Collections expand automatically each time you store new data.

Figure III.3-1. Collections expand automatically each time you store new data.

Dim MyStuff as New Collection
MyStuff.Add ("Dirty socks")
MyStuff.Add (7348)
MyStuff.Add (4.39)

Every element in a collection gets numbered with the first element given an index number of 1, the second given an index number of 2, and so on, similar to an array.

Warning

Although some programming languages number the first element of an array as 0, they often number the first item in a collection with 1.

For greater flexibility in adding data to a collection, you may be able to insert data before or after existing data, depending on the programming language. (In Visual Basic.NET, you can add data before or after existing data in a collection, but in REALbasic, you can't.)

Suppose you had the following Visual Basic.NET code to create a collection:

Dim HitLIst as New Collection
HitList.Add ("Billy Joe")
HitList.Add (99)
HitList.Add ("Johnny McGruffin")

If you wanted to add the name Hal Perkins to the end of the collection, you could use this command:

HitList.Add ("Hal Perkins")

Rather than always adding data to the end of a collection, you can also specify that you want to store new data before or after a specific location in the array. The location is specified by the collection's index number.

So if you wanted to add the number 3.14 before the second element in a collection, you could use the following:

HitList.Add (3.14,,2)

Note

Don't worry about the extra space between the two commas. That extra space is reserved for using collections with keys (which you read about later in this chapter).

The preceding command inserts the number 3.14 before the second element of the collection, as shown in Figure 3-2.

If you wanted to add the name Gini Belkins after the third element in the collection, you could use the following command:

HitList.Add ("Gini Belkins",,, 3)
You can insert data before a specific location in an array.

Figure III.3-2. You can insert data before a specific location in an array.

This command would insert the name Gini Belkins after the third element in the collection, as shown in Figure 3-3.

You can insert new data after an existing location in a collection.

Figure III.3-3. You can insert new data after an existing location in a collection.

Note

Don't worry about the extra spaces between the three commas in the preceding command:

  • The first extra space is reserved for using collections with keys (which you read about later in this chapter).

  • The second extra space is used to store data before a location in a collection.

Deleting data from a collection

After you store data in a collection, you can always delete data from that collection. To delete data, you must specify the location of that data by defining an index number. So if you want to delete the fourth item in a collection, you'd specify deleting data stored at index 4 like this:

HitList.Remove (4)

When you delete data from an array, that array now contains empty space, but when you delete data from a collection, the collection automatically renumbers the rest of its data so there isn't any empty space, as shown in Figure 3-4.

When you remove data from a collection, the collection renumbers the rest of its data.

Figure III.3-4. When you remove data from a collection, the collection renumbers the rest of its data.

Identifying data with keys

One problem with collections is that they identify data by their position in the collection. If you don't know the location of specific data in a collection, you have to search the entire collection, item by item, to find specific data.

Because trying to find data in a collection by its location can be clumsy and unintuitive, collections give you the option of identifying data with any descriptive string, or a key.

For example, if you stored the name of your boss in a collection, you could identify it with the key boss. If you stored the name of your friend in a collection, you could identify it with the key best friend. And if you stored the name of your ex-boss in a collection, you could identify it with the key moron, as shown in Figure 3-5.

Identify data by its location or by a key.

Figure III.3-5. Identify data by its location or by a key.

When you add data to a collection, you can optionally also assign a key to that data, which you can later use to search and retrieve that data again. So if you wanted to add the data Mike Ross along with the key moron, you could use the following command:

HitList.Add ("Mike Ross", "moron")

Warning

When adding a key to data in a collection, your key must meet these criteria:

  • You must add a key at the same time you add data to a collection.

    After you add data to a collection, you can't go back later and add a key to that data.

  • Every key must be a string.

  • Every key must be unique; no two items in a collection can share the same key.

Searching and retrieving data

After you store data in a collection, here are two ways to search and retrieve data from that collection:

  • Use the index number of that data.

  • Use the key of that data.

Warning

If you don't store a key with data originally, you can't retrieve that data with a key.

Index numbers

To retrieve data based on its location, you can do something as simple as the following:

Dim Junk as New Collection
Junk.Add (3.1415)
Junk.Add (99)
Junk.Add ("Bo")

If you wanted to retrieve the name Bo from the collection, you'd have to know that Bo is stored as the third item (index number 3), so the following would store the string Bo in the Good variable:

Good = Junk.Item(3)

The problem with relying on index numbers alone is that as you add and delete items from a collection, the index numbers may change, as shown in Figure 3-6.

Retrieving data by index numbers is unreliable because they can change.

Figure III.3-6. Retrieving data by index numbers is unreliable because they can change.

Because index numbers don't always stay matched with each item in a collection, a better solution is to assign a unique string, or a key, to each item, as described in the following section.

Keys

By assigning a descriptive key to each item, you can use that key to retrieve that item no matter where it might be stored in the collection.

The following code assigns the key "pi" to the first item, the key "secret agent" to the second item, and the key "my cat" to the third item:

Dim MoreJunk as New Collection
MoreJunk.Add (3.1415, "pi")
MoreJunk.Add (99, "secret agent")
MoreJunk.Add ("Bo", "my cat")

To retrieve items from a collection with a key, you have to remember the key associated with each chunk of data. The following code stores the number 3.1415 into the CircleSecret variable:

CircleSecret = MoreJunk.Item ("pi")

The preceding code tells the computer to find the chunk of data assigned the "pi" key and then store that data in the CircleSecret variable.

The preceding code retrieves the number 3.1415 no matter where its location may be in a collection.

Warning

You can always retrieve data with either its key or its location (index number).

Using Dictionaries

Essentially, a dictionary is like a collection but with two additional advantages:

  • Searching for that data in a dictionary is much faster.

    Dictionaries use a data structure known as a hash table (which you read more about later in this chapter). Every item stored in a dictionary must include a unique key.

    Collections don't require keys because searching for data in a collection is sequential. Therefore, the computer must start at the beginning of the collection and examine each item one by one to retrieve a specific item. The more items in a collection, the slower the search process.

    Warning

    If a programming language offers collections, it usually also offers dictionaries. If a programming language doesn't offer collections, it probably doesn't offer dictionaries either.

  • In a dictionary, the key can be any value, including strings or numbers, which gives you greater flexibility in assigning keys to data.

    If a collection uses a key to identify data, that key must be a string.

Note

Dictionaries are also called associative arrays. When you store data and a key, that's known as a key-value pair (refer to Figure 3-5).

Like a collection, a dictionary is a data type, so you must first declare a variable as a dictionary. Then you must create a new dictionary and store data in that dictionary.

To create a dictionary in the Smalltalk programming language, you could use the following:

blackbook := Dictionary new.

This code declares the blackbook variable as a dictionary data type. The new command simply creates an empty dictionary.

Adding data to a dictionary

After you declare a variable as a dictionary data type, you can start adding data to that dictionary by defining both the data and the key you want to associate with that date. So if you want to add the name Dick Ross to a dictionary and assign it a moron key, you could use the following:

blackbook := Dictionary new.
blackbook at: 'moron'      put: 'Dick Ross'.

Warning

Every time you add data to a dictionary, you must include a corresponding key.

In Smalltalk, the key appears directly after the at command, and the data appears after the put command, as shown here:

blackbook := Dictionary new.
blackbook at: 'moron'      put: 'Dick Ross'.
blackbook at: 'imbecile'   put: 'John Adams'.
blackbook at: 'idiot'      put: 'Sally Parker'.

Searching and retrieving data from a dictionary

To access and retrieve data from a dictionary, you need to identify the dictionary variable and the key associated with the data you want to find. So if you wanted to find the data associated with the key idiot, you could use the following command:

blackbook at: 'idiot'

This would return:

'Sally Parker'

Note

Dictionaries are more efficient at searching and retrieving data because the computer doesn't need to search through the entire dictionary sequentially. Instead, the computer searches through data using a hash table. This is like the difference between looking through the phone book, starting from page one, trying to find the phone number of the Versatile Plumbing company, or just skipping straight to the V section of the phone book and then looking alphabetically from the beginning of that V section to find the phone number of Versatile Plumbing, as shown in Figure 3-7.

Understanding Hash Tables

If you stored data without a key in a collection, searching for a specific chunk of data is difficult because the data isn't sorted. So to ensure you find data in a collection, you must search the entire collection from start to finish. The more data you store, the longer the search takes, just as it takes longer to find a specific playing card in a deck of 52 cards compared to a deck of only 4 cards.

Hash tables make searching faster by dividing data into distinct sections.

Figure III.3-7. Hash tables make searching faster by dividing data into distinct sections.

When you store data with a unique key (a key-value pair), the key is used to help identify and retrieve the data. However, just using a key alone is no better than storing data alone because keys are just another chunk of data. The more data (and keys) you store, the longer it takes the computer to search through its entire list of keys.

Converting keys with a hash function

To speed up searching, dictionaries use hash tables. Basically, a hash table takes the keys used to identify data and then converts that key into a hash value. This hash value gets stored into a sorted list (table), as shown in Figure 3-8.

The exact method used to convert a key into a value is a hash function. The converted key, or hash, now points directly to the stored data. At this point, the computer actually stores just two chunks of data:

  • The data itself

  • A hash value calculated from the key

When you want to retrieve data, you give the computer the key associated with the data that you want. The computer takes the key and uses its hash function to convert the key to a value.

Hash tables convert keys into a numeric value.

Figure III.3-8. Hash tables convert keys into a numeric value.

Now the computer tries to match this calculated value to its list of values stored in the hash table. When it finds a match, it can then find the data associated with that key.

A simple hash function might just add up all the characters in a key and use that total as a value. For example, consider the keys moron and imbecile.

blackbook := Dictionary new.
blackbook at: 'moron'      put: 'Dick Ross'.
blackbook at: 'imbecile'   put: 'John Adams'.

Such a simple hash function could create a table like this:

Hash table

Data

5

Dick Ross

8

John Adams

If you wanted to find the data associated with the key moron, the computer would first calculate its hash value, which is five.

Next, it'd try to match this value with an existing value stored in the hash table. In this case, it finds that the hash value of 5 matches up to the key moron and the data Dick Ross, which is the data you wanted in the first place.

Note

Basically, a hash table works by searching through a sorted list of data (the hash values calculated from the key) rather than the unsorted list of data.

Hash function collisions

The hash function used to create a hash table can greatly determine the efficiency of that hash table. Ideally, the hash function should create a different hash value for every key. Unfortunately, that's not always possible, which means that sometimes the hash function can create identical hash values from different keys.

In the previous example, the hash function converted a key to a hash value just by counting the number of characters used in the key. So if two different keys have the same number of characters, the hash function will create the same hash value like this:

blackbook := Dictionary new.
blackbook at: 'moron'      put: 'Dick Ross'.
blackbook at: 'imbecile'   put: 'John Adams'.
Blackbook at: 'idiot'      put: 'Sally Evans'.

Using the simple hash function to count the number of characters in each key would create a table like this:

Hash table

Data

5

Dick Ross

8

John Adams

5

Sally Evans

Warning

Hash tables can't have duplicate values because every hash value must match with a single chunk of data. A collision occurs when a hash function creates duplicate values from different keys. Here are two ways to prevent collisions:

  • Develop a better hash function.

    Unfortunately, no matter how many different hash functions you create, the more data stored, the greater the chance that any hash function will eventually calculate a duplicate value from two different keys.

  • Find a way to deal with hash value collisions.

    The following sections provide solutions.

Solving collisions by chaining

The simplest way to deal with collisions (duplicate hash values) is chaining.

Normally, each hash value points to a single chunk of data. The idea behind chaining is that each hash value can actually point to a list of data, as shown in Figure 3-9.

Chaining lets a single hash value point to a list of multiple items.

Figure III.3-9. Chaining lets a single hash value point to a list of multiple items.

Now if you search for data using a key, the computer

  1. Calculates the hash value of that key, which points to a list of data

  2. Searches through this list sequentially, one by one

Note

Chaining works because searching a shorter list sequentially is faster than searching the whole list sequentially. (It's like finding the Versatile Plumbing company in the phone book by starting with the V section instead of the first page of the book.)

Avoiding collisions with double hashing

Another way to avoid collisions is to use double hashing:

  1. The hash function calculates a hash value for each key.

  2. If a collision occurs, the computer calculates a second hash value.

Essentially, you wind up with a hash table within another hash table, as shown in Figure 3-10.

Double hashing creates miniature hash tables within a larger hash table.

Figure III.3-10. Double hashing creates miniature hash tables within a larger hash table.

Double hashing can reduce the number of duplicate hash values (collisions), but here are a couple of drawbacks:

  • A collision can occur even after the double hashing.

  • Double hashing is a more complicated solution than chaining.

The more complex a program, the greater the chances of something going wrong.

In general, the simpler the data structure (such as arrays), the easier they are to implement and the less chance that something goes wrong. Of course, the simpler the data structure, the more restrictive the data structure.

Both collections and dictionaries (using hash tables) give you added flexibility but at the cost of added complexity:

  • In many programming languages, such as C, you have to create dictionaries and hash tables from scratch.

  • In other programming languages, such as C# or Python, collections and dictionaries are built-in features of that language, which makes these data structures easy to use and implement.

Sometimes you may find collections or dictionaries easier to use and sometimes you may find arrays or sets easier to use. By understanding the different types of available data structures and the pros and cons of each, you're more likely to choose the best one that makes it easy to solve your particular problem.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset