The Theory of Sorting Data

Apart from comparing blocks of data, you will often find that you need to sort a particular set of data. In a base of sorted data, you can locate elements faster, and it becomes easier to present output in a certain logical format. Think, for instance, of representing the same information in different contexts: a list of employees ordered by date of birth for a birthday card list, the same list of employees ordered by salary for cost analysis, and so on. It is also easier (and faster) for a program to locate data that has been grouped according to a certain criterion; imagine looking up a number in an unsorted telephone directory.

In the previous sections, you have already read about a basic mechanism used in sorting, namely matching patterns of data. Sorting is in effect ordering data on the basis of a comparison. The element, or part of the element, that is used in this comparison is called the key. When a database of employee records is sorted on employee name, the employee name is the sorting key; when the same database is sorted on employee age, the age is the sorting key. Before we get down to the actual sorting algorithms, this section first covers some general theory on sorting.

Algorithm Issues

The Good, the Bad, and the Average

In comparing sorting algorithms, it is important to look not only at worst-case performance but also at best-case and average-case performance. The average case is, of course, the one you will run into most of the time; the best case is what you will want to demo to a potential client. Worst case and average case can be very different; they can even differ by orders of magnitude, as you will see later with the quick sort algorithm. Still, this does not necessarily mean that you should discard an algorithm based solely on its worst-case characteristics. The worst case might happen very seldom and may even be predictable or acceptable. Say that, on average, a certain sorting algorithm takes two seconds to sort the office's financial data; it may be acceptable for this to take up to five minutes once or twice a month. A smart employee will let this inconvenience coincide with a caffeine break. You may even try to detect possible worst-case scenarios inside the sorting algorithm, as the section on sorting techniques shows, but making this detection completely watertight will of course make your algorithm impossibly slow. (It would mean checking all the data.)

Algorithm Footprints

It is important to realize that the runtime footprint of a sorting algorithm is often larger than the space taken up by the actual code of the sorting algorithm. One reason for this is that some sorting algorithms are what is called in-place and others are not. A sorting algorithm is in-place when the data set stays the same size (or only a little over that size) while it is being sorted. A sorting algorithm that is not in-place needs extra space for the data set while it is sorting. For small databases this is unlikely to be a problem, but for large bases it can seriously constrain the maximum database size, because a percentage of the base size in memory needs to be reserved for sorting purposes. Another reason why the runtime footprint of a sorting algorithm can be larger than that of its code is the stack frames it needs. When a sorting algorithm uses recursion (see Chapter 8, "Functions"), the runtime footprint can suddenly balloon during sorting.

Algorithm Stability

Another characteristic of sorting algorithms is something called stability. A sorting algorithm is called stable when it leaves elements with identical keys in their original relative order during sorting. This means that when an unsorted database contains two employees with the same name, these employees will emerge from the sorted database in the same relative order when the employee name is the sorting key. Although such stability might seem obvious, not many sorting algorithms guarantee it. Whether stability is necessary depends, of course, on the requirements of your program.

Algorithm Simplicity

The final sorting algorithm characteristic of this section is the complexity of the implementation itself. So far you have read only about mathematical complexity, but this does not have to correspond directly to the complexity of the software that implements the sorting algorithm. Algorithms that are relatively easy to implement have the advantage that development and maintenance time is low and the chance of bugs is smaller. These advantages can translate directly into saving money on development. When it is acceptable for a certain sorting part of a program to be slow, you can choose an algorithm that is easy to implement and maintain instead of a brilliantly fast algorithm that only its implementer understands, and then only on the day he writes the code. In general, of course, the fastest sorting algorithms do tend to be the more complex ones to implement, as you will see in the section "Sorting Techniques."
