Chapter IV.1. Sorting Algorithms

Every program handles data, whether numbers or text. Besides storing data, most programs also need to organize it, which means sorting it in a specific order, such as alphabetically or numerically. A database might sort names alphabetically by last name or group records by sales region, whereas a video game needs to sort the top ten highest scores.

Despite the simple idea behind sorting a list of names or numbers, sorting is practically a field of computer science in itself. Computer scientists constantly study different ways to sort data, looking for the fastest, most efficient methods possible. Each of these different methods is a sorting algorithm.

An algorithm is a fancy term for a method of doing something, so a sorting algorithm is a specific method for telling the computer how to sort data. The reason computer scientists keep creating and studying sorting algorithms is that no single sorting algorithm is best for all purposes.

Some sorting algorithms are easy to create but work slowly. Other sorting algorithms are much harder to create but work much faster. Ironically, some sorting algorithms work horribly when sorting a small number of items, such as a dozen numbers, but work quickly when sorting thousands of items.

Four factors to consider when choosing a sorting algorithm are

  • Ease of implementation

  • Speed

  • Memory requirements

  • Stability

Ease of implementation defines how complicated the sorting algorithm is to implement in any programming language. Some sorting algorithms are easy to write but slow in actual use. Other sorting algorithms are much harder to write but perform much faster.

Speed measures how fast the algorithm can sort data of different sizes. Some sorting algorithms work quickly with small lists but slow down dramatically when dealing with larger lists. Other sorting algorithms work quickly when a list is mostly sorted but slow to a crawl when working with completely unsorted lists.

Memory requirements define how much memory the sorting algorithm needs to run. A related question is whether the algorithm can accept and sort data while it's still receiving that data, which makes it an online algorithm. (An offline algorithm has to wait to receive the complete set of data before it can even start the sorting process.) Sorting algorithms that use recursion (see Book II, Chapter 6) may be easy to implement and fast but can require lots of memory to sort large amounts of data.

Note

Another factor that determines a sorting algorithm's memory requirements is whether the algorithm is an in-place algorithm. An in-place algorithm can sort data by using the existing data structure that holds the data. For example, if you have an array of numbers, an in-place algorithm can sort the data in that array without needing a separate data structure to hold the data temporarily.

Stability refers to whether a sorting algorithm preserves the order of identical data. Suppose you had a list of first and last names, such as John Smith and Mary Smith, where John Smith appears near the beginning of the list and Mary Smith appears near the end. A stable sorting algorithm sorts both names by Smith and keeps John Smith ahead of Mary Smith, but an unstable algorithm may move Mary Smith ahead of John Smith. (Of the sorting algorithms presented in this chapter, bubble sort, insertion sort, and merge sort are stable; heap sort, quick sort, shell sort, and the usual implementation of selection sort are not.)

Warning

Ultimately, there's no perfect sorting algorithm; choosing the best one for a job always means trading off speed, memory use, and ease of implementation.

Using Bubble Sort

The simplest way to sort any amount of data is to start at the beginning and compare the first two adjacent items. So if you need to sort a list of numbers, you compare the first number with the second number. If the first number is bigger, you swap its place with the second number. If the second number is bigger, you don't do anything at all.

After comparing the first two items, you move down to comparing the second and third items, and so on. Repetitively comparing two adjacent items is the basic idea behind the bubble sort algorithm, as shown in Figure 1-1. The bubble sort algorithm gets its name because small values tend to bubble up to the top.

Figure IV.1-1. A bubble sort repetitively compares two adjacent items.

Basically the bubble sort algorithm works like this:

  1. Compare two adjacent items.

  2. Swap the two items if necessary.

  3. Repeat Steps 1 and 2 with each pair of adjacent items.

  4. Repeat Steps 1-3, examining the entire list again, until no swapping occurs; the list is then sorted.

Although the bubble sort algorithm is easy to implement, it's also among the slowest and most inefficient algorithms because it must examine an entire list multiple times. For sorting a small list of mostly sorted data, the bubble sort algorithm works efficiently. For sorting large lists of unsorted data, almost any other sorting algorithm is much faster than the bubble sort algorithm, as shown in Figure 1-2.

Figure IV.1-2. The bubble sort algorithm examines the entire list of data several times.
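
The four steps above map directly onto a short Python sketch; the function name and the `swapped` early-exit flag are illustrative choices, not part of any library:

```python
def bubble_sort(items):
    """Repeatedly compare adjacent pairs, swapping when out of order,
    until a full pass makes no swaps (Step 4)."""
    items = list(items)                  # sort a copy; leave the caller's list intact
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(items) - 1):
            if items[i] > items[i + 1]:  # bigger value moves one slot right
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
    return items

print(bubble_sort([5, 1, 4, 2, 8]))      # [1, 2, 4, 5, 8]
```

The `swapped` flag implements Step 4 directly: when a complete pass makes no swaps, the list is sorted and the loop stops.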

Using Selection Sort

Another simple way to sort a list is to search the entire list until you find the smallest value. Then move that value to the front of the list. Now repeat the process all over again, skipping the first item. By repetitively searching for the smallest item and moving it to the front of the list, the selection sort algorithm can eventually sort an entire list, as shown in Figure 1-3.

The selection sort algorithm works like this:

  1. Find the smallest item in a list.

  2. Swap this value with the value currently at the front of the list.

  3. Repeat Steps 1 and 2 on the rest of the list, ignoring the values already moved to the front (each pass examines a list one item shorter than the last).

Figure IV.1-3. Selection sort repetitively moves the smallest value to the front of the list.

For sorting small lists, selection sort is fast, but for sorting large lists, selection sort takes too long because it must repeatedly scan the shrinking unsorted section of the list. Despite this drawback with sorting large lists, selection sort is popular because it's simple to implement.
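
As a minimal Python sketch of the three steps above (the function name is illustrative):

```python
def selection_sort(items):
    """Repeatedly move the smallest remaining value to the front."""
    items = list(items)                  # sort a copy
    for front in range(len(items) - 1):
        smallest = front
        # Step 1: scan the unsorted remainder for the smallest value
        for i in range(front + 1, len(items)):
            if items[i] < items[smallest]:
                smallest = i
        # Step 2: swap it with the value at the front of the unsorted section
        items[front], items[smallest] = items[smallest], items[front]
    return items

print(selection_sort([29, 10, 14, 37, 13]))  # [10, 13, 14, 29, 37]
```

Each pass of the outer loop shrinks the unsorted section by one item, which is Step 3 in the list above.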

Using Insertion Sort

The insertion sort algorithm acts like a cross between the bubble sort and the selection sort algorithm. Like bubble sort, insertion sort examines two adjacent values. Like selection sort, insertion sort moves smaller values from their current location to an earlier position near the front of the list.

The insertion sort algorithm works like this:

  1. Start with the second item in the list.

  2. Compare this second item with the first item. If the second value is smaller, swap places in the list.

  3. Compare the next item in the list and insert it in the proper place in relation to the previously sorted items.

  4. Repeat Step 3 until the entire list is sorted.

The main difference between insertion sort and bubble sort is that bubble sort only swaps adjacent values, whereas insertion sort can move a value to a non-adjacent location, as shown in Figure 1-4.

Figure IV.1-4. Insertion sort makes a single pass through the list, shifting values backward as needed.

Tip

One major advantage of the insertion sort algorithm is that it makes only a single pass through the list, although each insertion may shift several of the previously sorted values. In comparison, bubble sort must repetitively examine the entire list multiple times, and selection sort must repeatedly scan progressively smaller lists. As a result, insertion sort usually runs faster in practice while remaining easy to implement.
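
The steps above can be sketched in a few lines of Python (function name illustrative):

```python
def insertion_sort(items):
    """Insert each value into its proper place among the
    already-sorted values to its left."""
    items = list(items)
    for i in range(1, len(items)):       # Step 1: start with the second item
        value = items[i]
        j = i - 1
        # Steps 2-3: shift larger sorted values one slot right to make room
        while j >= 0 and items[j] > value:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = value             # insert into the opened slot
    return items

print(insertion_sort([7, 3, 5, 1]))      # [1, 3, 5, 7]
```

Note how `value` can travel several positions backward in one insertion, which is exactly the non-adjacent move that distinguishes insertion sort from bubble sort.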

Using Shell Sort

To speed up the performance of the insertion sort algorithm, Donald Shell, a computer scientist, created a variation of the insertion sort algorithm dubbed shell sort.

One problem with insertion sort is that it moves each value only one position at a time. In a long list, a value far from its final place needs many small shifts, which takes a long time. Shell sort works by dividing a long list into several smaller ones and then performing an insertion sort on each smaller list. After each smaller list gets sorted, the shell sort algorithm uses an ordinary insertion sort to sort the entire list one last time. By this time, the longer list is nearly sorted, so this final insertion sort occurs quickly, as shown in Figure 1-5.

Basically, shell sort works like this:

  1. Divide a long list into multiple smaller lists.

  2. Arrange the values in a grid or table consisting of rows and columns.

    Figure IV.1-5. Shell sort performs multiple insertion sorts on parts of a long list.

    The original, unsorted list fills the grid row by row; reading down each column then gives one of the smaller sublists.

  3. Use an insertion sort to sort the columns.

  4. Repeat Steps 1-3 with progressively smaller lists until only a single list is left to be sorted with the insertion sort algorithm.

Warning

The shell sort algorithm isn't necessarily a different sorting algorithm. Instead, shell sort is a way to use the insertion sort algorithm more efficiently.
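
In code, the row-and-column grid is usually expressed as a gap: values that are `gap` positions apart form one column. A minimal Python sketch, using the simple halving gap sequence (many other gap sequences exist):

```python
def shell_sort(items):
    """Insertion-sort gap-separated sublists, shrinking the gap each pass."""
    items = list(items)
    gap = len(items) // 2                # simple halving gap sequence
    while gap > 0:
        # gap-insertion sort: each "column" is the set of items gap apart
        for i in range(gap, len(items)):
            value = items[i]
            j = i
            while j >= gap and items[j - gap] > value:
                items[j] = items[j - gap]
                j -= gap
            items[j] = value
        gap //= 2                        # final pass (gap == 1) is a plain insertion sort
    return items

print(shell_sort([62, 83, 18, 53, 7, 17, 95, 86, 47]))
```

When `gap` finally reaches 1, the loop body is an ordinary insertion sort over the whole, by-then nearly sorted, list.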

Using Heap Sort

The heap sort algorithm works by using a separate data structure called a heap, which is a binary tree. The highest value is stored in the root node, and the remaining values fill out the rest of the tree.

The two criteria for storing data in a heap are that

  • Every parent node must contain a value greater than (or equal to) the values of its child nodes.

  • The tree must fill each level before adding nodes to the next lower level.

If there aren't enough nodes to fill out an entire level, the nodes must fill out as much of the last level as possible, starting from the left.

Figure 1-6 shows a valid heap and two invalid heaps.

The heap sort algorithm works like this:

  1. Store an unsorted list in a heap data structure, which sorts data so the highest values appear near the top of the heap.

  2. Yank off the highest value stored in the root node and store this value at the end of the sorted list.

  3. Re-sort the heap so the highest values appear near the top of the heap.

  4. Repeat Steps 2 and 3 until all values have been removed from the heap and sorted.

Figure IV.1-6. Valid and invalid heap binary trees.

Heap sort dumps an unsorted list of data into the heap, always making sure the highest value appears in the root node. Then this highest value in the root node gets stored at the end of the list. Now the heap rearranges its values once more, putting the highest remaining value in the root node, repeating the process all over again, as shown in Figure 1-7.

Figure IV.1-7. Heap sort uses a tree data structure to sort and store items temporarily.

Initially, the heap sort algorithm may seem complicated because you need to create a heap data structure, copy and sort values among nodes, and delete nodes while you remove values and store them back in a sorted list. Although you can create a heap data structure by using a linked list, a much simpler method is to create a heap data structure by using an array, as shown in Figure 1-8.

The first array element represents the root node, the next two elements represent the child nodes of the root, and so on. Rather than manipulate a linked list as a heap, it's much simpler to rearrange values stored in an array, as shown in Figure 1-9. Because arrays are easy to implement in any programming language, the heap sort algorithm is also easy to implement. Although slightly more complicated than bubble sort or selection sort, the heap sort algorithm offers faster performance.

Figure IV.1-8. An array can mimic a heap data structure.

Figure IV.1-9. Manipulating data in an array that mimics a heap.
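
As a rough Python sketch of the array-based approach from Figures 1-8 and 1-9, where the element at index i has children at indexes 2i+1 and 2i+2 (the `sift_down` helper name is an illustrative choice):

```python
def sift_down(items, root, end):
    """Push items[root] down until both children are smaller (max-heap rule)."""
    while True:
        child = 2 * root + 1                 # left child in the array layout
        if child > end:
            return
        if child + 1 <= end and items[child + 1] > items[child]:
            child += 1                       # pick the larger of the two children
        if items[root] >= items[child]:
            return                           # heap rule satisfied
        items[root], items[child] = items[child], items[root]
        root = child

def heap_sort(items):
    items = list(items)
    n = len(items)
    # Step 1: build the initial max-heap in place
    for root in range(n // 2 - 1, -1, -1):
        sift_down(items, root, n - 1)
    # Steps 2-4: repeatedly move the root (highest value) to the end
    for end in range(n - 1, 0, -1):
        items[0], items[end] = items[end], items[0]
        sift_down(items, 0, end - 1)         # re-heapify the remainder
    return items

print(heap_sort([3, 1, 4, 1, 5, 9, 2, 6]))
```

Because the heap lives inside the same array being sorted, this version is an in-place algorithm: no separate tree structure is needed.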

Using Merge Sort

The merge sort algorithm works on the principle that it's easier to sort a small list than a large list. So merge sort breaks a large list of data into smaller and smaller lists until each list holds just one value, which is sorted by definition. Then it merges neighboring sorted lists back together, always keeping the merged result in order. Because merging two already-sorted lists is quick, repeating these merges ultimately sorts the entire list.

The merge sort algorithm works like this:

  1. Divide a large list in half.

  2. Divide each smaller list in half until each small list consists only of one value.

  3. Merge each single-value list with a neighboring single-value list, placing the two values in order.

  4. Merge these smaller, sorted lists into larger sorted lists by repeatedly taking the smaller front value from each list.

  5. Repeat Step 4 until the entire list is merged back together and sorted.

Figure 1-10 shows how merge sort works.

Because the merge sort algorithm successively divides a list in half, merge sort needs to create temporary data structures (such as arrays) to store data while it divides and merges values. When sorting a small list, creating a temporary data structure is simple, but when sorting a large list, creating a temporary large data structure can gobble up memory.

Note

In Perl 5.8, the default sorting algorithm is merge sort. In earlier versions of Perl, the default sorting algorithm was quick sort.

Figure IV.1-10. Merge sort breaks a long list into several smaller ones and then merges these back into a longer list.
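
The steps above can be sketched recursively in Python; the temporary lists created by slicing and merging are exactly the extra memory the text mentions:

```python
def merge_sort(items):
    if len(items) <= 1:
        return list(items)               # a one-value list is already sorted
    mid = len(items) // 2
    left = merge_sort(items[:mid])       # Steps 1-2: divide and sort each half
    right = merge_sort(items[mid:])
    # Steps 3-5: merge, always taking the smaller front value
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:          # <= keeps equal values in order (stable)
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:] # one side may have values left over

print(merge_sort([38, 27, 43, 3, 9, 82, 10]))  # [3, 9, 10, 27, 38, 43, 82]
```

Using `<=` when comparing the front values is what makes this merge sort stable: a value from the left (earlier) half always wins ties.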

Using Quick Sort

The quick sort algorithm gets its name because it's generally the fastest sorting algorithm in most cases. Like the merge sort algorithm, the quick sort algorithm works on the principle that it's easier and faster to sort a small list than a large list, so quick sort divides a large list into two parts.

To divide a list into two parts, quick sort picks a value, called the pivot, from the list:

  • Any value less than the pivot goes into one list.

  • Any value greater than (or equal to) the pivot goes into the second list.

Quick sort repetitively divides each list into two parts until it creates two lists that contain only one value each. Then the algorithm sorts these two values and starts combining the multiple smaller lists to create larger sorted lists until the entire list ultimately gets sorted, as shown in Figure 1-11.

Figure IV.1-11. Quick sort repetitively divides a large list into two smaller lists, sorting items based on a pivot value.

The quick sort algorithm works much like merge sort, but quick sort uses a pivot value to pre-sort values into two different lists. Because each value lands on the correct side of the pivot before the lists are combined, the combining step is trivial, which generally makes quick sort faster than merge sort in practice.

Basically, the quick sort algorithm works like this:

  1. Pick a pivot value from the list.

  2. Divide the list in two, placing values less than the pivot value in the first list and values greater than the pivot value in the second list.

  3. Repeat Steps 1 and 2 until each list contains only one item.

  4. Combine these smaller lists into a larger, sorted list.

Tip

The key to speeding up the quick sort algorithm is to choose a good pivot for dividing each list. Ideally, the pivot is close to the median value, so each division splits the list roughly in half. Notice that the merge sort algorithm (see Figure 1-10) still compares values at every merge step, whereas good pivot values let the quick sort algorithm (see Figure 1-11) sort the entire list in far fewer steps.
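
A minimal Python sketch of these steps, using the middle element as a simple pivot choice (real implementations often partition in place to save memory; this version favors clarity):

```python
def quick_sort(items):
    if len(items) <= 1:                  # Step 3: a one-value list is sorted
        return list(items)
    pivot = items[len(items) // 2]       # Step 1: pick a pivot
    less    = [x for x in items if x < pivot]    # Step 2: split around it
    equal   = [x for x in items if x == pivot]   # duplicates of the pivot
    greater = [x for x in items if x > pivot]
    # Step 4: sort each part recursively and combine
    return quick_sort(less) + equal + quick_sort(greater)

print(quick_sort([33, 10, 55, 71, 29, 3]))   # [3, 10, 29, 33, 55, 71]
```

Collecting values equal to the pivot into their own list avoids infinite recursion when the list contains duplicates.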

Comparing Sorting Algorithms

To compare the speed of sorting algorithms, computer scientists consider the following scenarios:

  • Best-case

  • Worst-case

  • Average-case

A best-case scenario measures the speed of different algorithms on a list that's already sorted. The worst-case scenario measures their speed on the least favorable input, which differs by algorithm: for many simple algorithms it's a list sorted in reverse, and for a naive quick sort it's actually a list that's already sorted. The average-case scenario measures the speed of different algorithms sorting random values in a list.

To measure the speed and efficiency of an algorithm, computer scientists measure how much time an algorithm needs to run based on different sizes of input, designated by the letter n. A small value of n means the input is short, whereas a large value of n means the input is large.

An algorithm's efficiency (how fast it runs) is thus based on its input size n. In mathematical terms, this is referred to as an order of n. If an algorithm runs at the same speed no matter how much data it receives, it's said to run in constant time, which is written in Big-O notation as O(1).

If an algorithm's speed depends directly on the number of items it receives, it's said to run in linear time, written as O(n). Some common Big-O notations for different algorithms include

  • O(log n): Logarithmic time

  • O(n^c): Polynomial time

  • O(c^n): Exponential time

  • O(n!): Factorial time

Warning

Describing an algorithm's efficiency by its order is known as Big-O notation; the O stands for "order" and is always capitalized.

Computer scientists have already calculated the Big-O values of different sorting algorithms. Table 1-1 gives different Big-O values for different sorting algorithms for the best-case, worst-case, and average-case scenarios.

To compare the efficiency of different algorithms, plug in different values for n.

Table IV.1-1. Comparison of Different Sorting Algorithms

Algorithm        Average        Best           Worst
Bubble sort      O(n^2)         O(n^2)         O(n^2)
Selection sort   O(n^2)         O(n^2)         O(n^2)
Insertion sort   O(n^2)         O(n)           O(n^2)
Heap sort        O(n*log(n))    O(n*log(n))    O(n*log(n))
Merge sort       O(n*log(n))    O(n*log(n))    O(n*log(n))
Quick sort       O(n*log(n))    O(n*log(n))    O(n^2)

For example, the Big-O notation for the bubble sort algorithm is O(n^2). To sort one item, the bubble sort algorithm's efficiency is O(1^2) or O(1). To sort five items, the bubble sort's efficiency is O(5^2) or O(25). The higher the Big-O value, the slower and less efficient the algorithm is, which means that the more items the bubble sort algorithm needs to sort, the slower it runs.

The quick sort algorithm has a Big-O notation of O(n*log(n)). To see how quick sort compares to bubble sort when sorting five items, plug in a value of 5 for n: O(5*log(5)), which (using a base-10 logarithm) works out to roughly O(5*0.70), or O(3.5).

With the bubble sort algorithm, the more items needed to sort (n), the higher its Big-O value. To sort five items, the bubble sort algorithm's Big-O notation is O(25), which is much larger than the quick sort algorithm's similar Big-O notation of O(3.5). The difference in these two values can give you a rough idea how much slower bubble sort works compared to quick sort when sorting the same number of items.
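
The same plug-in arithmetic is easy to tabulate for larger lists; the short sketch below uses the base-10 logarithm to match the example in the text:

```python
import math

# Compare the rough "operation counts" n^2 (bubble sort) and
# n*log(n) (quick sort) as the list size n grows.
for n in (5, 100, 10_000):
    print(f"n={n:>6}: n^2 = {n ** 2:>12,}   n*log(n) = {n * math.log10(n):>10,.1f}")
```

The gap widens dramatically: at n = 10,000, n^2 is 100,000,000 while n*log(n) is only 40,000, which is why the O(n*log(n)) algorithms dominate on large lists.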

Tip

Table 1-1 shows that some algorithms, such as bubble sort, behave the same in best-case, worst-case, and average-case scenarios. That's because the bubble sort algorithm repetitively examines and sorts a list whether or not the values need to be sorted.

The insertion sort algorithm is unique in that it runs the fastest in a best-case (already-sorted) scenario. Although the quick sort algorithm is considered the fastest overall, in its worst case it drops to O(n^2), making it one of the slowest. A naive quick sort hits that worst case when its pivot choices are consistently poor, such as always picking the first value of an already-sorted list, so practical implementations take care in choosing pivots.

Tip

The best sorting algorithm is the one that's the fastest for sorting your type of data. If you need to write a program that regularly needs to sort random data (average-case scenario), you might choose one sorting algorithm whereas if you need to sort completely unsorted data (worst-case scenario), you'd probably choose a different algorithm. The fastest sorting algorithm always depends partially on the sorted (or unsorted) data that the algorithm needs to manipulate.

Warning

The three fastest sorting algorithms are heap sort, merge sort, and quick sort. Although quick sort is considered the fastest of the three, merge sort and heap sort hold up better in worst-case scenarios.
