18
External Sorting

18.1. Introduction

Internal sorting deals with the ordering of records (or keys) of a file (or list) in ascending or descending order when the whole file or list is compact enough to be accommodated in the internal memory of the computer. Chapter 17 detailed internal sorting techniques such as Bubble sort, Insertion sort, Selection sort, Merge sort, Shell sort, Quick sort, Heap sort, Radix sort, Counting sort and Bucket sort.

However, in many applications and problems, it is quite common to encounter huge files comprising millions of records, which need to be sorted for their effective use in the application concerned. The application domains of e-governance, digital library, search engines, online telephone directory and electoral system, to list a few, deal with voluminous files of records.

The internal sorting techniques that we have learned are, for the most part, incapable of sorting such large files, since they require the whole file to reside in the internal memory of the computer, which is impossible for files of this volume. Hence the need for external sorting methods, which are strategies devised exclusively to sort huge files.

18.1.1. The principle behind external sorting

Due to their large volume, the files are stored on external storage devices such as tapes, disks or drums. The external sorting strategies, therefore, need to take into consideration the kind of medium on which the files reside, since the characteristics of the medium influence the working of the strategy.

The files residing on these external storage devices are read ‘piecemeal’, since only as many records as can be accommodated in the internal memory of the computer can be read at a time. These batches of records are sorted by making use of any efficient internal sorting method. Each sorted batch of records is referred to as a run. The file is now viewed as a collection of runs. The runs, as and when they are generated, are written out onto the external storage devices. The variety in the external sorting methods for a particular storage device is brought about only by the ways in which these runs are gathered and processed before the final sorted file is obtained. However, the majority of the popular external sorting methods make use of merge sort for gathering and processing the runs.

A common principle behind most popular external sorting methods is outlined below:

  1. Internally sort batches of records from the source file to generate runs. Write out the runs as and when they are generated, onto the external storage device(s).
  2. Merge the runs generated in the earlier phase, to obtain larger but fewer runs, and write them out onto the external storage devices.
  3. Repeat the run generation by merging until, in the final phase, only one run is generated, at which point the sorting of the file is complete.
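The three-step outline above can be captured in a short, device-independent sketch. The following Python fragment is purely illustrative: the function names and the memory_capacity and k parameters are assumptions made for this example, and runs are held in ordinary lists rather than being written to an external device.

    import heapq

    def generate_runs(records, memory_capacity):
        """Step 1: internally sort batches that fit in memory to produce runs."""
        runs = []
        for start in range(0, len(records), memory_capacity):
            batch = sorted(records[start:start + memory_capacity])  # any internal sort will do
            runs.append(batch)                                      # a run, written out in reality
        return runs

    def merge_pass(runs, k):
        """Step 2: merge groups of k runs to obtain larger but fewer runs."""
        merged = []
        for start in range(0, len(runs), k):
            group = runs[start:start + k]
            merged.append(list(heapq.merge(*group)))                # k-way merge of sorted runs
        return merged

    def external_sort(records, memory_capacity=4, k=2):
        runs = generate_runs(records, memory_capacity)
        while len(runs) > 1:                                        # step 3: repeat until one run remains
            runs = merge_pass(runs, k)
        return runs[0] if runs else []

    print(external_sort([9, 3, 7, 1, 8, 2, 6, 5, 4, 0]))            # [0, 1, 2, ..., 9]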

Since external storage devices play an important role in external sorting, we discuss sorting methods as applicable to two popular storage devices, namely magnetic tapes and magnetic disks, the latter commonly referred to as hard disks. The reason for the choice is that these devices are representative of two different genres and display different characteristics. While magnetic tapes are undoubtedly obsolete these days, it is worthwhile to go through the external sorting methods applicable to these devices, considering the numerous research efforts and innovations that went into them during their ‘heydays’!

Section 18.2 briefly discusses the external storage devices of magnetic tapes and disks. The external sorting method of balanced merge, applicable to files stored on both tapes and disks, is discussed in detail. Finally, a crisp description of the polyphase merge and cascade merge sort procedures is presented.

18.2. External storage devices

In this section, we briefly explain the characteristics of magnetic tapes and magnetic disks.

18.2.1. Magnetic tapes

Magnetic tape is a sequential device whose principle is similar to that of an audio tape/cassette device. It consists of a reel of magnetic tape, approximately ½” wide, and wound around a spool. Data is stored on the tape using the principle of magnetization. Each tape has about seven or nine tracks running lengthwise. A spot on the tape represents a 0 or 1 bit depending on the direction of magnetization. A combination of bits on the tracks, at any point along the length of the tape, represents a character. The number of bits per inch that can be written on the tape is known as tape density and is expressed as bpi (bits per inch). Magnetic tapes with densities of 800 bpi and 1600 bpi were in common use during the earlier days.

The magnetic tape device consists of two spindles. While one spindle holds the source reel, the other holds the take-up reel. During a forward read/write operation, the tape moves from the source reel to the take-up reel. Figure 18.1 illustrates a schematic diagram of the magnetic tape drive.


Figure 18.1 Schematic diagram of a magnetic tape drive

The data to be stored on a tape is written onto it in blocks. These blocks may be of fixed or variable size. A gap of ¾” is left between the blocks and is referred to as Inter Block Gap (IBG). The IBG is long enough to permit the tape to accelerate from rest to reach its normal speed before it begins to read the next block. Figure 18.2 shows the IBG of a tape.


Figure 18.2 Inter block gap of a tape

Magnetic tape is a sequential device since having read a block of data, if one desires to read another block that is several feet down the tape, then it is essential to fast forward the tape until the correct block is reached. Again, if we desire to read blocks of data that occur towards the beginning of the tape, then it is essential that the tape is rewound and the reading starts from the beginning onward. In these aspects, the characteristic of tapes is similar to that of audio cassettes.

18.2.2. Magnetic disks

Magnetic disks are still in vogue and are commonly referred to these days as hard disks. Hard disks are random access storage devices. This means that hard disks store data in such a manner that they permit both sequential access as well as random or direct access to data.

A disk pack is mountable on a disk drive and comprises platters that are similar to phonograph records. The number of platters in a disk pack varies according to its capacity. Figure 18.3 shows a schematic diagram of a disk pack comprising 6 platters.

Recording of data is done on all surfaces of the platters except the outer surfaces of the first and last platter. Thus, for a 6-platter disk pack, of the 12 surfaces available, data recording is done only on 10 of the surfaces. Each surface is accessed by a read/write head. The access assembly comprises an assembly of access arms ending in the read/write head. The access assembly moves in and out together with the access arms so that all the read/write heads at any point of time are stationed at the same position on the surface. During a read/write operation, the read/write head is held stationary over the appropriate position on the surface, while the disk rotates at high speed to enable the read/write operation. Disk speeds ranging from 3000 rpm to 7200 rpm are common these days.

Each surface of a platter, like a phonograph record, is made up of concentric circles of tracks of decreasing radii, on which the data is recorded. Modern versions of the hard disk contain tens of thousands of tracks per surface. The tracks are numbered from 0 beginning from the outer edge of the platter. The collection of tracks of the same radius, occurring on all the surfaces of the disk pack, is referred to as a cylinder (Figure 18.3). Thus, a disk pack is virtually viewed as a collection of cylinders of decreasing radii. Each track is divided into sectors, which are the smallest addressable segments of a track. Typically, a sector holds about 512 bytes of data. Early disk packs had all tracks holding the same number of sectors; modern versions have, however, dispensed with this feature in order to increase the storage capacity of the disk.


Figure 18.3 Schematic diagram of a disk pack

To access information on a disk, it is essential to first specify the cylinder number, followed by the track number and the sector number. A multilevel index-based ISAM file organization (section 14.7) is adopted for obtaining the physical locations of records stored on the disk. The cylinder index records the highest key in each cylinder and the cylinder number. The surface index or the track index stores the highest key in each track and the track number. Finally, the sector index records the highest key in each sector and the sector number. In practice, each of the index entries also contains other spatial information to help locate the records efficiently. Thus, the cylinder, track and sector indexes form a hierarchy of indexes that help identify the physical location of the record.
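A minimal sketch of how such a hierarchy of indexes could be searched is shown below. The index contents and the helper names are illustrative assumptions only; as noted above, real indexes carry additional spatial information.

    from bisect import bisect_left

    # Each index is a sorted list of (highest key in the unit, unit number) pairs (illustrative data).
    cylinder_index = [(300, 0), (620, 1), (999, 2)]
    track_index = {0: [(150, 0), (300, 1)], 1: [(480, 0), (620, 1)], 2: [(800, 0), (999, 1)]}
    sector_index = {(0, 0): [(70, 0), (150, 1)], (0, 1): [(230, 0), (300, 1)],
                    (1, 0): [(400, 0), (480, 1)], (1, 1): [(550, 0), (620, 1)],
                    (2, 0): [(700, 0), (800, 1)], (2, 1): [(900, 0), (999, 1)]}

    def find_unit(index, key):
        """Return the first unit whose highest key is >= the search key."""
        position = bisect_left([highest for highest, _ in index], key)
        return index[position][1]

    def locate(key):
        cylinder = find_unit(cylinder_index, key)
        track = find_unit(track_index[cylinder], key)
        sector = find_unit(sector_index[(cylinder, track)], key)
        return cylinder, track, sector

    print(locate(525))   # (1, 1, 0): cylinder 1, track 1, sector 0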

The read/write head moves across the cylinders to position itself on the right cylinder. The time taken to position the read/write head on the correct cylinder is known as seek time. Once the read/write head has positioned itself on the correct track of the cylinder, it has to wait for the right sector in the track to appear under the corresponding read/write head. The time taken for the right sector to appear under the read/write head is known as latency time or rotational delay. Once the sector is reached, the corresponding data is read or written onto the disk. The time taken for the transfer of data to and from the disk is known as data transmission time.
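The three components of access time can be combined in a back-of-the-envelope estimate of the cost of one block access. The figures used below (an average seek of 10 ms, 7200 rpm and a 50 MB/s transfer rate) are assumed purely for illustration.

    # Assumed, illustrative disk parameters
    average_seek_ms = 10.0        # time to position the head on the correct cylinder
    rpm = 7200                    # rotational speed of the disk
    transfer_rate_mb_s = 50.0     # sustained data transfer rate
    block_size_kb = 4.0           # size of the block being read

    # Average rotational delay (latency) is half a revolution
    latency_ms = 0.5 * (60.0 / rpm) * 1000                                 # about 4.17 ms at 7200 rpm
    transmission_ms = (block_size_kb / 1024) / transfer_rate_mb_s * 1000   # data transmission time

    total_ms = average_seek_ms + latency_ms + transmission_ms
    print(f"Estimated block access time: {total_ms:.2f} ms")   # about 14.24 ms; seek and latency dominate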

18.3. Sorting with tapes: balanced merge

Balanced merge sort makes use of an internal sorting technique to generate the runs and employs merging to gather the runs for the next phase of the sorting. The repeated run generation and merging continue until a single run generated in the final phase delivers the sorted file.

In this section, we discuss the balanced merge when the file resides on a tape. Besides the input tape, the sorting method has to make use of a few more work tapes to hold the runs that are generated from time to time and to perform the merging of the runs as well. Example 18.1 illustrates a balanced merge sort on tapes. The sorting method makes use of a two-way merge to gather the runs.

18.3.1. Buffer handling

While merging runs in the balanced merge sort procedure, it needs to be observed that due to the limited capacity of the internal memory of the computer, it is not always possible to completely accommodate the runs and the merged list in it. In fact, the problem gets severe as the phases in the sort procedure progress, since the runs get longer and longer.

To tackle this problem, in the case of a two-way merge let us say, we trifurcate the internal memory into blocks known as buffers. Two of these blocks are used as input buffers and the third as the output buffer. During the merge of two runs R1 and R2, for example, as many records as can be accommodated in the two input buffers are read from the runs R1 and R2 respectively. The merged records are sent to the output buffer. Once the output buffer is full, the records are written out onto the external storage device. If, during the merging process, any of the input buffers becomes empty, it is refilled with the next batch of records from the corresponding run.

Example 18.2 presents a naïve view of buffer handling. In reality, issues such as the proper choice of buffer lengths and the efficient utilization of program buffers to enable maximum overlapping of input/output and CPU processing times need to be attended to.
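The naïve buffer mechanism described above can be sketched as follows, with the runs modelled as Python iterators, an assumed buffer size of three records, and a write function standing in for the actual output device.

    def read_block(run_iterator, buffer_size):
        """Fill an input buffer with the next batch of records from a run."""
        block = []
        for _ in range(buffer_size):
            record = next(run_iterator, None)
            if record is None:
                break
            block.append(record)
        return block

    def buffered_two_way_merge(run1, run2, buffer_size=3, write=print):
        r1, r2 = iter(run1), iter(run2)
        in1, in2 = read_block(r1, buffer_size), read_block(r2, buffer_size)
        out = []                                        # output buffer
        while in1 or in2:
            # pick the smaller of the leading records of the two input buffers
            if not in2 or (in1 and in1[0] <= in2[0]):
                out.append(in1.pop(0))
                if not in1:                             # input buffer 1 empty: refill from run 1
                    in1 = read_block(r1, buffer_size)
            else:
                out.append(in2.pop(0))
                if not in2:                             # input buffer 2 empty: refill from run 2
                    in2 = read_block(r2, buffer_size)
            if len(out) == buffer_size:                 # output buffer full: write it out
                write(out)
                out = []
        if out:
            write(out)

    buffered_two_way_merge([2, 5, 8, 11, 14], [1, 3, 9, 10, 12, 13])   # writes [1, 2, 3] [5, 8, 9] ...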

18.3.2. Balanced P-way merging on tapes

In the case of a balanced two-way merge, if M runs were produced in the internal sorting phase and if 2^(k−1) < M ≤ 2^k, then the sort procedure makes ⌈log_2 M⌉ (= k) merging passes over the data records.

Now balanced merging can easily be generalized to make use of T tapes, T ≥ 3. We divide the T tapes into two groups, with P tapes in one group and (T − P) tapes in the other, where 1 ≤ P < T. The initial runs generated after internal sorting are evenly distributed onto the P tapes of the first group. A P-way merge is undertaken and the resulting runs are evenly distributed onto the second group containing (T − P) tapes. This is followed by a (T − P)-way merge of the runs available on the (T − P) tapes, with the output runs getting evenly distributed onto the P tapes of the first group, and so on. However, it has been shown that P = ⌈T/2⌉ is the best choice.
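A small simulation makes the alternating pattern of P-way and (T − P)-way merges visible. The sketch below only counts how many runs remain after each pass; distribution details and uneven groupings are glossed over, and the function name is an assumption.

    import math

    def balanced_p_way_trace(num_runs, T, P):
        """Trace the number of runs left after each pass of alternating P-way and (T-P)-way merges."""
        trace, fan_in, pass_no = [num_runs], [P, T - P], 0
        while num_runs > 1:
            num_runs = math.ceil(num_runs / fan_in[pass_no % 2])   # each group of runs becomes one run
            trace.append(num_runs)
            pass_no += 1
        return trace

    print(balanced_p_way_trace(100, T=6, P=3))   # [100, 34, 12, 4, 2, 1]: five merging passes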

Illustrative problem 18.4 discusses an example. Though balanced merging can be quite simple in its implementation, it needs to be seen if better merging patterns that save time and resources can be evolved for the specific cases in hand. Illustrative problems 18.5 and 18.6 discuss specific cases.

18.4. Sorting with disks: balanced merge

Tapes being sequential access devices, the balanced merge sort methods had to employ sizable resources for the efficient distribution of runs, besides spending time on mounting, dismounting and rewinding tapes. In the case of disks, which are random access storage devices, we are spared this burden. The seek and latency times to access blocks of data from a disk are negligible compared to the time taken to access blocks of data on tapes.

The balanced merge sort procedure for disk files, though similar in principle to that of tape files, is a lot simpler. The runs generated by the internal sorting methods are repeatedly merged until a single run emerges with the entire file sorted in the final pass. Example 18.3 demonstrates a balanced merge sort on a disk file.

18.4.1. Balanced k-way merging on disks

As discussed in Section 18.3.2, balanced two-way merge sort can be generalized to k-way merging. For a two-way merge, as can be deduced from Figure 18.4, the number of passes over the data is given by ⌈log_2 M⌉, where M is the number of runs in the first level of the merge tree. A higher order merge can serve to reduce the number of passes over the data. Thus, in the case of a k-way merge, k ≥ 2, the number of passes is given by ⌈log_k M⌉, where M is the number of runs. Figure 18.5 shows the merge tree for k = 4, for an initial generation of 16 runs in a specific case.


Figure 18.5 Balanced k-way merge sort: merging the runs for k = 4

Though the k-way merge can significantly reduce input/output time owing to the reduction in the number of passes, it is not without its ill effects. Let us suppose R1, R2, R3, …, Rk are the k runs to be merged, with sizes r_i, 1 ≤ i ≤ k. During a k-way merge, the next record to be output is the one with the smallest key. A direct method to find the smallest key would call for (k − 1) comparisons, so the computing time to merge the k runs is O((k − 1)·(r_1 + r_2 + … + r_k)). Since ⌈log_k M⌉ passes are being made, the total number of key comparisons is given by n(k − 1)·⌈log_k M⌉, where n is the total number of records in the source file. We have n(k − 1)·log_k M = n(k − 1)·log_2 M / log_2 k. In other words, for a k-way merge sort the number of key comparisons increases by a factor of (k − 1)/log_2 k over a two-way merge. Thus, for large k (k ≥ 6), the CPU time needed to perform the k-way merge will outweigh the reduction achieved in input/output time due to the reduction in the number of passes. A significant reduction in the number of comparisons needed to find the smallest key can be achieved by using what is known as a selection tree.
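The direct method referred to above, which scans the leading keys of the k runs for the minimum, can be sketched as shown below; the run contents are illustrative. The point is simply that every record output costs up to (k − 1) key comparisons.

    def k_way_merge_direct(runs):
        """Merge k sorted runs by linearly searching the k leading keys for the smallest."""
        positions = [0] * len(runs)          # index of the next unread record in each run
        merged = []
        while True:
            smallest = None
            for i, run in enumerate(runs):   # up to k-1 comparisons to find the smallest leading key
                if positions[i] < len(run):
                    if smallest is None or run[positions[i]] < runs[smallest][positions[smallest]]:
                        smallest = i
            if smallest is None:             # every run has been exhausted
                return merged
            merged.append(runs[smallest][positions[smallest]])
            positions[smallest] += 1

    runs = [[5, 7, 8], [6, 9, 9], [2, 4, 5], [1, 7, 8]]
    print(k_way_merge_direct(runs))          # [1, 2, 4, 5, 5, 6, 7, 7, 8, 8, 9, 9]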

18.4.2. Selection tree

A selection tree is a complete binary tree that serves to obtain the smallest key from among a set of keys. Each internal node represents the smaller of its two children and external nodes represent the keys from which the selection of the smallest key needs to be made. The root node represents the smallest key that was selected.

Figure 18.6(a) represents a selection tree for an eight-way merge. The eight lists to be merged are L1 (5, 7, 8), L2 (6, 9, 9), L3 (2, 4, 5), L4 (1, 7, 8), L5 (3, 6, 9), L6 (5, 5, 6), L7 (3, 4, 9), L8 (6, 8, 9). The external nodes represent the first set of 8 keys that were selected from the lists. Progressing from the bottom up, each of the internal nodes represents the smaller key of its two children until at the root node the smallest key gets automatically represented. The construction of the selection tree can be compared to a tournament being played with each of the internal nodes recording the winners of the individual matches. The final winner is registered by the root node. A selection tree, therefore, is also referred to as a tree of winners.


Figure 18.6 Selection tree for an eight-way merge

In this case, the smallest key, namely 1, is dropped into the output list. Now the next key from L4, namely 7, enters the corresponding external node. It is now essential to restructure the tree to determine the next winner. Observe how it is sufficient to restructure only that portion of the tree occurring along the path from the node numbered 11 to the root node. The revised key values of the internal nodes are shown in Figure 18.6(b). Note how, in this case, the keys compared and revised along the path are (2, 7), (5, 2) and (2, 3). The root node now represents 2, which is the next smallest key.

In practice, the external nodes of the selection tree are represented by the records and the internal nodes are only pointers to the records which are winners. For ease of understanding the internal nodes in Figure 18.6 were represented using the keys themselves, though in reality, they are only pointers to the winning records.
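The following sketch stores a selection (winner) tree in an array-based complete binary tree and replays matches only along a single leaf-to-root path when an external node is replaced. For simplicity the nodes hold keys rather than pointers to records, and the function names are assumptions.

    def build_winner_tree(keys):
        """tree[1..k-1] hold the winners of the subtrees; tree[k..2k-1] are the external nodes."""
        k = len(keys)                        # assumed to be a power of two
        tree = [None] * k + list(keys)
        for i in range(k - 1, 0, -1):
            tree[i] = min(tree[2 * i], tree[2 * i + 1])
        return tree

    def replace_and_replay(tree, leaf, new_key):
        """Replace one external node and replay the matches along its path to the root."""
        k = len(tree) // 2
        i = k + leaf
        tree[i] = new_key
        while i > 1:
            i //= 2
            tree[i] = min(tree[2 * i], tree[2 * i + 1])
        return tree[1]                       # the current overall winner (smallest key)

    leaves = [5, 6, 2, 1, 3, 5, 3, 6]        # first keys of the lists L1..L8 of Figure 18.6
    tree = build_winner_tree(leaves)
    print(tree[1])                           # 1, the first winner (from L4)
    print(replace_and_replay(tree, 3, 7))    # the next key 7 from L4 enters; 2 (from L3) now wins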

Despite its merits, a selection tree can result in increased overheads associated with maintaining the tree. This happens especially when the restructuring of the tree takes place to determine the next winner. It can be seen that when the next key walks into the tree, tournaments have to be played with sibling nodes that were earlier losers.

Note how in the case of 7 entering the tree, the tournaments played were between (2,7), (5,2) and (2,3), where 2, 5 and 3 were losers in the earlier case. It would therefore be prudent if the internal nodes could represent the losers rather than the winners. A tournament tree in which each internal node retains a pointer to the loser is called a tree of losers.

Figure 18.7(a) illustrates the tree of losers for the selection tree discussed in Figure 18.6. Node 0 is a special node that shows the winner. As said earlier, each of the internal nodes is shown carrying the key when in reality they represent only pointers to the loser records. To determine the smallest key, as before, a tournament is played between pairs of external nodes. Though the winners are ‘remembered’, it is the losers that the internal nodes are made to point to. Thus, nodes numbered ((4), (5), (6) and (7)) record pointers to the losing external nodes, which are the ones with the key values of 6, 2, 5 and 6, respectively. Now node numbered (2) conducts a tournament between the two winners of the earlier game, which are key values 5 and 1 and records the pointer to the loser which is 5. In a similar way, node numbered (3) records the pointer to the loser node with key value 3. Progressing in this way the tree of losers is constructed and node 0 outputs the winning key value which is the smallest.

Once the smallest key, which is 1, has been output and the next key 7 enters the tree, the restructuring is easier, since the keys with which the tournaments are to be played are the earlier losers, and these are directly pointed to by the internal nodes along the path to the root. The restructured tree is shown in Figure 18.7(b).


Figure 18.7 Tree of losers for the eight-way merge
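A matching sketch of the tree of losers, using the same array-based representation, is given below. Internal node i retains the loser of the match played at i, while node 0 remembers the overall winner; once again, keys stand in for pointers to the records, and the names are assumptions.

    def build_loser_tree(keys):
        """Return the tree: tree[1..k-1] hold the losers and tree[0] the overall winner."""
        k = len(keys)                               # assumed to be a power of two
        tree = [None] * k + list(keys)
        winners = [None] * k + list(keys)           # winner of the subtree rooted at each node
        for i in range(k - 1, 0, -1):
            left, right = winners[2 * i], winners[2 * i + 1]
            winners[i] = min(left, right)           # the winner moves up to play the next match
            tree[i] = max(left, right)              # the loser stays behind at this node
        tree[0] = winners[1]
        return tree

    def replay_losers(tree, leaf, new_key):
        """Replace a leaf; on the way up, matches are replayed only against the stored losers."""
        k = len(tree) // 2
        i = k + leaf
        tree[i] = new_key
        winner = new_key
        while i > 1:
            i //= 2
            if tree[i] < winner:                    # the stored loser beats the incoming winner
                tree[i], winner = winner, tree[i]   # new winner moves up, new loser stays behind
        tree[0] = winner
        return winner

    leaves = [5, 6, 2, 1, 3, 5, 3, 6]               # first keys of the lists L1..L8 of Figure 18.7
    tree = build_loser_tree(leaves)
    print(tree[0])                                  # 1, the overall winner
    print(replay_losers(tree, 3, 7))                # 7 from L4 enters; 2 becomes the new winner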

18.5. Polyphase merge sort

Balanced k-way merge sort on tapes calls for an even distribution of runs on the tapes and, to enable efficient merging, requires 2k tapes so as to avoid wasteful passes over the data. Thus, while k tapes act as input devices holding the runs generated, the other k tapes act as output devices to receive the merged runs. The two groups of k tapes swap roles in successive passes until a single run emerges on one of the tapes, signaling the end of the sort.

It is possible to avoid wasteful redistribution of runs on the tapes, while using fewer than 2k tapes, by means of a wisely thought-out run distribution strategy. Polyphase merge is one such external sorting method that makes use of an intelligent redistribution of runs during merging, so much so that a k-way merge requires only (k + 1) tapes!

The central principle of the method is to ensure that in each pass (except the last of course!) during the merge, the runs are to be cleverly distributed so that one tape is always rendered empty while the other k tapes hold the input runs that are to be merged! The empty tape for the current pass acts as the output tape for the next pass and so on. Ultimately, as in balanced merge sort, the final pass delivers only one run in one of the tapes.

At this point, we introduce a useful notation mentioned in the literature to enable a crisp presentation of run distributions. Runs that are initially generated by internal sorting are taken to be of length 1 (unit of measure). Thus, if there are t runs that are initially generated, the notation describes them as 1^t. For example, if there were 34 runs initially generated, this would be represented as 1^34. Similarly, if after a merge there were 14 runs of size 2, this would be represented as 2^14. In general, t runs of size s are represented as s^t.

Example 18.4 illustrates polyphase merge on 3 tapes.

Let us suppose that ‘intuitively’ we decided to distribute 13 runs of size 1 and 21 runs of size 1 onto tapes T1 and T2 respectively, that is, 1^13 on T1 and 1^21 on T2. In phase 2, because it is a two-way merge and polyphase merge expects one tape to fall vacant in every phase, we use up all the 13 runs of size 1 on tape T1 in a merge operation with an equal number of runs on tape T2. This yields 13 runs of double the size (2^13), which are written onto the empty tape T3. That leaves 8 runs of size 1 (1^8) on tape T2 that could not be used up, and renders tape T1 empty. Again, in phase 3, the 1^8 runs on tape T2 are merged with an equal number of runs on tape T3 to obtain 3^8, which is written onto tape T1. This leaves a balance of 2^5 runs on tape T3 and renders tape T2 empty. The phases continue until, in phase 8, a single run 34^1 gets written onto tape T3.

To determine how the initial distribution of 1^13 and 1^21 was conceived, we work backward from the last phase. Let us suppose there were n phases for the 3-tape case. In the nth phase, we should arrive at exactly one run on tape T3 (say), with tapes T1 and T2 totally empty. This implies that in phase (n − 1) there should have been one run on each of tapes T1 and T2, which were merged into the single run on T3 in the nth phase. Continuing in this fashion, we obtain the initial distribution of runs to be 1^13 and 1^21 on the two tapes respectively. Table 18.2 lists the run distribution for a 3-tape polyphase merge.

It can be easily seen that the number of runs needed for an n-phase merge is given by F_n + F_(n−1), where F_i is the ith Fibonacci number. Hence, this method of redistribution of runs is known as the Fibonacci merge. The method can be clearly generalized to k-way merging on (k + 1) tapes using generalized Fibonacci numbers.

Table 18.2 Run distribution for a 3-tape polyphase merge

Phase   Tape T1   Tape T2   Tape T3
n          0         0         1
n-1        1         1         0
n-2        2         0         1
n-3        0         2         3
n-4        3         5         0
n-5        8         0         5
n-6        0         8        13
n-7       13        21         0
n-8       34         0        21
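The phase-by-phase behaviour traced above, and summarized in Table 18.2, can be reproduced with a short simulation. The sketch below tracks only run counts and run lengths, not actual records, and assumes the ideal Fibonacci distribution as its starting point; the function name is an assumption.

    def polyphase_trace(runs_on_t1, runs_on_t2):
        """Simulate 3-tape polyphase merging, printing the number of runs on each tape per phase."""
        # Each tape holds a list of run lengths; size-1 runs come from the internal sorting phase.
        tapes = [[1] * runs_on_t1, [1] * runs_on_t2, []]
        phase = 1
        print(phase, [len(tape) for tape in tapes])
        while sum(len(tape) > 0 for tape in tapes) > 1:     # stop when one tape holds a single run
            out = tapes.index([])                           # the empty tape receives the merged runs
            ins = [i for i in range(3) if i != out]
            merges = min(len(tapes[i]) for i in ins)        # merge until the shorter input tape empties
            for _ in range(merges):
                tapes[out].append(sum(tapes[i].pop(0) for i in ins))
            phase += 1
            print(phase, [len(tape) for tape in tapes])

    polyphase_trace(13, 21)   # run counts per phase: [13, 21, 0] [0, 8, 13] [8, 0, 5] ... [0, 0, 1]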

18.6. Cascade merge sort

Cascade merge is another intelligent merge pattern, one that was in fact discovered before the polyphase merge. The merge pattern makes use of a perfectly devised initial distribution of runs on the tapes. While polyphase merge sort employs a uniform merge pattern during run generation, cascade merge sort makes use of a ‘cascading’ merge pattern in each of its passes. Thus, for t tapes, while polyphase merge uniformly employs a (t − 1)-way merge for its run generation, cascade merge sort employs a (t − 1)-way merge, a (t − 2)-way merge and so on, within the same pass, for its run generation.

Example 18.5 demonstrates cascade merge on 6 tapes for an initial generation of 55 runs of length 1. We make use of the run distribution notation introduced in section 18.5.

As before, let us assume that the initial distribution of (1^15, 1^14, 1^12, 1^9, 1^5) runs on the tapes (T1, T2, T3, T4, T5) was devised through some ‘intuitive’ means.

In pass 1, we undertake a series of merges. A five-way merge on (T1, T2, T3, T4, T5) yields the runs 5^5, which are put onto tape T6. A four-way merge on (T1, T2, T3, T4) yields 4^4, which is put onto tape T5. A three-way merge on (T1, T2, T3) yields 3^3, which is put onto tape T4. A two-way merge on (T1, T2) yields 2^2, which is put onto tape T3. Lastly, a one-way merge (which is mere copying of the balance run) on T1 yields 1^1, which is copied onto tape T2. Of course, one could do away with the one-way merge, which is mere copying of the run, and retain the run on the very tape itself. In pass 1, tape T1 falls empty.

In pass 2, we repeat the cascading merge, wherein the five-way merge on (T2, T3, T4, T5, T6) yields the run 15^1, a four-way merge on (T3, T4, T5, T6) yields 14^1 and so on, until at the end of pass 2 the distribution of runs on the tapes is as shown in the table. This is the penultimate pass; observe how the distribution records one run on each of the tapes. In the final pass, as always, the five-way merge releases a single run of size 55, which is the final sorted file.

Table 18.4 Run distribution on five tapes by cascade merge

Phase    T1     T2     T3     T4     T5
n          1      0      0      0      0
n-1        1      1      1      1      1
n-2        5      4      3      2      1
n-3       15     14     12      9      5
n-4       55     50     41     29     15
n-5      190    175    146    105     55
n-6      671    616    511    365    190

Now, how does one arrive at the perfect initial distribution? As was done for polyphase merge, this could be arrived at by working backward from the goal state of (1, 0, 0, 0, 0) obtained during the nth pass. Table 18.4 illustrates the run distribution by cascade merge on five tapes.
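The backward computation can indeed be mechanized. The short sketch below, written under the assumption of five working tapes plus an output tape, works backward from (1, 0, 0, 0, 0) and reproduces the rows of Table 18.4; the function name is an assumption.

    def cascade_distributions(tapes=5, levels=7):
        """Work backward from (1, 0, 0, ..., 0) to obtain the perfect cascade merge distributions."""
        distribution = [1] + [0] * (tapes - 1)
        rows = [distribution]
        for _ in range(levels - 1):
            # the previous-level count on tape i is the sum of the first (tapes - i) current counts
            distribution = [sum(distribution[:tapes - i]) for i in range(tapes)]
            rows.append(distribution)
        return rows

    for row in cascade_distributions():
        print(row, "total runs:", sum(row))
    # ... [15, 14, 12, 9, 5] total runs: 55, the initial distribution used in Example 18.5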

For an in-depth analysis of merge patterns and other external sorting schemes, the motivated reader is referred to Volume III of the classic book “The Art of Computer Programming” (Knuth 1998).

Summary

  • External sorting deals with sorting of files or lists that are too huge to be accommodated in the internal memory of the computer and hence need to be stored in external storage devices such as disks or drums.
  • The principle behind external sorting is to first make use of any efficient internal sorting technique to generate runs. These runs are then merged in passes to obtain a single run, at which stage the file is deemed sorted. The merge patterns called for by the strategies are influenced by the external storage medium on which the runs reside, viz., disks or tapes.
  • Magnetic tapes are sequential devices built on the principle of audio tape devices. Data is stored in blocks occurring sequentially. Magnetic disks are random access storage devices. Data stored in a disk is addressed by its cylinder, track and sector numbers.
  • Balanced merge sort is a technique that can be adopted for files residing on both disks and tapes. In its general form, a k-way merge is undertaken over the runs. For the efficient management of the merging of runs, buffer handling and selection tree mechanisms are employed.
  • Balanced k-way merge sort on tapes calls for the use of 2k tapes for the efficient management of runs. Polyphase merge sort is a clever strategy that makes use of only (k+1) tapes to perform the k-way merge. The distribution of runs on the tapes follows a Fibonacci number sequence.
  • Cascade merge sort is yet another smart strategy which unlike polyphase merge sort does not employ a uniform merge pattern. Each pass makes use of a ‘cascading’ sequence of merge patterns.

18.7. Illustrative problems

Review questions

  1. State whether true or false:
    (i) Cascade merge sort adopts uniform merge patterns in its passes.
    (ii) The distribution of runs in the last pass of cascade merge sort is given by a pattern such as (1, 0, 0, …0).
    1. (i) true (ii) true
    2. (i) true (ii) false
    3. (i) false (ii) false
    4. (i) false (ii) true
  2. Polyphase merge sort for a k-way merge on tapes requires ------------ tapes.
    1. 2k
    2. (k-2)
    3. (k+1)
    4. k
  3. The time taken for the right sector to appear under the read/write head is known as:
    1. seek time
    2. latency time
    3. transmission time
    4. data read time
  4. In the case of a balanced two-way merge, if M runs were produced in the internal sorting phase and if 2^(k−1) < M ≤ 2^k, then the sort procedure makes --------- merging passes over the data records.
    1. M
    2. equation
    3. equation
    4. M^2
  5. Match the following:
     W. Magnetic tape        A. tree of winners
     X. Magnetic disks       B. Fibonacci merge
     Y. Polyphase merge      C. Inter Block Gap
     Z. k-way merge          D. platters
    1. (W A) (X B) (Y D) (Z C)
    2. (W C) (X D) (Y B) (Z A)
    3. (W C) (X D) (Y A) (Z B)
    4. (W A) (X B) (Y C) (Z D)
  6. What is the general principle behind external sorting?
  7. How is a selection tree useful in a k-way merge?
  8. What are the advantages of Polyphase merge sort over balanced k-way merge sort?
  9. What is the principle behind the distribution of runs in a cascade merge sort?
  10. How is data organized in a magnetic disk?
  11. An inventory record contains the following fields: ITEM NUMBER (8 bytes), NAME (20 bytes), DESCRIPTION (20 bytes), TOTAL STOCK (10 bytes), PRICE (10 bytes), TOTAL PRICE (14 bytes).

    A record comprising the data on item number, name, description and total stock is to be read and, based on the current price which is input, the total price is to be computed and updated in the fields. There are 25,000 records to be processed. Assuming the disk characteristics given in Table P18.1:

    1. How much storage space is required to store the entire file in the disk (in terms of bytes/KB/MB)?
    2. How much storage space is required to store the entire file in the disk in terms of cylinders?
    3. What is the time required to read, process and write back a given sector of records into the disk, assuming that it takes 100 µs to process a record?
    4. What is the time required to read, process and write back an entire track of records if they were read sequentially sector after sector?
    5. What is the time required to read, process and write back an entire cylinder of records?
    6. What is the time required to read, process and write back the records in the next (immediate) cylinder?
    7. What is the time required to read, process and write back the entire file onto the disk?
  12. A file comprising 500,000 records is to be sorted. The internal memory has the capacity to hold only 50,000 records. Trace the steps of a balanced k-way merge for (i) k = 2 and (ii) k = 4, when (a) the file is available on a tape and (b) the file is available on a disk. Assume the availability of any number of tapes and a scratch disk for undertaking the appropriate sorting process.

Programming assignments

  1. Implement a function to construct a tree of winners to obtain the smallest key from a list of keys representing its external nodes.
  2. Implement a function to construct a tree of losers to obtain the smallest key from a list of keys representing its external nodes.
  3. Making use of the function(s) developed in programming assignment 1 (and programming assignment 2), implement k-way merge algorithms for any given value of k.
  4. Implement a balanced k-way merge sort for disk-based files. Simulate the program for various sizes of files, internal memory capacity and choice of k. Graphically display the distribution of runs.