Linear indexing is efficient on static databases, that is, records from the database are rarely inserted or deleted. ISAM improves the performance of linear indexing, and can be used for limited updates of the database. As ISAM uses a two-level linear indexing schema, it would break down for a database where the top-level index is already too big to fit into the memory. Thus, as databases become large, we require better organization methods. One approach proposed in Chapter 6, Exploring Search Options, is that a binary search tree could potentially be utilized for indexing to store the primary and secondary keys. The binary search tree provides an efficient structure to store duplicates, and to perform operations such as deletion and insertion given that sufficient memory is available. However, the only disadvantage with a binary search tree is that it could become unbalanced.
Unbalancing becomes an issue, especially in a scenario when the tree is stored in the disk, as any operation requires data to be loaded from the disk to the memory on the path to leaf node. Thus, to minimize operation time such as insertion or search, it is recommended to store each subtree in the same block, as shown in Figure 7.8:
Figure 7.8: Example of breaking BST into blocks
In Figure 7.8, the subtree in block is loaded; thus, in the current scenario, any operation requires two block load. The R script for the BST node structure is as follows:
bstnode <- function(key, value) {
node <- new.env(hash = FALSE, parent = emptyenv())
node$key <- key # Node key
node$value <- value # Node Value
node$left <- NULL # left children key
node$right <- NULL # Right children key
class(node) <- "bstnode"
return(node)
}
The load requirement could drastically increase if the tree is unbalanced, as shown in Figure 7.9, unless the whole tree resides in the main memory, which will keep operation time restricted to O(log n), where n is the tree depth:
Figure 7.9: Example of unbalanced tree
The two major challenges to be addressed in tree-based indexing are the following:
Balancing the tree in BST is quite expensive as, usually, balancing requires reorganization of data. A 2-3 tree, discussed in next section, is an initial framework to balance a tree by keeping the leaves at the same level. The 2-3 trees are further extended to B-trees, which will be discussed later in the B-trees section.