Tree-based indexing

Linear indexing is efficient on static databases, that is, records from the database are rarely inserted or deleted. ISAM improves the performance of linear indexing, and can be used for limited updates of the database. As ISAM uses a two-level linear indexing schema, it would break down for a database where the top-level index is already too big to fit into the memory. Thus, as databases become large, we require better organization methods. One approach proposed in Chapter 6, Exploring Search Options, is that a binary search tree could potentially be utilized for indexing to store the primary and secondary keys. The binary search tree provides an efficient structure to store duplicates, and to perform operations such as deletion and insertion given that sufficient memory is available. However, the only disadvantage with a binary search tree is that it could become unbalanced.

Unbalancing becomes an issue, especially in a scenario when the tree is stored in the disk, as any operation requires data to be loaded from the disk to the memory on the path to leaf node. Thus, to minimize operation time such as insertion or search, it is recommended to store each subtree in the same block, as shown in Figure 7.8:

Tree-based indexing

Figure 7.8: Example of breaking BST into blocks

In Figure 7.8, the subtree in block is loaded; thus, in the current scenario, any operation requires two block load. The R script for the BST node structure is as follows:

bstnode <- function(key, value) { 
  node <- new.env(hash = FALSE, parent = emptyenv()) 
  node$key <- key           # Node key 
  node$value <- value       # Node Value 
  node$left <- NULL         # left children key 
  node$right <- NULL        # Right children key 
  class(node) <- "bstnode" 
  return(node) 
}

The load requirement could drastically increase if the tree is unbalanced, as shown in Figure 7.9, unless the whole tree resides in the main memory, which will keep operation time restricted to O(log n), where n is the tree depth:

Tree-based indexing

Figure 7.9: Example of unbalanced tree

The two major challenges to be addressed in tree-based indexing are the following:

  • How to keep the tree balanced
  • How to minimize the path from the root node to the leaf node

Balancing the tree in BST is quite expensive as, usually, balancing requires reorganization of data. A 2-3 tree, discussed in next section, is an initial framework to balance a tree by keeping the leaves at the same level. The 2-3 trees are further extended to B-trees, which will be discussed later in the B-trees section.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset