Chapter 2
The Power of NoSQL and MongoDB

This chapter explains NoSQL along with a comparison to traditional relational databases. We briefly discuss the four main types of NoSQL databases: document, key-value, column-oriented, and graph. Then we cover what makes MongoDB unique followed by how to install MongoDB on your computer.

NoSQL vs. the Traditional Relational Database

NoSQL is a name for the category of databases built on non-relational technology. NoSQL is not a good name for what it represents, as it is less about how to query the database (which is where SQL comes in) and more about how the data is stored (which is where relational structures come in). Even Carlo Strozzi, who first used the term NoSQL in 1998 to name his lightweight, open-source relational database that did not expose the standard SQL interface, suggests that “NoREL” would have been a better term than NoSQL [Wikipedia]. However, because the term NoSQL is in widespread use today, we will continue to use this term in this chapter.

There are four factors that distinguish traditional relational databases (abbreviated as “RDBMS”) from NoSQL databases: variety, structure, scaling, and focus. Here is a table summarizing these differences:

|           | RDBMS                 | NoSQL                                                            |
|-----------|-----------------------|------------------------------------------------------------------|
| Variety   | One type (relational) | Four main types: document, column-oriented, key-value, and graph |
| Structure | Predefined            | Dynamic                                                          |
| Scaling   | Primarily vertical    | Primarily horizontal                                             |
| Focus     | Data integrity        | Data performance and availability                                |

Variety

RDBMS have one type of structure (relational) where individual records are stored as rows in tables, with each column storing a specific piece of data about that record. When data is needed from more than one table, these tables are joined together. For example, offices might be stored in one table and employees in another. When a user wants to find the work address of an employee, the Employee and Office tables are joined together. NoSQL has four main varieties, and each will be described shortly: document, column-oriented, key-value, and graph. Each variety has its own way of storing data. An RDBMS has traditionally been chosen as the solution for every type of operational or analytical scenario. NoSQL solutions, however, excel in particular scenarios and therefore are used for specific types of situations (e.g., a graph database is often used in scenarios where a document database would be inefficient).
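
To make the join in the office example concrete, here is a minimal sketch in plain JavaScript. The two arrays stand in for the Office and Employee tables; the names, addresses, and the workAddress function are hypothetical and only illustrate the idea of matching rows on a shared column.

```javascript
// Hypothetical sample data: two "tables" held as arrays of rows.
const offices = [
  { officeId: 1, address: "100 Main Street" },
  { officeId: 2, address: "25 Harbor Road" },
];
const employees = [
  { employeeId: 10, name: "Ana", officeId: 2 },
  { employeeId: 11, name: "Raj", officeId: 1 },
];

// Join Employee to Office on officeId to find an employee's work address.
function workAddress(employeeName) {
  const employee = employees.find((e) => e.name === employeeName);
  if (!employee) return null; // no such employee
  const office = offices.find((o) => o.officeId === employee.officeId);
  return office ? office.address : null;
}

const anaAddress = workAddress("Ana"); // "25 Harbor Road"
```

A relational database performs this matching for us when we join the two tables in a query; the sketch only shows what the join has to accomplish.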

Structure

The RDBMS database structure is predefined. To store information about a new property, we need to modify the structure before we can add the data. Before we can add all of our employees’ birth dates, we first need to create the Employee Birth Date field. The NoSQL structures are typically dynamic. New types of information can be added as needed without having to reload data or rebuild database structures.
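
The difference between a predefined and a dynamic structure can be sketched in a few lines of plain JavaScript. This is only an illustration under my own assumptions: createTable is a hypothetical stand-in for a relational table that rejects unknown columns, and the array stands in for a collection of documents. Neither is how a real database is implemented.

```javascript
// A relational-style table: the set of columns is fixed up front.
function createTable(columns) {
  const rows = [];
  return {
    columns,
    rows,
    insert(row) {
      for (const field of Object.keys(row)) {
        if (!columns.includes(field)) {
          throw new Error(`Unknown column: ${field}`);
        }
      }
      rows.push(row);
    },
    addColumn(name) {
      columns.push(name); // the structure change that must come first
    },
  };
}

const employeeTable = createTable(["name"]);
let rejected = false;
try {
  employeeTable.insert({ name: "Ana", birthDate: "1990-04-12" });
} catch (err) {
  rejected = true; // birthDate is not part of the structure yet
}
employeeTable.addColumn("birthDate");
employeeTable.insert({ name: "Ana", birthDate: "1990-04-12" });

// A document-style collection: each document carries its own fields.
const employeeCollection = [];
employeeCollection.push({ name: "Raj" });
employeeCollection.push({ name: "Ana", birthDate: "1990-04-12" });
```

In the first half, the birth date can only be stored after the column is added. In the second half, the new field simply arrives with the data.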

Scaling

RDBMS are usually just scaled vertically, meaning a single server must be upgraded in order to deal with increased demand. NoSQL databases are usually scaled horizontally, meaning that to add capacity, a database administrator can add more inexpensive servers or cloud instances. The database automatically replicates and/or divides data across servers as necessary. In MongoDB, dividing data across servers is called “sharding,” and copying data across servers is called replication.
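
Dividing data across servers can be pictured with a small sketch. Everything here is an assumption made for illustration: toyHash and createCluster are hypothetical names, each “server” is just an in-memory Map, and the routing rule (hash of the key modulo the number of servers) is a deliberately simple scheme, not the mechanism MongoDB itself uses.

```javascript
// Toy hash: sum of character codes. Real systems use stronger hash functions.
function toyHash(key) {
  let total = 0;
  for (const ch of key) total += ch.charCodeAt(0);
  return total;
}

// Each "server" is an in-memory Map; a key always routes to the same server.
function createCluster(serverCount) {
  const servers = Array.from({ length: serverCount }, () => new Map());
  const pick = (key) => servers[toyHash(key) % serverCount];
  return {
    servers,
    put(key, value) {
      pick(key).set(key, value);
    },
    get(key) {
      return pick(key).get(key);
    },
  };
}

const cluster = createCluster(3);
const keys = ["order-1001", "order-1002", "order-1003", "order-1004"];
keys.forEach((key, i) => cluster.put(key, { total: (i + 1) * 10 }));

const found = cluster.get("order-1003"); // looked up on the server that owns the key
```

The application asks the cluster for a key and never needs to know which server holds it. That routing step is what lets capacity grow by adding servers rather than by upgrading one machine.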

Focus

With RDBMS, the focus is on data integrity. A handy acronym to remember is ACID (Atomic, Consistent, Isolated, and Durable):

  • Atomic. Everything within a transaction succeeds or the entire transaction is rolled back. For example, if an Order is deleted, all of its Order Lines are also deleted. When I transfer $10 from my savings to checking account, $10 is deducted from my savings account and credited to my checking account.
  • Consistent. Consistent means that the data accurately reflects any changes up to a certain point in time. With RDBMS it is the current state, which is achieved through database constraints. A transaction cannot leave the database in an inconsistent state. If an Order Line exists without its Order, recreate the Order or remove the Order Line. If $10 was deducted from my savings account but not credited to my checking account, roll the transaction back so the $10 goes back to my savings account.
  • Isolated. Transactions cannot interfere with each other. That is, transactions are independent. The $10 transferred from my savings to checking account is a separate transaction from the $20 check I wrote, and these two transactions (fund transfer and fund withdrawal) are separate from each other.
  • Durable. Completed transactions persist even when servers restart or there are power failures. That is, there is no undo button. Once the Order and its Order Lines are deleted, they are gone from the system even if electricity is lost in the building in which the server resides.
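
The “all or nothing” behavior of an atomic transaction can be sketched with the savings and checking example. This is a toy illustration in plain JavaScript, not how a database implements transactions: the transfer function, the account names, and the snapshot-and-restore approach are my own assumptions.

```javascript
// Move money between accounts; if any step fails, restore the starting state.
function transfer(accounts, from, to, amount) {
  const snapshot = { ...accounts }; // state to restore on failure
  try {
    if (accounts[from] < amount) throw new Error("Insufficient funds");
    accounts[from] -= amount; // step 1: deduct
    if (!(to in accounts)) throw new Error("Unknown account");
    accounts[to] += amount; // step 2: credit
    return true;
  } catch (err) {
    Object.assign(accounts, snapshot); // roll back: the deduction is undone
    return false;
  }
}

const accounts = { savings: 100, checking: 50 };
const firstOk = transfer(accounts, "savings", "checking", 10); // both steps succeed
const secondOk = transfer(accounts, "savings", "brokerage", 10); // step 2 fails, step 1 is undone
```

After both calls, savings holds 90 and checking holds 60. The failed transfer leaves no trace, which is the point of atomicity.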

With NoSQL, the focus is less on data integrity and more on data performance and availability. A handy acronym to remember is BASE (Basically Available, Soft-state, Eventual consistency):

  • Basically Available. There is a response to every query, but it could be a response saying there was a failure in getting the data or the response that comes back may be in an inconsistent or changing state. As an Amazon Seller, for example, when I check my book inventory levels, there is usually a warning in bold red letters saying the data I am viewing is incomplete or more than 24 hours old and therefore may not be accurate.
  • Soft-state. Soft-state means the NoSQL database plays “catch up,” updating the database continuously even with changes that occurred earlier in the day. Although no one might have purchased a book from me between 2 and 3 am, for example, there might still be transactions occurring during this time that adjust my inventory levels for purchases from the prior day.
  • Eventual consistency. Recall that “consistent” means that the data accurately reflects any changes up to a certain point in time. With NoSQL, the system will eventually become consistent once it stops receiving input. It is acceptable for the results not to be 100% current, assuming the accuracy eventually catches up at the end of the week or month. Even though inventory adjustments might occur minutes or hours after book purchases, eventually the inventory levels I am viewing will be accurate.
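
Eventual consistency can be pictured with a small simulation. This is a sketch under my own assumptions: createStore is a hypothetical name, the “primary” and “replica” are two in-memory Maps, and sync stands in for whatever background process copies changes between servers in a real system.

```javascript
function createStore() {
  const primary = new Map();
  const replica = new Map();
  const pending = []; // changes not yet copied to the replica
  return {
    write(key, value) {
      primary.set(key, value);
      pending.push([key, value]);
    },
    readFromReplica(key) {
      return replica.get(key);
    },
    // Apply queued changes; once input stops, the replica catches up.
    sync() {
      while (pending.length > 0) {
        const [key, value] = pending.shift();
        replica.set(key, value);
      }
    },
  };
}

const inventory = createStore();
inventory.write("book-123", 7);
const staleRead = inventory.readFromReplica("book-123"); // undefined: not replicated yet
inventory.sync();
const freshRead = inventory.readFromReplica("book-123"); // 7
```

The first read still gets a response, just not a current one. The second read, after the queued change is applied, reflects the write.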

Four Types of NoSQL Databases

Here is a sketch of the traditional RDBMS structure for storing author- and title-related information:

The RDBMS structure stores like information together such as authors in the Author table and titles in the Title table. The lines connecting these tables represent the rules (also known as “database constraints”) that enforce how many values of one table can connect to how many values of the other table. For example, the line between Author and Title would capture that an author can write many titles and a title can be written by many authors. If we wanted to know who wrote a particular title, we would need to join Title with Author.

There is also a line from Author to Author, which captures any relationship between authors such as one author who wrote a book review for another author or two authors who are on the same tennis team, etc. A constraint that starts and ends on the same table, such as this author constraint, is called a recursive relationship. (Recursive relationships will be discussed in the next chapter.)

Let’s contrast this traditional way of structuring data with the four varieties of NoSQL databases. Here is how the author and title information can be captured in each of these four types of databases:

  • Document. Instead of taking a business subject and breaking it up into multiple relational structures, document databases frequently store the business subject in one structure called a “document.” For example, instead of storing title and author information in two distinct relational structures, title, author, and other title-related information can all be stored in a single document called Title. This is similar to a search you would do in a library for a particular book, where both title and author information appear together. We have everything to do with the title, including authors and subject, in the same document. It is all in one place, and we do not have to join to separate places to get everything we need. Document-oriented is much more application focused as opposed to table oriented, which is more data focused. MongoDB is a document-based database.
  • Key-value. Key-value databases allow the application to store its data in only two columns (“key” and “value”), with more complex information sometimes stored within the “value” columns such as Subjects in our example above. So instead of doing the work ahead of time to determine whether we have authors or titles, we can add each type of information (Title Name, Author Name, etc.) as a key and then add the value assigned to it. Key-value databases include Dynamo, Cache, and Project Voldemort.
  • Column-oriented. Out of the four types of NoSQL databases, column-oriented is closest to the RDBMS. Both have a similar way of looking at data as rows and values. The difference, though, is that RDBMSs work with a predefined structure and simple data types, such as amounts and dates, whereas column-oriented databases, such as Cassandra, can work with more complex data types including unformatted text and imagery. This data can also be defined on the fly. So in our author/title example, we can have a title and the title has an author along with a complex data type called Title Characteristics, which contains the language, format, and subjects, where subjects is another complex data type (an array).
  • Graph. This kind of database is designed for data whose relations are well represented as a set of nodes with an undetermined number of connections between these nodes. Examples where a graph database can work best are social relations (where nodes are people), public transport links (where nodes could be bus or train stations), or road maps (where nodes could be street intersections or highway exits). Often requirements lead to traversing the graph to find the shortest routes, nearest neighbors, etc., all of which can be complex and time consuming to navigate with a traditional RDBMS (usually requiring recursion, which is not for the faint of heart and will be discussed in the next chapter). In our author example above, there are an indeterminate number of ways authors can be related with each other such as contributors, co-authors, book reviewers, etc. Graph databases include Neo4j, Allegro, and Virtuoso.
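
The four varieties above can be compared side by side by writing the same author and title information in each shape. The sketch below uses plain JavaScript structures; all names and values (“Sample Title,” “Author One,” the key format, the WROTE and REVIEWED edge types) are hypothetical, and each structure is a simplification rather than the storage format of any particular product.

```javascript
// Document: one self-contained structure per title.
const titleDocument = {
  titleName: "Sample Title",
  authors: [{ authorName: "Author One" }, { authorName: "Author Two" }],
  subjects: ["databases", "modeling"],
};

// Key-value: values are looked up by key; the store does not interpret them.
const keyValueStore = new Map([
  ["title:1:name", "Sample Title"],
  ["title:1:authors", "Author One;Author Two"],
]);

// Column-oriented (simplified): a row key with columns that may hold complex types.
const columnRow = {
  rowKey: "title:1",
  columns: {
    titleName: "Sample Title",
    author: "Author One",
    titleCharacteristics: {
      language: "English",
      format: "Print",
      subjects: ["databases", "modeling"],
    },
  },
};

// Graph: nodes plus typed connections between them.
const graph = {
  nodes: ["Author One", "Author Two", "Sample Title"],
  edges: [
    { from: "Author One", to: "Sample Title", type: "WROTE" },
    { from: "Author Two", to: "Sample Title", type: "WROTE" },
    { from: "Author One", to: "Author Two", type: "REVIEWED" },
  ],
};

// Traversing the graph: who wrote a given title?
function authorsOf(title) {
  return graph.edges
    .filter((edge) => edge.type === "WROTE" && edge.to === title)
    .map((edge) => edge.from);
}

const writers = authorsOf("Sample Title"); // ["Author One", "Author Two"]
```

Notice that the document keeps everything about the title together, while the graph makes the author-to-author connection a first-class piece of data.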

MongoDB Is a Document-Oriented NoSQL Database

Since you’re reading this book, you’ve probably already chosen to implement your database in MongoDB. MongoDB is known for high performance, high availability, and low cost because of these four properties:

  1. Document-oriented. Instead of taking a business subject and breaking it up into multiple relational structures, MongoDB can store the business subject in the minimal number of documents. For example, instead of storing title and author information in two distinct relational structures, title, author, and other title-related information can all be stored in a single document called Book, which is much more intuitive and usually easier to work with.
  2. Extremely extensible. The document-oriented approach makes it possible to represent complex hierarchical relationships in one place. You do not need to define schemas ahead of time; instead, you can define new types of data (called “fields”) as you add the actual data. This is very different from building traditional relational databases where you need to define the structure before it is populated with data. In MongoDB, you can populate and define the structure at the same time. This makes development take less time and allows the project team to easily experiment with different solutions and then choose the best one. The figure below illustrates that with an RDBMS, you need to predefine the structure. In music, you need to write down the musical notes before you can play them. With MongoDB, however, you can define the structure of data and populate the data at the same time, much like composing music and playing it at the same time.

  3. Horizontally scalable. MongoDB was designed to scale out. Documents are relatively easy to partition across multiple servers. MongoDB can automatically balance data across servers and redistribute documents, automatically routing user requests to the correct machines. When more capacity is needed, new machines can be added, and MongoDB will figure out how the existing data should be allocated to them. This figure illustrates the systems architecture difference:

  4. Lots of drivers! A driver is a translator between a program and a platform (the platform in this case being MongoDB). MongoDB has an official set of drivers whose core functions, such as find() and update(), are described in Chapter 4. In addition, the development community maintains many other drivers that support various languages, such as Ruby, Python, and C++. The full list of drivers appears at http://docs.mongodb.org/ecosystem/drivers/. This allows developers to use their language of choice and leverage their existing skills to build MongoDB applications even faster and more efficiently.
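
The idea behind the third property, spreading existing data onto newly added machines, can be sketched as follows. This is a toy model under my own assumptions, not MongoDB's actual balancing logic: the rebalance function, the server names, and the rule of moving one piece of data at a time from the fullest server to the emptiest are all hypothetical.

```javascript
// Move chunks from the fullest server to the emptiest until
// the chunk counts differ by at most one. Returns the number of moves.
function rebalance(servers) {
  let moves = 0;
  while (true) {
    const sorted = [...servers].sort((a, b) => a.chunks.length - b.chunks.length);
    const emptiest = sorted[0];
    const fullest = sorted[sorted.length - 1];
    if (fullest.chunks.length - emptiest.chunks.length <= 1) return moves;
    emptiest.chunks.push(fullest.chunks.pop());
    moves += 1;
  }
}

const servers = [
  { name: "server-a", chunks: ["c1", "c2", "c3", "c4"] },
  { name: "server-b", chunks: ["c5", "c6"] },
];

// More capacity is needed, so a new, empty machine joins the cluster.
servers.push({ name: "server-c", chunks: [] });
const moves = rebalance(servers); // 2 moves leave every server with 2 chunks
```

No data is lost or duplicated along the way. The six chunks are simply spread evenly once the new server is available.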

Installing MongoDB

The MongoDB installation guide, http://docs.mongodb.org/manual/installation/, has well-written instructions on how to install MongoDB on the various operating systems. I followed these instructions and downloaded the Windows executable from http://www.mongodb.org/downloads. After the file finished downloading, I double-clicked on the executable and followed the steps for installation.

After installation was complete, I set up a default directory where all of the data can be stored. To do this, I right-clicked on the command prompt icon and chose “Run as Administrator,” and then typed this statement to set up the default directory:

mkdir \data\db

Once you set up your default directory, you can start the MongoDB server. If you still have the command prompt window open after creating the default directory, type in this statement to start the MongoDB server:

C:\mongodb\bin\mongod.exe

If you need to open the command prompt window, always right-click on the command prompt icon and choose “Run as Administrator.” I ran the above statement with no parameters, and therefore the MongoDB server will use the default data directory, \data\db.

You can run both the client and server on the same machine. To run a client, just open another command prompt window (also with “Run as Administrator”) and initiate the client by typing:

C:\mongodb\bin\mongo.exe

You now have both your MongoDB server and client running on your machine! This whole process probably took ten minutes. Amazingly easy.

A great resource for learning MongoDB is https://university.mongodb.com.

Note that the MongoDB shell is a full-featured JavaScript interpreter. You can therefore run any JavaScript commands such as:

> y = 100

100

> y / 20

5

You can also use all of the standard JavaScript libraries and functions.

When your MongoDB statement spans more than one line, press <Enter> to go to the next line. The MongoDB client knows whether the statement is complete or not, and if the statement is not complete, the client will allow you to continue writing it on the next line. Pressing <Enter> three times will cancel the statement and get you back to the > prompt.
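
Here is a small example of a statement that spans several lines and uses a built-in JavaScript object. It is ordinary JavaScript written in an older style (var and function) so that it should also run in other JavaScript interpreters; the averagePrice function and the sample prices are hypothetical.

```javascript
// A function definition spans several lines; the statement is
// complete only when the closing brace is entered.
function averagePrice(prices) {
  var total = 0;
  for (var i = 0; i < prices.length; i++) {
    total += prices[i];
  }
  return total / prices.length;
}

var average = averagePrice([10, 20, 30]); // 20
var rounded = Math.round(7.6);            // 8, using the built-in Math object
```

Typing the variable name on its own line afterward, such as average, displays its value in the same way that y / 20 did above.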

To stop the MongoDB server, press Ctrl-C in the window that is running the server. To stop the MongoDB client, press Ctrl-C in the window that is running the client.

Key Points

  • NoSQL is a name for the category of databases built on non-relational technology.
  • There are four main differences between traditional relational databases and NoSQL databases: variety, structure, scaling, and focus.
  • With RDBMS, the focus is on data integrity. With NoSQL, the focus is on data performance and availability.
  • Document databases frequently store the business subject in one structure called a “document.”
  • Key-value databases allow the application to store its data in only two columns (“key” and “value”), with more complex information sometimes stored within the “value” columns.
  • Column-oriented databases work with more complex data types such as unformatted text and imagery, and this data can also be defined on the fly.
  • A graph database is designed for data whose relations are well represented as a set of nodes with an undetermined number of connections between those nodes.
  • MongoDB is known for high performance, high availability, and low cost, because of these four properties: document-oriented, extremely extensible, horizontally scalable, and the availability of many drivers.
  • It is easy to install MongoDB; both the client and server can be installed on the same machine.
  • A great site for learning MongoDB is https://university.mongodb.com.