In Chapter 5, Scala and SQL through JDBC, and Chapter 6, Slick – A Functional Interface for SQL, you learned how to insert, transform, and read data in SQL databases. These databases remain (and are likely to remain) very popular in data science, but NoSQL databases are emerging as strong contenders.
Data storage needs are growing rapidly. Companies are producing and storing more data points in the hope of acquiring better business intelligence. They are also building increasingly large teams of data scientists, who all need to access the data store. Maintaining constant access time as the data load increases requires taking advantage of parallel architectures: we need to distribute the database across several computers so that, as the load on the server increases, we can just add more machines to improve throughput.
In relational databases such as MySQL, the data is naturally split across different tables, and complex queries require joining across several of them. This makes partitioning the database across different computers difficult, because the rows that a join brings together may end up on different machines. NoSQL databases emerged to fill this gap.
In this chapter, you will learn to interact with MongoDB, an open source database that offers high performance and can be distributed easily. MongoDB is one of the more popular NoSQL databases with a strong community. It offers a reasonable balance of speed and flexibility, making it a natural alternative to SQL for storing large datasets with uncertain query requirements, as might happen in data science. Many of the concepts and recipes in this chapter will apply to other NoSQL databases.
MongoDB is a document-oriented database. It contains collections of documents. Each document is a JSON-like object:
{
    _id: ObjectId("558e846730044ede70743be9"),
    name: "Gandalf",
    age: 2000,
    pseudonyms: [ "Mithrandir", "Olorin", "Greyhame" ],
    possessions: [
        { name: "Glamdring", type: "sword" },
        { name: "Narya", type: "ring" }
    ]
}
Just as in JSON, a document is a set of key-value pairs, where the values can be strings, numbers, Booleans, dates, arrays, or subdocuments. Documents are grouped in collections, and collections are grouped in databases.
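To make the shape of a document concrete, the following is a minimal sketch that models the Gandalf document in plain Scala, without any MongoDB driver. The names DocumentSketch, Document, and possessionNames are hypothetical and chosen for illustration only, and the _id is stored as a plain string here rather than as a real ObjectId.

```scala
// Plain-Scala model of the Gandalf document; no MongoDB driver is involved.
object DocumentSketch {
  // A document is a set of key-value pairs whose values may themselves be nested.
  type Document = Map[String, Any]

  val gandalf: Document = Map(
    "_id" -> "558e846730044ede70743be9", // assumption: a plain string stands in for ObjectId
    "name" -> "Gandalf",
    "age" -> 2000,
    "pseudonyms" -> List("Mithrandir", "Olorin", "Greyhame"), // array value
    "possessions" -> List(                                    // array of subdocuments
      Map("name" -> "Glamdring", "type" -> "sword"),
      Map("name" -> "Narya", "type" -> "ring")
    )
  )

  // Values are typed as Any, so reading nested fields requires pattern matching.
  def possessionNames(doc: Document): List[String] =
    doc.get("possessions") match {
      case Some(items: List[_]) =>
        items
          .collect { case sub: Map[_, _] => sub.asInstanceOf[Map[Any, Any]].get("name") }
          .collect { case Some(name: String) => name }
      case _ => Nil
    }
}
```

Calling DocumentSketch.possessionNames(DocumentSketch.gandalf) walks the array of subdocuments and collects their name fields. The pattern matching needed to get there hints at why type safety deserves attention when working with loosely structured documents.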
You might be thinking that this is not very different from SQL: a document is similar to a row and a collection corresponds to a table. There are two important differences:
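First, the values in a document can be arrays or subdocuments, which lets a single collection encode one-to-many relationships. In SQL, storing the pseudonyms of each wizard would require a separate wizard2pseudonym table with a row for each wizard-pseudonym pair. In MongoDB, we can just use an array. In practice, this means that we can normally use a single document to represent an entity (a customer, transaction, or wizard, for instance). In SQL, we would normally have to join across several tables to retrieve all the information on a specific entity.
Second, MongoDB does not enforce a fixed schema on a collection. This makes it easy to adjust the structure of the data, as there is no need for ALTER TABLE statements. The downside is that there is no easy way of enforcing our flexible schema on the database side.
Note the _id field: this is a unique key. MongoDB will generate one automatically if we insert a document without an _id field.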
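The contrast between a wizard2pseudonym join table and an embedded array can be sketched in plain Scala. This is an in-memory illustration only, not database code. The case classes, the EmbeddingSketch object, and the toDocuments function are hypothetical names, and the second wizard (Saruman, with the pseudonym Curunir) is made-up sample data that does not appear in the collection discussed above.

```scala
object EmbeddingSketch {
  // Relational layout: one row per wizard, plus one row per wizard-pseudonym pair.
  case class WizardRow(id: Int, name: String)
  case class Wizard2PseudonymRow(wizardId: Int, pseudonym: String)

  // Document layout: the pseudonyms are stored inside the wizard itself.
  case class WizardDocument(name: String, pseudonyms: List[String])

  val wizards = List(
    WizardRow(1, "Gandalf"),
    WizardRow(2, "Saruman") // illustrative data, not from the text
  )

  val wizard2pseudonym = List(
    Wizard2PseudonymRow(1, "Mithrandir"),
    Wizard2PseudonymRow(1, "Olorin"),
    Wizard2PseudonymRow(1, "Greyhame"),
    Wizard2PseudonymRow(2, "Curunir") // illustrative data, not from the text
  )

  // In-memory equivalent of joining the two tables and grouping by wizard:
  // this is the work a relational query does to reassemble one entity.
  def toDocuments(
      ws: List[WizardRow],
      links: List[Wizard2PseudonymRow]
  ): List[WizardDocument] = {
    val byWizard = links.groupBy(_.wizardId)
    ws.map { w =>
      WizardDocument(w.name, byWizard.getOrElse(w.id, Nil).map(_.pseudonym))
    }
  }
}
```

With the document layout, each WizardDocument already holds everything about its wizard, so no equivalent of toDocuments is needed at read time. That is the practical meaning of representing an entity with a single document.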
This chapter gives recipes for interacting with a MongoDB database from Scala, including maintaining type safety and best practices. We will not cover advanced MongoDB functionality (such as aggregation or distributing the database). We will assume that you have MongoDB installed on your computer (http://docs.mongodb.org/manual/installation/). It will also help to have a very basic knowledge of MongoDB (we discuss some references at the end of this chapter, but any basic tutorial available online will be sufficient for the needs of this chapter).