Our world is complex, and no single approach solves every problem. Likewise, in the data world, no single piece of technology solves every problem.
Nowadays, any big technology company uses (in one form or another) a MapReduce paradigm to sift through terabytes (or even petabytes) of data collected daily. On the other hand, it is much easier to store, retrieve, extend, and update information about products in a document-type database (such as MongoDB) than it is in a relational database. Yet, persisting transaction records in a relational database aids later data summarizing and reporting.
Even these simple examples show that solving a vast array of business problems requires adapting to different technologies. This means that you, as a database manager, data scientist, or data engineer, would have to learn each of these technologies separately in order to solve your problems with the tools designed for them. That, however, does not make your company agile: it is error-prone and leaves your system needing a lot of tweaking and hacking.
Blaze abstracts most of these technologies and exposes a simple and elegant data structure and API.
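To make the idea of a single API over several storage technologies more concrete, here is a minimal sketch that uses only the Python standard library. It is not Blaze's API and does not reflect how Blaze is implemented. The names ListSource, SQLiteSource, and total are hypothetical and exist only for this illustration, as does the sample sales data.

```python
import sqlite3


class ListSource:
    """Hypothetical wrapper around an in-memory list of dictionaries."""

    def __init__(self, records):
        self._records = records

    def rows(self):
        # Yield each record as a dictionary.
        return iter(self._records)


class SQLiteSource:
    """Hypothetical wrapper around a table in a SQLite database."""

    def __init__(self, connection, table):
        self._connection = connection
        self._table = table

    def rows(self):
        # Read every row and convert it to a dictionary keyed by column name.
        cursor = self._connection.execute(f'SELECT * FROM "{self._table}"')
        columns = [description[0] for description in cursor.description]
        for record in cursor:
            yield dict(zip(columns, record))


def total(source, field):
    """Sum one field; works with any object that exposes rows()."""
    return sum(row[field] for row in source.rows())


# Made-up sample data held in a plain Python list.
in_memory = ListSource([
    {"product": "a", "amount": 10},
    {"product": "b", "amount": 5},
])

# Made-up sample data held in an in-memory relational table.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE sales (product TEXT, amount INTEGER)")
connection.executemany(
    "INSERT INTO sales VALUES (?, ?)", [("a", 7), ("c", 3)]
)
relational = SQLiteSource(connection, "sales")

# The same function call works regardless of where the data lives.
list_total = total(in_memory, "amount")
table_total = total(relational, "amount")
print(list_total, table_total)
```

The calling code never needs to know which backend holds the data. That separation between what you ask for and where the data is stored is the kind of convenience this chapter attributes to Blaze.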
In this chapter, you will learn:
If you run Anaconda, it is easy to install Blaze. Just issue the following command in your CLI (see Bonus Chapter 1, Installing Spark, if you do not know what a CLI is):
conda install blaze
Once the command is issued, you will see a screen similar to the following screenshot:
We will later use Blaze to connect to the PostgreSQL and MongoDB databases, so we need to install some additional packages that Blaze will use in the background.
We will install SQLAlchemy and PyMongo, both of which are part of Anaconda:
conda install sqlalchemy
conda install pymongo
All that is now left to do is to import Blaze itself in our notebook:
import blaze as bl
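If the import fails, it can help to confirm that the three packages installed above are visible to the Python environment the notebook is using. The following is a minimal sketch that relies only on the standard library; the helper name check_packages is hypothetical, and it assumes the import names blaze, sqlalchemy, and pymongo match the conda packages installed earlier.

```python
import importlib.util


def check_packages(names):
    """Return a dict mapping each top-level package name to True if importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Import names assumed to correspond to the conda packages installed above.
status = check_packages(["blaze", "sqlalchemy", "pymongo"])
for name, available in status.items():
    print(name, "found" if available else "missing")
```

Any package reported as missing was most likely installed into a different environment than the one running the notebook.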