Blaze can abstract many different data structures and expose a single, easy-to-use API. This helps to get a consistent behavior and reduce the need to learn multiple interfaces to handle data. If you know pandas, there is not really that much to learn, as the differences in the syntax are subtle. We will go through some examples to illustrate this.
Getting data from a NumPy array into a Blaze DataShape object is extremely easy. First, let's create a simple NumPy array: we load NumPy and then create a matrix with two rows and three columns:
import numpy as np
import blaze as bl  # needed for bl.Data(...) below

simpleArray = np.array([[1, 2, 3], [4, 5, 6]])
Now that we have an array, we can abstract it with Blaze's DataShape structure:
simpleData_np = bl.Data(simpleArray)
That's it! Simple enough.
In order to peek inside the structure, you can use the .peek() method:
simpleData_np.peek()
You should see an output similar to what is shown in the following screenshot:
You can also use the .head(...) method, which will be familiar to those of you versed in pandas' syntax.
If you want to retrieve the first column of your DataShape, you can use indexing:
simpleData_np[0]
You should see a table, as shown here:
On the other hand, if you were interested in retrieving a row, all you would have to do (like in NumPy) is transpose your DataShape:
simpleData_np.T[0]
What you will then get is presented in the following figure:
Notice that the name of the column is None. DataShapes, just like pandas' DataFrames, support named columns. Thus, let's specify the names of our fields:
simpleData_np = bl.Data(simpleArray, fields=['a', 'b', 'c'])
Now you can retrieve the data simply by calling the column by its name:
simpleData_np['b']
In return, you will get the following output:
As you can see, defining the fields transposes the NumPy array: each element of the array now forms a row, unlike when we first created simpleData_np.
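The row-versus-column orientation described above can be sketched in plain Python (without Blaze): once each value is paired with a field name, a named column is simply the corresponding value taken from every row.

```python
# Plain-Python sketch of named fields; the data mirrors simpleArray above.
rows = [[1, 2, 3], [4, 5, 6]]
fields = ['a', 'b', 'c']

# Each original row becomes a record keyed by field name,
# so column 'b' is the second value of every row.
records = [dict(zip(fields, row)) for row in rows]
column_b = [record['b'] for record in records]
print(column_b)  # → [2, 5]
```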
Since pandas' DataFrame internally uses NumPy data structures, translating a DataFrame to a DataShape is effortless.
First, let's create a simple DataFrame. We start by importing pandas:
import pandas as pd
Next, we create a DataFrame:
simpleDf = pd.DataFrame([ [1,2,3], [4,5,6] ], columns=['a','b','c'])
We then transform it into a DataShape:
simpleData_df = bl.Data(simpleDf)
You can retrieve data in the same manner as with the DataShape created from the NumPy array. Use the following command:
simpleData_df['a']
A DataShape object can be created directly from a .csv file. In this example, we will use a dataset of 404,536 traffic violations that occurred in Montgomery County, Maryland.
We downloaded the data from https://catalog.data.gov/dataset/traffic-violations-56dda on 8/23/16; the dataset is updated daily, so the number of traffic violations might differ if you retrieve the dataset at a later date.
We store the dataset locally in the ../Data folder. However, we modified the dataset slightly so we could store it in MongoDB: in its original form, with date columns, reading the data back from MongoDB caused errors. We filed a bug with Blaze to fix this issue: https://github.com/blaze/blaze/issues/1580:
traffic = bl.Data('../Data/TrafficViolations.csv')
If you do not know the names of the columns in any dataset, you can get these from the DataShape. To get a list of all the fields, you can use the following command:
print(traffic.fields)
Those of you familiar with pandas will easily recognize the similarity between the .fields and .columns attributes, as these work in essentially the same way: both return the list of columns (in the case of a pandas DataFrame) or the list of fields, as columns are called in the case of a Blaze DataShape.
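To make the analogy concrete, here is the pandas side of it, using the same small DataFrame we built earlier; Blaze's .fields returns the same kind of list of names.

```python
# pandas-only sketch: .columns lists the column names,
# just as .fields does for a Blaze DataShape.
import pandas as pd

simpleDf = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
print(list(simpleDf.columns))  # → ['a', 'b', 'c']
```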
Blaze can also read directly from a GZipped archive, saving space:
traffic_gz = bl.Data('../Data/TrafficViolations.csv.gz')
To validate that we get exactly the same data, let's retrieve the first two records from each structure. You can either call the following:
traffic.head(2)
Or you can choose to call:
traffic_gz.head(2)
It produces the same results (columns abbreviated here):
It is easy to notice, however, that it takes significantly more time to retrieve the data from the archived file because Blaze needs to decompress the data.
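The extra cost comes from the decompression pass that has to happen before any parsing. A minimal stdlib-only sketch of that pipeline (using an in-memory archive rather than the traffic dataset) looks like this:

```python
# Stdlib sketch of reading a gzipped CSV: the bytes must be
# decompressed before the CSV parser ever sees them.
import csv
import gzip
import io

csv_text = 'a,b,c\n1,2,3\n4,5,6\n'
gz_bytes = gzip.compress(csv_text.encode('utf-8'))

# gzip.open adds the decompression step on top of normal CSV parsing.
with gzip.open(io.BytesIO(gz_bytes), mode='rt') as handle:
    rows = list(csv.reader(handle))
print(rows[1])  # → ['1', '2', '3']
```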
You can also read from multiple files at once and create one big dataset. To illustrate this, we have split the original dataset into four GZipped datasets by year of violation (these are stored in the ../Data/Years folder).
Blaze uses odo to handle saving DataShapes to a variety of formats. To save the traffic violations data by year, you can call odo like this:
import odo

for year in traffic.Stop_year.distinct().sort():
    odo.odo(traffic[traffic.Stop_year == year],
            '../Data/Years/TrafficViolations_{0}.csv.gz'.format(year))
The preceding instruction saves the data into GZip archives, but you can save it to any of the formats mentioned earlier. The first argument to the odo.odo(...) call specifies the input object (in our case, the DataShape with the traffic violations that occurred in a given year), and the second argument is the output object: the path to the file we want to save the data to. As we are about to learn, storing data is not limited to files only.
To read from multiple files, you can use the asterisk character *:
traffic_multiple = bl.Data(
    '../Data/Years/TrafficViolations_*.csv.gz')
traffic_multiple.head(2)
The preceding snippet, once again, will produce a familiar table:
Blaze's reading capabilities are not limited to .csv or GZip files: you can also read data from JSON or Excel files (both .xls and .xlsx), HDFS, or bcolz-formatted files.
To learn more about the bcolz format, check its documentation at https://github.com/Blosc/bcolz.
Blaze can also easily read from SQL databases such as PostgreSQL or SQLite. While SQLite would normally be a local database, PostgreSQL can be run either locally or on a server.
Blaze, as mentioned earlier, uses odo in the background to handle the communication to and from the databases. odo is one of the requirements for Blaze and it gets installed along with the package. Check it out at https://github.com/blaze/odo.
In order to execute the code in this section, you will need two things: a running local instance of a PostgreSQL database, and a locally running MongoDB database.
In order to install PostgreSQL, download the package from http://www.postgresql.org/download/ and follow the installation instructions for your operating system found there.
To install MongoDB, go to https://www.mongodb.org/downloads and download the package; the installation instructions can be found here http://docs.mongodb.org/manual/installation/.
Before you proceed, we assume that you have a PostgreSQL database up and running at localhost:5432, and a MongoDB database running at localhost:27017.
We have already loaded the traffic data into both databases and stored it in the traffic table (PostgreSQL) and the traffic collection (MongoDB).
If you do not know how to upload your data, I have explained this in my other book https://www.packtpub.com/big-data-and-business-intelligence/practical-data-analysis-cookbook.
Let's read the data from the PostgreSQL database now. The Uniform Resource Identifier (URI) for accessing a PostgreSQL database has the following syntax: postgresql://<user_name>:<password>@<server>:<port>/<database>::<table>.
To read the data from PostgreSQL, you just pass the URI to .Data(...) and Blaze will take care of the rest:
traffic_psql = bl.Data(
    'postgresql://{0}:{1}@localhost:5432/drabast::traffic'
    .format('<your_username>', '<your_password>')
)
We use Python's .format(...) method to fill in the string with the appropriate data.
Substitute your credentials to access your PostgreSQL database in the previous example. If you want to read more about the .format(...) method, you can check out the Python 3.5 documentation: https://docs.python.org/3/library/string.html#format-string-syntax.
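As a quick, self-contained illustration of how the URI string is assembled, here is the same .format(...) call with hypothetical placeholder credentials (alice / s3cret) substituted in:

```python
# Sketch only: 'alice' and 's3cret' are made-up placeholder credentials.
uri_template = 'postgresql://{0}:{1}@localhost:5432/drabast::traffic'
uri = uri_template.format('alice', 's3cret')
print(uri)  # → postgresql://alice:s3cret@localhost:5432/drabast::traffic
```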
It is quite easy to output the data to either the PostgreSQL or SQLite databases. In the following example, we will output traffic violations that involved cars manufactured in 2016 to both PostgreSQL and SQLite databases. As previously noted, we will use odo to manage the transfers:
traffic_2016 = traffic_psql[traffic_psql['Year'] == 2016]

# Drop commands
# odo.drop('sqlite:///traffic_local.sqlite::traffic2016')
# odo.drop('postgresql://{0}:{1}@localhost:5432/drabast::traffic2016'
#     .format('<your_username>', '<your_password>'))

# Save to SQLite
odo.odo(traffic_2016,
        'sqlite:///traffic_local.sqlite::traffic2016')

# Save to PostgreSQL
odo.odo(traffic_2016,
        'postgresql://{0}:{1}@localhost:5432/drabast::traffic2016'
        .format('<your_username>', '<your_password>'))
In a similar fashion to pandas, to filter the data we select the Year column (the traffic_psql['Year'] part of the first line) and create a Boolean flag by checking whether each record in that column equals 2016. By indexing the traffic_psql object with such a truth vector, we extract only the records where the corresponding value is True.
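The Boolean-flag mechanism just described can be sketched in plain Python, without Blaze: build a truth vector from the year column, then keep only the records where the flag is True.

```python
# Plain-Python sketch of Boolean-mask filtering; the data is made up.
years = [2014, 2016, 2016, 2011]
records = ['rec0', 'rec1', 'rec2', 'rec3']

# Truth vector: one flag per record.
mask = [year == 2016 for year in years]   # → [False, True, True, False]

# Keep only the records whose flag is True.
selected = [rec for rec, keep in zip(records, mask) if keep]
print(selected)  # → ['rec1', 'rec2']
```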
The two commented-out lines should be uncommented if you already have the traffic2016 tables in your databases; otherwise, odo will append the data to the end of the table.
The URI for SQLite is slightly different from the one for PostgreSQL; it has the following syntax: sqlite:///<relative/path/to/db.sqlite>::<table_name>.
Reading data from the SQLite database should be trivial for you by now:
traffic_sqlt = bl.Data(
    'sqlite:///traffic_local.sqlite::traffic2016'
)
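If you are curious what sits behind such a URI, the standard library's sqlite3 module can show it; this sketch builds an in-memory stand-in for the traffic2016 table with a single made-up row, rather than touching the real traffic_local.sqlite file:

```python
# Stdlib sqlite3 sketch of the table a sqlite://...::traffic2016
# URI points at; the row here is invented for illustration.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE traffic2016 (make TEXT, year INTEGER)')
conn.execute("INSERT INTO traffic2016 VALUES ('Honda', 2016)")
rows = conn.execute('SELECT make, year FROM traffic2016').fetchall()
print(rows)  # → [('Honda', 2016)]
```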
MongoDB has gained a lot of popularity over the years. It is a simple, fast, and flexible document-based database, and a go-to storage solution for full-stack developers using the MEAN.js stack, where M stands for Mongo (see http://meanjs.org).
Since Blaze is meant to work in a very familiar way no matter what your data source is, reading from MongoDB is very similar to reading from PostgreSQL or SQLite databases:
traffic_mongo = bl.Data(
    'mongodb://localhost:27017/packt::traffic'
)