Blaze is an open source Python library, primarily developed by Continuum.io, that builds on NumPy arrays and Pandas DataFrames. Blaze extends to out-of-core computing, whereas Pandas and NumPy are bound to a single machine's memory.
Blaze offers an adaptable, unified, and consistent user interface across various backends, orchestrating both the data and the expressions that compute on it.
Blaze expressions are lazily evaluated and, in that respect, share a similar processing paradigm with Spark RDD transformations.
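Lazy evaluation can be illustrated with a toy expression class (a hypothetical sketch, not Blaze's actual implementation): each operation only records itself, and no work happens until an explicit evaluation call, just as with Blaze expressions or Spark transformations.

```python
# A toy lazy-expression sketch (hypothetical, NOT Blaze's real API):
# each method records a step; work happens only at compute().
class LazyExpr:
    def __init__(self, data, steps=None):
        self.data = data
        self.steps = steps or []

    def map(self, fn):
        # Record the transformation; do not run it yet.
        return LazyExpr(self.data, self.steps + [("map", fn)])

    def filter(self, pred):
        return LazyExpr(self.data, self.steps + [("filter", pred)])

    def compute(self):
        # Only now is the recorded pipeline actually executed.
        out = self.data
        for kind, fn in self.steps:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

expr = LazyExpr([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
# Nothing has run yet; compute() triggers the whole pipeline.
print(expr.compute())  # [20, 30, 40]
```

Building `expr` is cheap no matter how large the data is; the cost is paid once, at `compute()` time, which is what lets Blaze push the work down to an out-of-core backend.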
Let's dive into Blaze by first importing the necessary libraries: numpy, pandas, blaze, and odo. Odo is a spin-off of Blaze and ensures data migration between various backends. The commands are as follows:
import numpy as np
import pandas as pd
from blaze import Data, by, join, merge
from odo import odo

BokehJS successfully loaded.
We create a Pandas DataFrame by reading the parsed tweets saved in a CSV file, twts_csv:
twts_pd_df = pd.DataFrame(twts_csv_read, columns=Tweet01._fields)
twts_pd_df.head()

Out[65]:
                   id           created_at   user_id      user_name                                         tweet_text                                                url
1  598831111406510082  2015-05-14 12:43:57  14755521  raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L...  http://www.mango-solutions.com/wp/2015/05/the-...
2  598831111406510082  2015-05-14 12:43:57  14755521  raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L...  http://www.mango-solutions.com/wp/2015/05/the-...
3  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...  http://www.webex.com/ciscospark/
4  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...  http://sparkjava.com/
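The same construction pattern can be shown self-contained. The sketch below assumes a simplified stand-in for the chapter's Tweet01 namedtuple (the real field list comes from earlier in the chapter); the point is that the namedtuple's `_fields` supply the column names:

```python
from collections import namedtuple
import pandas as pd

# Simplified stand-in for the chapter's Tweet01 namedtuple (assumed fields).
Tweet01 = namedtuple('Tweet01', ['id', 'created_at', 'user_id',
                                 'user_name', 'tweet_text', 'url'])

rows = [
    Tweet01('598831111406510082', '2015-05-14 12:43:57', '14755521',
            'raulsaeztapia', 'RT @pacoid: Great recap of @StrataConf EU',
            'http://www.mango-solutions.com/'),
]

# Same pattern as twts_pd_df: the namedtuple's _fields name the columns.
df = pd.DataFrame(rows, columns=Tweet01._fields)
print(df.head())
```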
We pass the tweets Pandas DataFrame to the describe() function to get some overall information on the dataset:
twts_pd_df.describe()

Out[66]:
                        id           created_at   user_id      user_name                                         tweet_text                    url
count                   19                   19        19             19                                                 19                     19
unique                   7                    7         6              6                                                  6                      7
top     598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...  http://bit.ly/1Hfd0Xm
freq                     6                    6         9              9                                                  6                      6
We convert the Pandas DataFrame into a Blaze dataframe by simply passing it to the Data() function:
#
# Blaze dataframe
#
twts_bz_df = Data(twts_pd_df)
We can retrieve the schema representation of the Blaze dataframe through its schema attribute:
twts_bz_df.schema

Out[73]:
dshape("""{
  id: ?string,
  created_at: ?string,
  user_id: ?string,
  user_name: ?string,
  tweet_text: ?string,
  url: ?string
  }""")
The .dshape attribute gives the record count along with the schema:
twts_bz_df.dshape

Out[74]:
dshape("""19 * {
  id: ?string,
  created_at: ?string,
  user_id: ?string,
  user_name: ?string,
  tweet_text: ?string,
  url: ?string
  }""")
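For comparison, in plain pandas the same two pieces of information, record count and per-column types, come from `len()` and the `dtypes` attribute. A minimal sketch on a tiny stand-in frame:

```python
import pandas as pd

# Tiny stand-in frame; the real twts_pd_df has 19 rows and 6 columns.
df = pd.DataFrame({'id': ['598831111406510082'],
                   'user_name': ['raulsaeztapia']})

# A dshape-style summary by hand: record count plus column -> dtype mapping.
n_records = len(df)
schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
print(n_records, schema)
```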
We can print the Blaze dataframe content:
twts_bz_df.data

Out[75]:
                    id           created_at     user_id            user_name                                         tweet_text                                                url
1   598831111406510082  2015-05-14 12:43:57    14755521        raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L...  http://www.mango-solutions.com/wp/2015/05/the-...
2   598831111406510082  2015-05-14 12:43:57    14755521        raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L...  http://www.mango-solutions.com/wp/2015/05/the-...
...
18  598782970082807808  2015-05-14 09:32:39  1377652806  embeddedcomputer.nl  RT @BigDataTechCon: Moving Rating Prediction w...  http://buff.ly/1QBpk8J
19  598777933730160640  2015-05-14 09:12:38   294862170       Ellen Friedman  I'm still on Euro time. If you are too check o...  http://bit.ly/1Hfd0Xm
We extract the column tweet_text and take the unique values:
twts_bz_df.tweet_text.distinct()

Out[76]:
   tweet_text
0  RT @pacoid: Great recap of @StrataConf EU in L...
1  RT @alvaroagea: Simply @ApacheSpark http://t.c...
2  RT @PrabhaGana: What exactly is @ApacheSpark a...
3  RT @Ellen_Friedman: I'm still on Euro time. If...
4  RT @BigDataTechCon: Moving Rating Prediction w...
5  I'm still on Euro time. If you are too check o...
We extract multiple columns ['id', 'user_name', 'tweet_text'] from the dataframe and take the unique records:
twts_bz_df[['id', 'user_name', 'tweet_text']].distinct()

Out[78]:
                   id            user_name                                         tweet_text
0  598831111406510082        raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L...
1  598808944719593472        raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...
2  598796205091500032       John Humphreys  RT @PrabhaGana: What exactly is @ApacheSpark a...
3  598788561127735296   Leonardo D'Ambrosi  RT @Ellen_Friedman: I'm still on Euro time. If...
4  598785545557438464      Alexey Kosenkov  RT @Ellen_Friedman: I'm still on Euro time. If...
5  598782970082807808  embeddedcomputer.nl  RT @BigDataTechCon: Moving Rating Prediction w...
6  598777933730160640       Ellen Friedman  I'm still on Euro time. If you are too check o...
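In plain pandas, both of these distinct operations, on a single column and on a column subset, map onto `drop_duplicates()`. A sketch on a tiny stand-in frame (not the chapter's data):

```python
import pandas as pd

# Tiny stand-in frame with one duplicated record.
df = pd.DataFrame({
    'id': ['1', '1', '2'],
    'user_name': ['raulsaeztapia', 'raulsaeztapia', 'Ellen Friedman'],
    'tweet_text': ['RT ...', 'RT ...', "I'm still on Euro time."],
})

# Unique values of one column (Blaze: df.tweet_text.distinct()).
unique_texts = df['tweet_text'].drop_duplicates()

# Unique records over a column subset (Blaze: df[cols].distinct()).
unique_rows = df[['id', 'user_name', 'tweet_text']].drop_duplicates()
print(len(unique_texts), len(unique_rows))
```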
Odo is a spin-off project of Blaze that enables the interchange of data: it migrates data across formats (CSV, JSON, HDFS, and more) and across databases (SQL databases, MongoDB, and so on) using a single, very simple call:
odo(source, target)
To transfer to a database, the address is specified using a URL. For example, for a MongoDB database, it would look like this:
mongodb://username:password@hostname:port/database_name::collection_name
Let's run some examples of using Odo. Here, we illustrate odo by reading a CSV file and creating a Blaze dataframe:
filepath   = csvFpath
filename   = csvFname
filesuffix = csvSuffix
twts_odo_df = Data('{0}/{1}.{2}'.format(filepath, filename, filesuffix))
Count the number of records in the dataframe:
twts_odo_df.count()

Out[81]:
19
Display the five initial records of the dataframe:
twts_odo_df.head(5)

Out[82]:
                   id           created_at   user_id      user_name                                         tweet_text                                                url
0  598831111406510082  2015-05-14 12:43:57  14755521  raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L...  http://www.mango-solutions.com/wp/2015/05/the-...
1  598831111406510082  2015-05-14 12:43:57  14755521  raulsaeztapia  RT @pacoid: Great recap of @StrataConf EU in L...  http://www.mango-solutions.com/wp/2015/05/the-...
2  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...  http://www.webex.com/ciscospark/
3  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...  http://sparkjava.com/
4  598808944719593472  2015-05-14 11:15:52  14755521  raulsaeztapia  RT @alvaroagea: Simply @ApacheSpark http://t.c...  https://www.sparkfun.com/
Get the dshape information from the dataframe, which gives us the number of records and the schema:
twts_odo_df.dshape

Out[83]:
dshape("""var * {
  id: int64,
  created_at: ?datetime,
  user_id: int64,
  user_name: ?string,
  tweet_text: ?string,
  url: ?string
  }""")
Save a processed Blaze dataframe into JSON:
odo(twts_odo_distinct_df, '{0}/{1}.{2}'.format(jsonFpath, jsonFname, jsonSuffix))

Out[92]:
<odo.backends.json.JSONLines at 0x7f77f0abfc50>
Convert a JSON file to a CSV file:
odo('{0}/{1}.{2}'.format(jsonFpath, jsonFname, jsonSuffix), '{0}/{1}.{2}'.format(csvFpath, csvFname, csvSuffix))

Out[94]:
<odo.backends.csv.CSV at 0x7f77f0abfe10>
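Conceptually, this JSON-to-CSV migration amounts to streaming records out of one format and into the other. A minimal standard-library sketch of the idea (an illustration of the principle, not odo's actual code), going from JSON lines, as in odo's JSONLines backend, to CSV:

```python
import csv
import io
import json

# One JSON record per line, as in odo's JSONLines backend.
json_lines = '\n'.join([
    json.dumps({'id': '598777933730160640', 'user_name': 'Ellen Friedman'}),
    json.dumps({'id': '598831111406510082', 'user_name': 'raulsaeztapia'}),
])

# Parse each line into a record (a dict).
records = [json.loads(line) for line in json_lines.splitlines()]

# Write the records out as CSV with a header row.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
writer.writeheader()
writer.writerows(records)
print(buf.getvalue())
```

With real files, the `io.StringIO` buffer would simply be replaced by `open(...)` handles on the source and target paths; odo's value is dispatching this kind of conversion automatically from the source and target types.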