Exploring data using Blaze

Blaze is an open source Python library, primarily developed by Continuum.io, leveraging Python Numpy arrays and Pandas dataframe. Blaze extends to out-of-core computing, while Pandas and Numpy are single-core.

Blaze offers an adaptable, unified, and consistent user interface across various backends. Blaze orchestrates the following:

  • Data: Seamless exchange of data across storages such as CSV, JSON, HDF5, HDFS, and Bcolz files.
  • Computation: Using the same query processing against computational backends such as Spark, MongoDB, Pandas, or SQL Alchemy.
  • Symbolic expressions: Abstract expressions such as join, group-by, filter, selection, and projection with a syntax similar to Pandas but limited in scope. Implements the split-apply-combine methods pioneered by the R language.

Blaze expressions are lazily evaluated and in that respect share a similar processing paradigm with Spark RDDs transformations.

Let's dive into Blaze by first importing the necessary libraries: numpy, pandas, blaze and odo. Odo is a spin-off of Blaze and ensures data migration from various backends. The commands are as follows:

import numpy as np
import pandas as pd
from blaze import Data, by, join, merge
from odo import odo
BokehJS successfully loaded.

We create a Pandas Dataframe by reading the parsed tweets saved in a CSV file, twts_csv:

twts_pd_df = pd.DataFrame(twts_csv_read, columns=Tweet01._fields)
twts_pd_df.head()

Out[65]:
id    created_at    user_id    user_name    tweet_text    url
1   598831111406510082   2015-05-14 12:43:57   14755521 raulsaeztapia    RT @pacoid: Great recap of @StrataConf EU in L...   http://www.mango-solutions.com/wp/2015/05/the-...
2   598831111406510082   2015-05-14 12:43:57   14755521 raulsaeztapia    RT @pacoid: Great recap of @StrataConf EU in L...   http://www.mango-solutions.com/wp/2015/05/the-...
3   98808944719593472   2015-05-14 11:15:52   14755521 raulsaeztapia   RT @alvaroagea: Simply @ApacheSpark http://t.c...    http://www.webex.com/ciscospark/
4   598808944719593472   2015-05-14 11:15:52   14755521 raulsaeztapia   RT @alvaroagea: Simply @ApacheSpark http://t.c...   http://sparkjava.com/

We run the Tweets Panda Dataframe to the describe() function to get some overall information on the dataset:

twts_pd_df.describe()
Out[66]:
id    created_at    user_id    user_name    tweet_text    url
count  19  19  19  19  19  19
unique    7  7   6   6     6   7
top    598808944719593472    2015-05-14 11:15:52    14755521 raulsaeztapia    RT @alvaroagea: Simply @ApacheSpark http://t.c...    http://bit.ly/1Hfd0Xm
freq    6    6    9    9    6    6

We convert the Pandas dataframe into a Blaze dataframe by simply passing it through the Data() function:

#
# Blaze dataframe
#
twts_bz_df = Data(twts_pd_df)

We can retrieve the schema representation of the Blaze dataframe by passing the schema function:

twts_bz_df.schema
Out[73]:
dshape("""{
  id: ?string,
  created_at: ?string,
  user_id: ?string,
  user_name: ?string,
  tweet_text: ?string,
  url: ?string
  }""")

The .dshape function gives a record count and the schema:

twts_bz_df.dshape
Out[74]: 
dshape("""19 * {
  id: ?string,
  created_at: ?string,
  user_id: ?string,
  user_name: ?string,
  tweet_text: ?string,
  url: ?string
  }""")

We can print the Blaze dataframe content:

twts_bz_df.data
Out[75]:
id    created_at    user_id    user_name    tweet_text    url
1    598831111406510082    2015-05-14 12:43:57   14755521 raulsaeztapia    RT @pacoid: Great recap of @StrataConf EU in L...    http://www.mango-solutions.com/wp/2015/05/the-...
2    598831111406510082    2015-05-14 12:43:57    14755521 raulsaeztapia    RT @pacoid: Great recap of @StrataConf EU in L...    http://www.mango-solutions.com/wp/2015/05/the-...
... 
18   598782970082807808    2015-05-14 09:32:39    1377652806 embeddedcomputer.nl    RT @BigDataTechCon: Moving Rating Prediction w...    http://buff.ly/1QBpk8J
19   598777933730160640     2015-05-14 09:12:38   294862170    Ellen Friedman   I'm still on Euro time. If you are too check o...http://bit.ly/1Hfd0Xm

We extract the column tweet_text and take the unique values:

twts_bz_df.tweet_text.distinct()
Out[76]:
    tweet_text
0   RT @pacoid: Great recap of @StrataConf EU in L...
1   RT @alvaroagea: Simply @ApacheSpark http://t.c...
2   RT @PrabhaGana: What exactly is @ApacheSpark a...
3   RT @Ellen_Friedman: I'm still on Euro time. If...
4   RT @BigDataTechCon: Moving Rating Prediction w...
5   I'm still on Euro time. If you are too check o...

We extract multiple columns ['id', 'user_name','tweet_text'] from the dataframe and take the unique records:

twts_bz_df[['id', 'user_name','tweet_text']].distinct()
Out[78]:
  id   user_name   tweet_text
0   598831111406510082   raulsaeztapia   RT @pacoid: Great recap of @StrataConf EU in L...
1   598808944719593472   raulsaeztapia   RT @alvaroagea: Simply @ApacheSpark http://t.c...
2   598796205091500032   John Humphreys   RT @PrabhaGana: What exactly is @ApacheSpark a...
3   598788561127735296   Leonardo D'Ambrosi   RT @Ellen_Friedman: I'm still on Euro time. If...
4   598785545557438464   Alexey Kosenkov   RT @Ellen_Friedman: I'm still on Euro time. If...
5   598782970082807808   embeddedcomputer.nl   RT @BigDataTechCon: Moving Rating Prediction w...
6   598777933730160640   Ellen Friedman   I'm still on Euro time. If you are too check o...

Transferring data using Odo

Odo is a spin-off project of Blaze. Odo allows the interchange of data. Odo ensures the migration of data across different formats (CSV, JSON, HDFS, and more) and across different databases (SQL databases, MongoDB, and so on) using a very simple predicate:

Odo(source, target)

To transfer to a database, the address is specified using a URL. For example, for a MongoDB database, it would look like this:

mongodb://username:password@hostname:port/database_name::collection_name

Let's run some examples of using Odo. Here, we illustrate odo by reading a CSV file and creating a Blaze dataframe:

filepath   = csvFpath
filename   = csvFname
filesuffix = csvSuffix
twts_odo_df = Data('{0}/{1}.{2}'.format(filepath, filename, filesuffix))

Count the number of records in the dataframe:

twts_odo_df.count()
Out[81]:
19

Display the five initial records of the dataframe:

twts_odo_df.head(5)
Out[82]:
  id   created_at   user_id   user_name   tweet_text   url
0   598831111406510082   2015-05-14 12:43:57   14755521   raulsaeztapia   RT @pacoid: Great recap of @StrataConf EU in L...   http://www.mango-solutions.com/wp/2015/05/the-...
1   598831111406510082   2015-05-14 12:43:57   14755521   raulsaeztapia   RT @pacoid: Great recap of @StrataConf EU in L...   http://www.mango-solutions.com/wp/2015/05/the-...
2   598808944719593472   2015-05-14 11:15:52   14755521   raulsaeztapia   RT @alvaroagea: Simply @ApacheSpark http://t.c...   http://www.webex.com/ciscospark/
3   598808944719593472   2015-05-14 11:15:52   14755521   raulsaeztapia   RT @alvaroagea: Simply @ApacheSpark http://t.c...   http://sparkjava.com/
4   598808944719593472   2015-05-14 11:15:52   14755521   raulsaeztapia   RT @alvaroagea: Simply @ApacheSpark http://t.c...   https://www.sparkfun.com/

Get dshape information from the dataframe, which gives us the number of records and the schema:

twts_odo_df.dshape
Out[83]:
dshape("var * {
  id: int64,
  created_at: ?datetime,
  user_id: int64,
  user_name: ?string,
  tweet_text: ?string,
  url: ?string
  }""")

Save a processed Blaze dataframe into JSON:

odo(twts_odo_distinct_df, '{0}/{1}.{2}'.format(jsonFpath, jsonFname, jsonSuffix))
Out[92]:
<odo.backends.json.JSONLines at 0x7f77f0abfc50>

Convert a JSON file to a CSV file:

odo('{0}/{1}.{2}'.format(jsonFpath, jsonFname, jsonSuffix), '{0}/{1}.{2}'.format(csvFpath, csvFname, csvSuffix))
Out[94]:
<odo.backends.csv.CSV at 0x7f77f0abfe10>
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset