While Zeppelin is powerful enough to quickly execute our Spark SQL queries and visualize data, it is still an evolving platform. In this section, we'll take a brief look at Bokeh, one of the most popular visualization frameworks in Python, and use its (also fast-evolving) Scala bindings. Breeze also has a visualization API called breeze-viz, which is built on JFreeChart. Unfortunately, at the time of writing this book, the API is not actively maintained, and therefore we won't be discussing it here.
The power of Zeppelin lies in the ability to share and view graphics in the browser. This is made possible by the D3.js JavaScript visualization library. Bokeh is likewise backed by a JavaScript visualization library, called BokehJS. The Scala bindings library (bokeh-scala) not only gives an easier way to construct glyphs (lines, circles, and so on) out of Scala objects, but also translates glyphs into a format that is understandable by the BokehJS JavaScript components.
A word of warning: the Bokeh-Scala bindings are still evolving and operate at a lower level than the Python API, which sometimes makes them more cumbersome than their Python counterpart. That said, I am sure that we will all appreciate the amazing graphs that we can create right out of Scala.
In this recipe, we will be creating a scatter plot using iris data (https://archive.ics.uci.edu/ml/datasets/Iris), which has the length and width attributes of flowers belonging to three different species of the same plant. Drawing a scatter plot on this dataset involves a series of interesting substeps.
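For the purpose of representing the iris data in a Breeze matrix, I have naïvely transformed the species categories into numbers (0 for Iris setosa, 1 for Iris versicolor, and 2 for Iris virginica, matching the color map used later in this recipe).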
This is available in irisNumeric.csv. Later, we'll see how we can load the original iris data (iris.data) into a Spark DataFrame and use that as a source for plotting.
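To make the species-to-number transformation concrete, here is a minimal sketch of how a line of iris.data could be turned into a line of irisNumeric.csv. The object name IrisEncoding and the helper encodeLine are hypothetical, and the 0/1/2 assignment is an assumption chosen to agree with the color map used later in this recipe; it is not necessarily how the file was actually produced.

```scala
object IrisEncoding {
  // Assumed mapping from the species labels in iris.data to numeric codes.
  val speciesCode: Map[String, Int] = Map(
    "Iris-setosa"     -> 0,
    "Iris-versicolor" -> 1,
    "Iris-virginica"  -> 2
  )

  // Converts one line of iris.data into an all-numeric CSV line.
  // Returns None for blank lines, malformed lines, or unknown species labels.
  def encodeLine(line: String): Option[String] = {
    val fields = line.trim.split(",").map(_.trim)
    if (fields.length != 5) None
    else speciesCode.get(fields(4)).map { code =>
      (fields.take(4) :+ code.toString).mkString(",")
    }
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      "5.1,3.5,1.4,0.2,Iris-setosa",
      "7.0,3.2,4.7,1.4,Iris-versicolor",
      "6.3,3.3,6.0,2.5,Iris-virginica",
      ""
    )
    // Blank or malformed lines are silently dropped by flatMap.
    sample.flatMap(encodeLine).foreach(println)
  }
}
```

Returning an Option keeps the conversion total, so a trailing blank line in the input does not abort the whole run.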
For the sake of clarity, let's define what the various terms in Bokeh actually mean:
Glyph: a geometric shape, such as a line or a circle, configured through properties such as color, x, y, and width.
Plot: a composition of multiple widgets/glyphs.
Document: when we call the save method in the document, it uses all the child renderers in the plot object and constructs a JSON from the wrapped elements. This JSON is eventually read by the BokehJS widgets to render the data in a visually pleasing manner. More than one plot can be rendered in the document by adding it to a grid plot (we'll look at how this is done in the next recipe, Creating a time series MultiPlot with Bokeh-Scala).
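The relationship between these terms can be sketched with a toy model. The classes below are hypothetical stand-ins and are not the bokeh-scala classes, and the JSON layout is invented for illustration; it is not the format that BokehJS actually reads. The sketch only shows the idea that a document walks the plot's renderers and serializes them.

```scala
object ToyBokehModel {
  // A glyph is purely visual: a shape name plus UI properties.
  final case class Glyph(shape: String, props: Map[String, String])

  // A plot is a canvas holding a list of renderers (here, just glyphs).
  final case class Plot(title: String, renderers: List[Glyph])

  // A document wraps a plot and turns it into JSON when asked.
  final case class Document(plot: Plot) {
    // No escaping is done; that is enough for the simple strings in this sketch.
    private def quote(s: String): String = "\"" + s + "\""

    private def glyphJson(g: Glyph): String = {
      // Sort the properties by name so that the output is deterministic.
      val props = g.props.toList.sortBy(_._1)
        .map { case (k, v) => s"${quote(k)}:${quote(v)}" }
        .mkString(",")
      s"""{"shape":${quote(g.shape)},"props":{$props}}"""
    }

    def toJson: String = {
      val rs = plot.renderers.map(glyphJson).mkString(",")
      s"""{"title":${quote(plot.title)},"renderers":[$rs]}"""
    }
  }
}
```

In the real library, the equivalent of toJson happens inside save, which also writes the HTML page that loads BokehJS.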
This consists of a series of steps:
Bokeh plots require our data to be in a format that Bokeh understands, but this is really easy to do. All that we need to do is create a new source object that inherits from ColumnDataSource. The other options are AjaxDataSource and RemoteDataSource.
So, let's overlay our Breeze data source on ColumnDataSource:
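import java.io.File
import breeze.linalg._
import io.continuum.bokeh._

object IrisSource extends ColumnDataSource {
  // Maps each numeric species code to the color used when plotting
  private val colormap = Map[Int, Color](0 -> Color.Red, 1 -> Color.Green, 2 -> Color.Blue)
  // Reads the numeric iris data into a Breeze matrix
  private val iris = csvread(file = new File("irisNumeric.csv"), separator = ',')
  // Exposes each matrix column as a named column of the data source
  val sepalLength = column(iris(::, 0))
  val sepalWidth = column(iris(::, 1))
  val petalLength = column(iris(::, 2))
  val petalWidth = column(iris(::, 3))
  val species = column(iris(::, 4))
}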
The csvread call reads irisNumeric.csv into a matrix using the Breeze library. The color map is something that we'll be using later while plotting; its purpose is to translate each species of flower into a different color. The final piece is where we convert the Breeze matrix into a ColumnDataSource. As required by ColumnDataSource, we select specific columns of the Breeze matrix and map them to corresponding named columns.
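The column selection described above can be illustrated without Breeze. The following is a minimal sketch in plain Scala; ColumnSketch, parseCsv, and column are hypothetical names, and the code only mimics what csvread and the iris(::, n) slices do conceptually.

```scala
object ColumnSketch {
  // Parses comma-separated numeric text into rows, skipping blank lines.
  def parseCsv(text: String): Vector[Vector[Double]] =
    text.split("\n").toVector
      .map(_.trim)
      .filter(_.nonEmpty)
      .map(line => line.split(",").toVector.map(_.trim.toDouble))

  // Picks the values at the given index from every row, like a matrix column slice.
  def column(rows: Vector[Vector[Double]], index: Int): Vector[Double] =
    rows.map(row => row(index))

  def main(args: Array[String]): Unit = {
    val rows = parseCsv("5.1,3.5,1.4,0.2,0\n7.0,3.2,4.7,1.4,1\n")
    val petalLength = column(rows, 2) // third field of every row
    val petalWidth  = column(rows, 3) // fourth field of every row
    println(petalLength)
    println(petalWidth)
  }
}
```

Each named column in IrisSource plays the same role as petalLength and petalWidth here: one sequence of values per attribute, aligned by row.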
Let's set our image's title to Iris Petal Length vs Width and create a document object so that we can save the final HTML under the name IrisBokehBreeze.html. Since we haven't specified the full path of the target file in the save method, the file will be saved in the same directory as the project itself:
val plot = new Plot().title("Iris Petal Length vs Width")
val document = new Document(plot)
val file = document.save("IrisBokehBreeze.html")
println(s"Saved the chart as ${file.url}")
Our plot has neither data nor any glyphs. Let's first create a marker object that marks the data point. There are a variety of marker objects to choose from: Asterisk, Circle, CircleCross, CircleX, Cross, Diamond, DiamondCross, InvertedTriangle, PlainX, Square, SquareCross, SquareX, and Triangle.
Let's choose Diamond for our purpose:
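// Bring the columns (petalLength, petalWidth, and so on) into scope
import IrisSource._

val diamond = new Diamond()
  .x(petalLength)
  .y(petalWidth)
  .fill_color(Color.Blue)
  .fill_alpha(0.5)
  .size(5)

// Bind the marker to the data source
val dataPointRenderer = new GlyphRenderer().data_source(IrisSource).glyph(diamond)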
While constructing the marker object, other than the UI attributes, we also say what the x and the y coordinates for it are. Note that we have also mentioned that the color of this marker is blue. We'll change that in a while using the color map.
The plot needs to know what its x and y data ranges are before rendering. Let's do that by creating two DataRange1d objects and setting them on the plot:
val xRange = new DataRange1d().sources(petalLength :: Nil)
val yRange = new DataRange1d().sources(petalWidth :: Nil)
plot.x_range(xRange).y_range(yRange)
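Conceptually, a data range covers the extent of the columns it is given as sources. The following is a minimal sketch of that idea in plain Scala; DataRangeSketch, extent, and combinedExtent are hypothetical names, and this is not how the library itself computes ranges.

```scala
object DataRangeSketch {
  // Returns (min, max) of a column, or None when the column is empty.
  def extent(values: Seq[Double]): Option[(Double, Double)] =
    if (values.isEmpty) None else Some((values.min, values.max))

  // Combines several source columns into a single range covering all of them.
  def combinedExtent(columns: Seq[Seq[Double]]): Option[(Double, Double)] =
    extent(columns.flatten)

  def main(args: Array[String]): Unit = {
    val petalLength = Seq(1.4, 4.7, 6.0)
    val petalWidth  = Seq(0.2, 1.4, 2.5)
    println(extent(petalLength)) // range for the x axis
    println(extent(petalWidth))  // range for the y axis
  }
}
```

This is also why the range takes a list of sources: several columns can share one axis, and the range has to span all of them.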
Let's try and run the first cut of this program.
The following is the output:
We see that this needs a lot of work to be done. Let's do it bit by bit.
Let's now draw the axes, set their bounds, and add them to the plot's renderers. We also need to let the plot know which location each axis belongs to:
// X and Y axes
val xAxis = new LinearAxis().plot(plot).axis_label("Petal Length").bounds((1.0, 7.0))
val yAxis = new LinearAxis().plot(plot).axis_label("Petal Width").bounds((0.0, 2.5))
plot.below <<= (listRenderer => (xAxis :: listRenderer))
plot.left <<= (listRenderer => (yAxis :: listRenderer))

// Add the renderers to the plot
plot.renderers := List(xAxis, yAxis, dataPointRenderer)
Here is the output:
All the data points are marked with blue as of now, but we would really like to differentiate the species visually. This is a simple two-step process:
First, add a new column (speciesColor) into our ColumnDataSource to hold colors that represent the species:
object IrisSource extends ColumnDataSource {
  private val colormap = Map[Int, Color](0 -> Color.Red, 1 -> Color.Green, 2 -> Color.Blue)
  private val iris = csvread(file = new File("irisNumeric.csv"), separator = ',')
  val sepalLength = column(iris(::, 0))
  val sepalWidth = column(iris(::, 1))
  val petalLength = column(iris(::, 2))
  val petalWidth = column(iris(::, 3))
  val species = column(iris(::, 4))
  // Derived column: look up the color for each (rounded) species code
  val speciesColor = column(species.value.map(v => colormap(v.round.toInt)))
}
So, we assign red to Iris setosa, green to Iris versicolor and blue to Iris virginica.
Second, modify the diamond marker to take this column as input instead of accepting a static blue:
val diamond = new Diamond()
  .x(petalLength)
  .y(petalWidth)
  .fill_color(speciesColor)
  .fill_alpha(0.5)
  .size(10)
The output is as follows:
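The species-to-color lookup used for the speciesColor column can be tried out in isolation. The following is a minimal sketch in plain Scala; SpeciesColorSketch and colorsFor are hypothetical names, and plain color-name strings stand in for the binding's Color values.

```scala
object SpeciesColorSketch {
  // Color names stand in for the binding's Color values in this sketch.
  val colormap: Map[Int, String] = Map(0 -> "red", 1 -> "green", 2 -> "blue")

  // Species codes arrive as Doubles from the numeric matrix, so round before lookup.
  // Unknown codes fall back to "gray" instead of throwing an exception.
  def colorsFor(species: Seq[Double]): Seq[String] =
    species.map(v => colormap.getOrElse(v.round.toInt, "gray"))

  def main(args: Array[String]): Unit =
    println(colorsFor(Seq(0.0, 1.0, 2.0)))
}
```

Note one deliberate difference: the recipe's colormap(...) call throws on a code that is missing from the map, whereas this sketch uses getOrElse with a fallback color.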
It looks fairly okay now. Let's add some tools to the image. Bokeh has some nice tools that can be attached to the image: BoxSelectTool, BoxZoomTool, CrosshairTool, HoverTool, LassoSelectTool, PanTool, PolySelectTool, PreviewSaveTool, ResetTool, ResizeTool, SelectTool, TapTool, TransientSelectTool, and WheelZoomTool.
Let's add a few of them for fun:
val panTool = new PanTool().plot(plot)
val wheelZoomTool = new WheelZoomTool().plot(plot)
val previewSaveTool = new PreviewSaveTool().plot(plot)
val resetTool = new ResetTool().plot(plot)
val resizeTool = new ResizeTool().plot(plot)
val crosshairTool = new CrosshairTool().plot(plot)
plot.tools := List(panTool, wheelZoomTool, previewSaveTool, resetTool, resizeTool, crosshairTool)
While we have the crosshair tool, which helps us locate the exact x and y values of a particular data point, it would be nice to have a data grid too. Let's add two data grids, one for the x axis and one for the y axis:
val xAxis = new LinearAxis().plot(plot).axis_label("Petal Length").bounds((1.0, 7.0))
val yAxis = new LinearAxis().plot(plot).axis_label("Petal Width").bounds((0.0, 2.5))
val xgrid = new Grid().plot(plot).axis(xAxis).dimension(0)
val ygrid = new Grid().plot(plot).axis(yAxis).dimension(1)
Next, let's add the grids to the plot renderer list too:
plot.renderers := List(xAxis, yAxis, dataPointRenderer, xgrid, ygrid)
Adding a legend is a bit tricky in the Scala binding of Bokeh due to the lack of high-level graphing objects, such as scatter. For now, let's cook up our own legend. The legends property of the Legend object accepts a list of tuples, each pairing a label with a list of GlyphRenderer objects. Let's explicitly create three GlyphRenderer objects wrapping diamonds of three colors, which represent the species. We then add them to the plot:
val setosa = new Diamond().fill_color(Color.Red).size(10).fill_alpha(0.5)
val setosaGlyphRnd = new GlyphRenderer().glyph(setosa)
val versicolor = new Diamond().fill_color(Color.Green).size(10).fill_alpha(0.5)
val versicolorGlyphRnd = new GlyphRenderer().glyph(versicolor)
val virginica = new Diamond().fill_color(Color.Blue).size(10).fill_alpha(0.5)
val virginicaGlyphRnd = new GlyphRenderer().glyph(virginica)

val legends = List(
  "setosa" -> List(setosaGlyphRnd),
  "versicolor" -> List(versicolorGlyphRnd),
  "virginica" -> List(virginicaGlyphRnd))
val legend = new Legend().orientation(LegendOrientation.TopLeft).plot(plot).legends(legends)

plot.renderers := List(xAxis, yAxis, dataPointRenderer, xgrid, ygrid, legend, setosaGlyphRnd, virginicaGlyphRnd, versicolorGlyphRnd)
The code for this recipe can be found at https://github.com/arunma/ScalaDataAnalysisCookbook/blob/master/chapter4-visualization/src/main/scala/com/packt/scalada/viz/breeze.