The Jupyter Notebook supports over 40 languages and integrates with Spark and Hadoop to query data interactively and visualize results with ggplot2, matplotlib, and other libraries.
The Jupyter project evolved from the IPython project. Over time, IPython accumulated support for many languages other than Python, so the IPython name no longer fit the project and it was renamed Jupyter, inspired by the Julia, Python, and R languages. IPython continues to exist as the Python kernel for Jupyter. In short, IPython supports the Python language, while Jupyter is language-agnostic. Jupyter provides the following features:
Conversion of notebooks from the native .ipynb format to other formats such as .html, .pdf, and .markdown

The Jupyter web-based notebook automatically detects installed kernels such as Python, Scala, R, and Julia. Users can select the programming language of their choice for each individual notebook from a drop-down menu. The UI logic, such as syntax highlighting, logos, and help menus, is automatically updated as the programming language of a notebook is changed.
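Under the hood, an .ipynb file is plain JSON, which is what makes conversion to other formats possible. The following is a minimal sketch, using only the standard library, of the nbformat 4 JSON layout that Jupyter reads and writes (the cell contents here are illustrative, not from any real notebook):

```python
import json

# A minimal notebook document in the nbformat 4 JSON layout:
# a list of cells plus metadata and schema-version fields.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('hello from a notebook cell')"]
        }
    ],
    "metadata": {"kernelspec": {"name": "python2",
                                "display_name": "Python 2"}},
    "nbformat": 4,
    "nbformat_minor": 2
}

# Serializing and reloading it round-trips cleanly, like a saved .ipynb file.
text = json.dumps(notebook, indent=1)
doc = json.loads(text)
print(doc["nbformat"], len(doc["cells"]))  # → 4 1
```

Because the format is just JSON, tools such as nbconvert can walk the cell list and emit HTML, PDF, or Markdown from it.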
You need Python 3.3 or above, or Python 2.7, to install Jupyter. Once this requirement is met, installing Jupyter is quite easy with Anaconda or pip using the following commands:
conda install jupyter
pip3 install jupyter
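The interpreter requirement above can be checked from Python itself before installing. A small sketch of that check (the function name is made up for illustration; the version thresholds come from the text):

```python
import sys

# Jupyter requires Python 2.7 or Python 3.3+ (per the requirement above).
def supports_jupyter(version_info):
    major, minor = version_info[0], version_info[1]
    return (major == 2 and minor == 7) or (major == 3 and minor >= 3)

print(supports_jupyter(sys.version_info))  # current interpreter
print(supports_jupyter((2, 6)))            # → False, too old
```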
The IPython kernel is installed automatically with the Jupyter installation. If you want to install other kernels, go to https://github.com/ipython/ipython/wiki/IPython-kernels-for-other-languages, click on the kernel, and follow the procedure. This page also lists all the languages available for Jupyter.
Let's follow this procedure to install the Jupyter Notebook (if you are not using conda) on the Hortonworks Sandbox virtual machine. These instructions work on other distributions (Cloudera and MapR) as well:
yum install nano centos-release-SCL zlib-devel bzip2-devel openssl-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libpng-devel libjpeg-devel atlas-devel
yum groupinstall "Development tools"
yum install python27
source /opt/rh/python27/enable
Install pip and then install the Jupyter Notebook and its dependencies:

sudo yum -y install python-pip
sudo pip install --upgrade pip
pip install numpy scipy pandas scikit-learn tornado pyzmq pygments matplotlib jinja2 jsonschema
pip install jinja2 --upgrade
pip install jupyter
Create a startup script and make it executable:

vi ~/start_ipython_notebook.sh

#!/bin/bash
source /opt/rh/python27/enable
IPYTHON_OPTS="notebook --port 8889 --notebook-dir='/usr/hdp/current/spark-client/' --ip='*' --no-browser" pyspark

chmod +x ~/start_ipython_notebook.sh
Run the script with ./start_ipython_notebook.sh and then open the notebook in a browser, for example at http://192.168.139.165:8889/tree#.
Note that, by default, the Spark application starts in local mode. If you want to start with the YARN cluster manager, change your start command as follows:
[root@sandbox ~]# cat start_ipython_notebook.sh
#!/bin/bash
source /opt/rh/python27/enable
IPYTHON_OPTS="notebook --port 8889 --notebook-dir='/usr/hdp/current/spark-client/' --ip='*' --no-browser" pyspark --master yarn
Hortonworks provides an unsupported Ambari service for Jupyter. The installation and management of Jupyter are easier with this service. Perform the following steps to install and start the Jupyter service within Ambari:
git clone https://github.com/randerzander/jupyter-service
sudo cp -r jupyter-service /var/lib/ambari-server/resources/stacks/HDP/2.4/services/
sudo ambari-server restart
Go to ipaddressofsandbox:8080 and log in with admin/admin credentials. The Jupyter service is now included in the stack and can be added as a service. Click on Actions | Add Service | Select Jupyter | Customize service and deploy. Start the service; the notebook can then be viewed on port 9999 in the browser. You can also add a port forwarding rule for port 9999 so that the notebook can be accessed at hostname:9999.
Change the port number in the configuration if it is already bound to another service.
Before we get started with the analytics of Spark, let's learn some of the important features of the Jupyter Notebook.
Click on New in the upper right corner and select the Python 2 kernel to start a new notebook. Notebooks provide cells and output areas. You write code in a cell and then click the execute button or press Shift + Enter. You can run regular operating system commands such as ls, mkdir, cp, and others, and you get tab completion while typing commands. IPython also provides magic commands that start with the % symbol. A list of magic commands is available with the %lsmagic command.
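For example, a cell like the following mixes a shell command and two common magics. Note that these lines work only inside IPython/Jupyter, not in a plain Python script:

```
!ls                        # run a shell command from within the notebook
%lsmagic                   # list all available line and cell magics
%timeit sum(range(1000))   # time a single expression
```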
You can mark a cell as Code, Markdown, Raw NBConvert, or Heading with the drop-down list on the toolbar. In Markdown cells you can add rich text, links, mathematical formulas, code, and images to document the notebook. Some example Markdown is available at https://guides.github.com/features/mastering-markdown/. A new notebook is created as untitled.ipynb, but you can save it under a filename of your own by clicking the title at the top of the page.
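For instance, switching a cell to Markdown and entering text like the following (an illustrative sample, not from the source) renders as formatted documentation when the cell is executed:

```
# Analysis notes
Some **bold** text, a [link](https://jupyter.org), inline `code`,
and a formula: $e^{i\pi} + 1 = 0$
- a bullet item
```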
Now, let's get started with analytics using Spark. You can execute any exercise from Chapter 3, Deep Dive into Apache Spark, to Chapter 5, Real-Time Analytics with Spark Streaming and Structured Streaming. Commands can be executed one by one, or you can put all the code in one cell and execute it at once. You can open multiple notebooks and run them on the same SparkContext. Let's run a simple word count and plot the output with matplotlib:
from operator import add

words = sc.parallelize(["hadoop spark hadoop spark mapreduce spark jupyter ipython notebook interactive analytics"])

# Split into words, pair each with 1, sum the counts, and sort by count
counts = words.flatMap(lambda x: x.split(' ')) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(add) \
              .sortBy(lambda x: x[1])

%matplotlib inline
import matplotlib.pyplot as plt

def plot(counts):
    labels = map(lambda x: x[0], counts)
    values = map(lambda y: y[1], counts)
    plt.barh(range(len(values)), values, color='green')
    plt.yticks(range(len(values)), labels)
    plt.show()

plot(counts.collect())
You will see the result as shown in Figure 6.2:
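The Spark pipeline above is the distributed analogue of a plain-Python word count. A minimal local sketch with collections.Counter, which produces the same (word, count) pairs sorted by count without needing a SparkContext, can help verify the expected result:

```python
from collections import Counter

text = "hadoop spark hadoop spark mapreduce spark jupyter ipython notebook interactive analytics"

# flatMap + map + reduceByKey collapses to a Counter over the split words;
# sortBy(lambda x: x[1]) becomes a sort on the count.
counts = sorted(Counter(text.split(' ')).items(), key=lambda x: x[1])
print(counts[-1])  # → ('spark', 3), the most frequent word
```

This confirms what the bar chart shows: spark appears three times, hadoop twice, and every other word once.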