Apache Zeppelin is a web-based notebook that enables data-driven, interactive analytics with built-in visualizations. It supports multiple languages with an interpreter framework. Currently, it supports interpreters such as Spark, Markdown, Shell, Hive, Phoenix, Tajo, Flink, Ignite, Lens, HBase, Cassandra, Elasticsearch, Geode, PostgreSQL, and Hawq. It can be used for data ingestion, discovery, analytics, and visualizations using notebooks similar to IPython Notebooks. Zeppelin notebooks recognize output from any language and visualize these using the same tools.
The Zeppelin project entered incubation at the Apache Software Foundation in December 2014 and became a top-level project in May 2016. Zeppelin has four main components, as shown in the architecture in Figure 6.3:
The components in the Zeppelin architecture are described as follows:
Each product has its own strengths and weaknesses. We need to understand the differences in order to use the right tool for the right use case. The following table shows you the differences between Jupyter and Zeppelin:
| | Jupyter | Zeppelin |
|---|---|---|
| Evolution | A long history, large community support, and stable | Relatively young |
| Type of software | Open source | Open source with Apache Releases |
| Visualization of results | Using tools such as matplotlib | Built-in tools in the notebook for graphs and charts |
| Customization of forms | No dynamic forms | Dynamic forms with user-provided inputs |
| Tab completion | Jupyter provides tab completion | Zeppelin does not provide tab completion yet |
| Languages/components supported | Over 40 languages including Python, Julia, and R | Interpreters such as Scala and Python with Apache Spark, Spark SQL, Hive, Markdown, Shell, HBase, Flink, Cassandra, Elasticsearch, Tajo, HDFS, Ignite, Lens, PostgreSQL, Hawq, Scalding, and Geode |
| Mixing multiple languages | Not easy to mix multiple languages in the same notebook | It's quite easy to mix multiple languages in the same notebook |
| Implementation | Python-based | JVM-based |
| Environments | Jupyter is a generic tool that can be used in any environment | Zeppelin is more suitable for Hadoop and Spark installations |
The latest Hortonworks Sandbox provides a preconfigured Zeppelin service that you can use to try it out quickly. If you want to install Zeppelin on a cluster, there are two ways to do so: use the Hortonworks Ambari service or the manual installation method. The Ambari service can be used for Hortonworks-based installations, and the manual installation can be used for Hortonworks, Cloudera, and MapR distributions.
Use the following instructions to install, configure, and start the Zeppelin service on Ambari:
    VERSION=`hdp-select status hadoop-client | sed 's/hadoop-client - \([0-9]\.[0-9]\).*/\1/'`
    sudo git clone https://github.com/hortonworks-gallery/ambari-zeppelin-service.git /var/lib/ambari-server/resources/stacks/HDP/$VERSION/services/ZEPPELIN
    sudo ambari-server restart
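The `sed` expression in the commands above reduces the `hdp-select` output to a major.minor stack version, which then selects the stack directory. As a minimal sketch of the same extraction in Python, assuming the status line looks like `hadoop-client - 2.4.0.0-169` (the sample string and the function name `hdp_stack_version` are illustrative assumptions, not part of any Hortonworks tool):

```python
import re


def hdp_stack_version(status_line):
    """Return the major.minor version from an hdp-select status line."""
    match = re.match(r"hadoop-client - ([0-9]\.[0-9])", status_line)
    if match is None:
        # Unlike sed, which passes a non-matching line through unchanged,
        # this sketch fails loudly so a bad path is never built.
        raise ValueError("unexpected hdp-select output: %r" % status_line)
    return match.group(1)


sample = "hadoop-client - 2.4.0.0-169"  # assumed sample output
print(hdp_stack_version(sample))
```

Checking the extracted value before cloning helps avoid creating the service directory under a stack version that does not exist.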
Go to `ipaddressofsandbox:8080` and log in with the `admin`/`admin` credentials. The Apache Zeppelin service is now included in the stack and can be added as a service. At the bottom left of the Ambari page, click on Actions, click on Add Service, check Zeppelin service, configure it, and deploy.
During the configuration step, change the following parameters as necessary:
- `spark.home`: Use the standard `/usr/hdp/current/spark` or any custom Spark version installed.
- `zeppelin.server.port`: This is the port number where the Zeppelin server listens. Use any unused port.
- `zeppelin.setup.prebuilt`: Make it `false` to get the latest code base.

Use the following commands to install and configure the Apache Zeppelin service manually:
    wget http://mirror.metrocast.net/apache/zeppelin/zeppelin-0.6.1/zeppelin-0.6.1-bin-all.tgz
    tar xzvf zeppelin-0.6.1-bin-all.tgz
    cd zeppelin-0.6.1-bin-all/conf
To access the Hive metastore, copy `hive-site.xml` to the `conf` directory of Zeppelin:
cp /etc/hive/conf/hive-site.xml .
Copy the configuration template files as follows:
    cp zeppelin-env.sh.template zeppelin-env.sh
    cp zeppelin-site.xml.template zeppelin-site.xml
Add the following lines to the `zeppelin-env.sh` file:
    export JAVA_HOME=/usr/lib/jvm/java
    export MASTER=yarn-client
    export HADOOP_CONF_DIR=/etc/hadoop/conf
Add the following lines to `zeppelin-site.xml`:
    <property>
      <name>zeppelin.server.addr</name>
      <value>sandbox.hortonworks.com</value>
      <description>Server address</description>
    </property>
    <property>
      <name>zeppelin.server.port</name>
      <value>9999</value>
      <description>Server port.</description>
    </property>
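Before starting the service, it can help to confirm that the file still parses and that the address and port are what you expect. The following is a minimal sketch using only the Python standard library. It assumes the two properties above sit inside a `<configuration>` root element, as in other Hadoop-style site files; the helper name `read_properties` is hypothetical.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Map each <property> name to its value in a Hadoop-style config file."""
    root = ET.fromstring(xml_text.strip())
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }


# Assumed layout: the properties from the text wrapped in a <configuration> root.
site_xml = """
<configuration>
  <property>
    <name>zeppelin.server.addr</name>
    <value>sandbox.hortonworks.com</value>
  </property>
  <property>
    <name>zeppelin.server.port</name>
    <value>9999</value>
  </property>
</configuration>
"""

settings = read_properties(site_xml)
print(settings["zeppelin.server.addr"], settings["zeppelin.server.port"])
```

To check a real installation, read the contents of `conf/zeppelin-site.xml` and pass them to the same function.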
Finally, start the Zeppelin service from the `bin` directory with the following command:
    cd ../bin/
    ./zeppelin-daemon.sh start
Now you can access your notebook at `http://host.ip.address:9999`.
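If the page does not load, a quick first check is whether anything is listening on the configured port. The sketch below is one way to test that from Python. The function name `is_port_open` is an illustrative assumption, and the host and port in the usage comment are the placeholders from the text, not real values.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Example usage (replace the placeholder host with your server's address):
# is_port_open("host.ip.address", 9999)
```

A `False` result only shows that the TCP connection failed. The Zeppelin logs are still the place to find out why the server did not start.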
Zeppelin provides multiple interpreters in the same notebook. So, you can write Scala, Python, SQL, and others in the same notebook.
Click on the Notebook menu option at the top of the screen, and then click on Create new note and provide a meaningful name for the notebook. The newly created notebook can be opened from the main screen or from the Notebook menu option. Click on the newly created notebook and then on the interpreter binding button in the upper right corner. Click on the interpreters to bind or unbind the interpreters. You can change the order of interpreters by dragging and dropping them. The first one on the list will be the default interpreter in the notebook. Finally, click on the Save option at the bottom to save the changes.
Now, click on the Interpreter menu option at the top and then on edit to change Spark properties such as `master`, `spark.cores.max`, `spark.executor.memory`, and `args` as needed by the application. Click on Save to apply the changes and restart the interpreter with the new settings. You can also restart any specific interpreter by clicking on the restart button.
You are now ready to code. As `%spark` is first in the list of interpreter bindings, you don't need to type `%spark` to write Scala code in a paragraph. However, if you are writing any other code in a paragraph, say, PySpark, you need to start it with `%pyspark`. Add Markdown text in the first paragraph to describe the notebook, and write code in the paragraphs that follow. Finally, to visualize the result, write `%sql` or `%table` in a separate paragraph.
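The rule described above (an explicit `%name` prefix wins, otherwise the first bound interpreter is used) can be illustrated with a small Python sketch. This is only a model of the behavior described in the text, not Zeppelin's actual parser; the function `resolve_interpreter` and the binding order are assumptions made for the example.

```python
def resolve_interpreter(paragraph, bindings):
    """Pick the interpreter name and code body for a paragraph.

    A leading %name directive selects that interpreter; otherwise the
    first entry in the binding list is used as the default.
    """
    text = paragraph.lstrip()
    if text.startswith("%"):
        parts = text.split(None, 1)          # directive, then the rest
        name = parts[0][1:]                  # drop the leading '%'
        body = parts[1] if len(parts) > 1 else ""
        return name, body
    return bindings[0], text


bindings = ["spark", "pyspark", "sql", "md"]  # assumed binding order
print(resolve_interpreter("val x = 1", bindings))
print(resolve_interpreter("%pyspark\nprint(1)", bindings))
```

Reordering `bindings` changes which interpreter handles paragraphs without a prefix, which mirrors dragging an interpreter to the top of the binding list.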
Write code from previous chapters or use the Zeppelin Tutorial notebook that comes along with Zeppelin for a quick start. You can use the following code to analyze Ambari agent logs:
    %pyspark
    # Count purely alphabetic words in the Ambari agent log
    words = (sc.textFile('file:///var/log/ambari-agent/ambari-agent.log')
        .flatMap(lambda x: x.lower().split(' '))
        .filter(lambda x: x.isalpha())
        .map(lambda x: (x, 1))
        .reduceByKey(lambda a, b: a + b))
    # Register the (word, count) pairs as a table for the %sql paragraph
    sqlContext.registerDataFrameAsTable(sqlContext.createDataFrame(words, ['word', 'count']), 'words')

    %sql
    select word, max(count) from words group by word
The output of the preceding code looks similar to Figure 6.4:
If you get any errors, check the logs in the `logs` directory of Zeppelin.
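When debugging the log analysis paragraph, it can be useful to reproduce its logic on a few lines without Spark. The following plain Python sketch mirrors the `flatMap`, `filter`, `map`, and `reduceByKey` steps. The sample log lines are made up for illustration, and `count_words` is a hypothetical helper, not part of Spark or Zeppelin.

```python
from collections import Counter


def count_words(lines):
    """Count purely alphabetic, lowercased tokens split on single spaces."""
    counts = Counter()
    for line in lines:
        for token in line.lower().split(' '):   # flatMap step
            if token.isalpha():                 # filter step
                counts[token] += 1              # map + reduceByKey steps
    return counts


sample = ["INFO Controller started", "INFO heartbeat sent 42"]  # made-up log lines
print(count_words(sample).most_common(1))
```

Because the reduce step already yields one row per word, the `max(count)` in the `%sql` paragraph simply returns that word's count.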
Hortonworks Gallery has prebuilt notebooks at https://github.com/hortonworks-gallery/zeppelin-notebooks to play with Spark, PySpark, Spark SQL, Spark Streaming, Hive, and so on.
Any existing notebook can be viewed at the ZeppelinHub Viewer: https://www.zeppelinhub.com/viewer
There are multiple ways to share a notebook with others. Other users on the same cluster can access and run the notebook with the URL of the notebook. You can also share the notebook in report mode by clicking on the drop-down list in the upper right corner and then choosing report.