Before installing Hadoop and Spark, let's understand their versions. Spark is offered as a service in all three popular Hadoop distributions: Cloudera, Hortonworks, and MapR. As of this writing, the current Hadoop and Spark versions are 2.7.2 and 2.0 respectively. However, a Hadoop distribution might ship a lower version of Spark, as the Hadoop and Spark release cycles do not coincide.
For the practical exercises in the upcoming chapters, let's use one of the free virtual machines (VMs) from Cloudera, Hortonworks, or MapR, or an open source build of Apache Spark. These VMs make it easy to get started with Spark and Hadoop. The same exercises can be run on bigger clusters as well.
The prerequisites to use virtual machines on your laptop are as follows:
The instructions to download and run Cloudera Distribution for Hadoop (CDH) are as follows:
1. In VMware Player, select the cloudera-quickstart-vm-5.x.x-x-vmware.vmx file and click on Open.
2. Increase the memory to 7 GB (if your laptop has 8 GB RAM) or 8 GB (if your laptop has more than 8 GB RAM). Increase the number of processors to four. Click on OK.
3. Log in with the username (cloudera) and password (cloudera).
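The memory sizing rule above (7 GB on an 8 GB laptop, 8 GB when the host has more RAM) can be sketched as a small helper. Note that `vm_memory_gb` is a hypothetical function for illustration, not part of any Cloudera tooling:

```shell
#!/bin/sh
# Hypothetical helper encoding the sizing rule above:
# 7 GB of VM memory on an 8 GB laptop, 8 GB when the host has more RAM.
vm_memory_gb() {
  host_gb=$1
  if [ "$host_gb" -le 8 ]; then
    echo 7
  else
    echo 8
  fi
}

vm_memory_gb 8    # prints 7
vm_memory_gb 16   # prints 8
```

Leaving at least 1 GB for the host operating system keeps the laptop responsive while the VM runs.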
).If you would like to use the Cloudera Quickstart Docker image, follow the instructions on http://blog.cloudera.com/blog/2015/12/docker-is-the-new-quickstart-option-for-apache-hadoop-and-cloudera.
The instructions to download and run Hortonworks Data Platform (HDP) Sandbox are as follows:
1. Open the sandbox URL shown on the console (for example, http://192.168.139.158/) in a browser. Click on View Advanced Options to see all the links.
2. Connect with putty as the root user with hadoop as the initial password. You need to change the password on the first login. Also, run the ambari-admin-password-reset command to reset the Ambari admin password.
3. Log in to Ambari at ipaddressofsandbox:8080 with the admin credentials created in the preceding step. Start the services needed in Ambari.
4. To reach the sandbox by hostname from Windows, edit C:\Windows\System32\drivers\etc\hosts and enter the IP address and hostname separated by a space. You need admin rights to do this.

The instructions to download and run the MapR Sandbox are as follows:
1. Log in to the sandbox with mapr as both the username and password.
2. On Windows, edit C:\Windows\System32\drivers\etc\hosts and enter the IP address and hostname separated by a space.

The instructions to download and run the Apache Spark prebuilt binaries, in case you have a preinstalled Hadoop cluster, are given here. These instructions can also be used to install the latest version of Spark on the preceding VMs:
wget http://apache.mirrors.tds.net/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
tar xzvf spark-2.0.0-bin-hadoop2.7.tgz
cd spark-2.0.0-bin-hadoop2.7
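The tarball name encodes both the Spark and Hadoop versions, which is handy when scripting installs against other mirrors. A purely illustrative sketch of pulling those out with POSIX string handling:

```shell
#!/bin/sh
# Illustrative only: extract the Spark and Hadoop versions from the
# tarball name used above.
tarball=spark-2.0.0-bin-hadoop2.7.tgz
base=${tarball%.tgz}                     # spark-2.0.0-bin-hadoop2.7
spark_ver=$(echo "$base" | cut -d- -f2)  # 2.0.0
hadoop_ver=${base##*hadoop}              # 2.7
echo "Spark $spark_ver built for Hadoop $hadoop_ver"
```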
Add the SPARK_HOME and PATH variables to the profile script, as shown in the following commands, so that these environment variables are set every time you log in:

[cloudera@quickstart ~]$ cat /etc/profile.d/spark2.sh
export SPARK_HOME=/home/cloudera/spark-2.0.0-bin-hadoop2.7
export PATH=$PATH:/home/cloudera/spark-2.0.0-bin-hadoop2.7/bin
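To confirm the profile script takes effect, you can source it and inspect the variables. This sketch writes a stand-in copy under /tmp so it can be tried without root; the real file lives at /etc/profile.d/spark2.sh:

```shell
#!/bin/sh
# Stand-in for /etc/profile.d/spark2.sh, written to /tmp for the check
profile=/tmp/spark2-profile-check.sh
cat > "$profile" <<'EOF'
export SPARK_HOME=/home/cloudera/spark-2.0.0-bin-hadoop2.7
export PATH=$PATH:/home/cloudera/spark-2.0.0-bin-hadoop2.7/bin
EOF

. "$profile"
echo "$SPARK_HOME"                      # prints the Spark install path
case ":$PATH:" in
  *":$SPARK_HOME/bin:"*) echo "spark-shell will be found on PATH" ;;
  *) echo "PATH is missing $SPARK_HOME/bin" ;;
esac
```

On a real system, simply log out and back in (or source /etc/profile.d/spark2.sh) and run the same checks.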
Change the configuration in spark-env.sh. Copy the template files in the conf directory and open spark-env.sh for editing:

cp conf/spark-env.sh.template conf/spark-env.sh
cp conf/spark-defaults.conf.template conf/spark-defaults.conf
vi conf/spark-env.sh

Add the following lines in spark-env.sh:

export HADOOP_CONF_DIR=/etc/hadoop/conf
export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
To enable Spark to access Hive tables, copy hive-site.xml to the conf directory of Spark:

cp /etc/hive/conf/hive-site.xml conf/
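A quick sanity check after the copy can save debugging later. In this sketch, temporary directories stand in for /etc/hive/conf and Spark's conf directory, so it is safe to dry-run anywhere:

```shell
#!/bin/sh
# Sketch of the hive-site.xml copy with a sanity check. The mktemp
# directories are stand-ins for the real paths.
hive_conf=$(mktemp -d)   # stand-in for /etc/hive/conf
spark_conf=$(mktemp -d)  # stand-in for spark-2.0.0-bin-hadoop2.7/conf
echo '<configuration/>' > "$hive_conf/hive-site.xml"

cp "$hive_conf/hive-site.xml" "$spark_conf"/
if [ -f "$spark_conf/hive-site.xml" ]; then
  echo "hive-site.xml visible to Spark"
fi
```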
To reduce console noise, change the logging level to ERROR in the spark-2.0.0-bin-hadoop2.7/conf/log4j.properties file after copying the template file.

The programming language version requirements to run Spark are as follows:
Java: 7+
Python: 2.6+/3.1+
R: 3.1+
Scala: 2.10 for Spark 1.6 and below, and 2.11 for Spark 2.0 and above
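A simple pre-flight comparison against the minimums above can be scripted. Here `meets_min` is a hypothetical helper, and canned version strings are used so the sketch runs even where the interpreters are absent:

```shell
#!/bin/sh
# Hypothetical pre-flight check against the minimum versions listed
# above. $1 is a "major.minor" version string; $2/$3 are the minimums.
meets_min() {
  major=$(echo "$1" | cut -d. -f1)
  minor=$(echo "$1" | cut -d. -f2)
  [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

meets_min "2.7" 2 6 && echo "Python 2.7: OK for Spark"     # prints the OK line
meets_min "3.0" 3 1 || echo "R 3.0: too old (needs 3.1+)"  # prints the warning
```

In practice you would feed it real output, for example the version reported by python --version or R --version.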
Note that the preceding virtual machines are single node clusters. If you are planning to set up multi-node clusters, follow the guidelines as per the distribution, such as CDH, HDP, or MapR. If you are planning to use a standalone cluster manager, the setup is described in the following chapter.