The following are the steps to build the Spark source code with Maven:
- Increase MaxPermSize, the JVM's permanent generation limit (the setting matters on Java 7; Java 8 and later ignore it):
$ echo 'export _JAVA_OPTIONS="-XX:MaxPermSize=1G"' >> /home/hduser/.bashrc
- Open a new terminal window and download the Spark source code from GitHub:
$ wget https://github.com/apache/spark/archive/branch-2.1.zip
- Unpack the archive:
$ unzip branch-2.1.zip
- Rename the unzipped folder to spark:
$ mv spark-branch-2.1 spark
- Move to the spark directory:
$ cd spark
- Compile the sources with the YARN, Hadoop 2.7, and Hive profiles enabled, skipping the tests for faster compilation:
$ mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive -DskipTests clean package
- Return to the parent directory and move the conf folder to /etc/spark so that it can later be replaced by a symbolic link:
$ cd ..
$ sudo mv spark/conf /etc/spark
- Move the spark directory to /opt as it's an add-on software package:
$ sudo mv spark /opt/infoobjects/spark
- Change the ownership of the spark home directory to root:
$ sudo chown -R root:root /opt/infoobjects/spark
- Change the permissions of the spark home directory to 0755 (user: rwx, group: r-x, world: r-x):
$ sudo chmod -R 755 /opt/infoobjects/spark
- Move to the spark home directory:
$ cd /opt/infoobjects/spark
- Create a symbolic link:
$ sudo ln -s /etc/spark conf
- Put the Spark executables on the path by editing .bashrc (the single quotes keep $PATH unexpanded until .bashrc is sourced):
$ echo 'export PATH=$PATH:/opt/infoobjects/spark/bin' >> /home/hduser/.bashrc
- Create the log directory in /var:
$ sudo mkdir -p /var/log/spark
- Make hduser the owner of Spark's log directory:
$ sudo chown -R hduser:hduser /var/log/spark
- Create Spark's tmp directory:
$ mkdir /tmp/spark
- Configure Spark by appending the following settings to spark-env.sh:
$ cd /etc/spark
$ echo "export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
$ echo "export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
$ echo "export SPARK_LOG_DIR=/var/log/spark" >> spark-env.sh
$ echo "export SPARK_WORKER_DIR=/tmp/spark" >> spark-env.sh
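Because the configuration step appends with >>, running it a second time leaves duplicate lines in spark-env.sh. The following is a minimal sketch of a rerunnable variant, not part of the original recipe. The helper name append_once and the SPARK_ENV_FILE variable are hypothetical, introduced only for illustration. By default the sketch writes to a scratch file, so it can be tried without touching /etc/spark.

```shell
#!/usr/bin/env bash
# append_once FILE LINE: append LINE to FILE unless an identical line already exists.
# (append_once is a hypothetical helper name; it is not part of Spark.)
append_once() {
    local file="$1" line="$2"
    touch "$file"
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -Fxq -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Assumption: SPARK_ENV_FILE selects the target file. It defaults to a
# scratch file; set it to /etc/spark/spark-env.sh for the real setup.
SPARK_ENV_FILE="${SPARK_ENV_FILE:-$(mktemp)}"

append_once "$SPARK_ENV_FILE" 'export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop'
append_once "$SPARK_ENV_FILE" 'export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop'
append_once "$SPARK_ENV_FILE" 'export SPARK_LOG_DIR=/var/log/spark'
append_once "$SPARK_ENV_FILE" 'export SPARK_WORKER_DIR=/tmp/spark'
```

To apply it to the real file, save the sketch under a name of your choice and run it with SPARK_ENV_FILE=/etc/spark/spark-env.sh set in the environment. Running it again leaves the file unchanged.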