Step 4: Pausing, restarting, and terminating the Spark cluster

When your computation is done, stop your cluster to avoid incurring additional cost. To stop the cluster, execute the following command from your local machine:

$SPARK_HOME/ec2/spark-ec2 --region=<ec2-region> stop <cluster-name>

In our case, it would be the following:

$SPARK_HOME/ec2/spark-ec2 --region=eu-west-1 stop ec2-spark-cluster-1

To restart the cluster later on, execute the following command:

$SPARK_HOME/ec2/spark-ec2 -i <key-file> --region=<ec2-region> start <cluster-name>

In our case, it will be something like the following:

$SPARK_HOME/ec2/spark-ec2 --identity-file=/usr/local/key/-key-pair.pem --region=eu-west-1 start ec2-spark-cluster-1

Finally, to terminate your Spark cluster on AWS, use the following command:

$SPARK_HOME/ec2/spark-ec2 --region=<ec2-region> destroy <cluster-name>

In our case, it would be the following:

$SPARK_HOME/ec2/spark-ec2 --region=eu-west-1 destroy ec2-spark-cluster-1

Spot instances are great for reducing AWS costs, sometimes cutting instance costs by an order of magnitude. A step-by-step guide to using this facility is available at http://blog.insightdatalabs.com/spark-cluster-step-by-step/.

Sometimes it is difficult to move a large dataset, say a 1 TB raw data file. In that case, if you want your application to scale up to even larger datasets, the fastest approach is to load the data from Amazon S3 or an EBS device into the HDFS running on your nodes and then specify the data file path using hdfs://.
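
One simple way to stage such data on the cluster's HDFS is to read it with Spark and write it back out. The following minimal sketch, submitted with spark-submit, assumes that the bucket name, paths, and application name are placeholders and that the AWS credentials required for s3n:// access are already configured on the cluster:

import org.apache.spark.{SparkConf, SparkContext}

object StageDataOnHdfs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("StageDataOnHdfs"))

    // Read the raw file directly from Amazon S3 (bucket and key are placeholders)
    val raw = sc.textFile("s3n://my-bucket/raw/large-dataset.txt")

    // Write it into the cluster's HDFS so that subsequent jobs can read it
    // via hdfs:// instead of fetching it from S3 again
    raw.saveAsTextFile("hdfs:///data/large-dataset")

    sc.stop()
  }
}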

The data files, or any other files (data, JAR files, scripts, and so on), can be hosted in any of the following ways to make them highly accessible to all nodes:
1. Via URIs/URLs (including HTTP) using http://
2. Via Amazon S3 using s3n://
3. Via HDFS using hdfs://
If the HADOOP_CONF_DIR environment variable is set, a path is usually resolved as hdfs://...; otherwise, it is resolved as file://.
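
As a rough illustration of how these path schemes are used from Spark, the following sketch assumes it is entered in spark-shell (where the SparkContext sc is already available) and uses placeholder file paths:

// Explicit schemes always override the configured default filesystem
val fromHdfs = sc.textFile("hdfs:///data/input.txt")   // the cluster's HDFS
val fromLocal = sc.textFile("file:///tmp/input.txt")   // the local filesystem

// A scheme-less path is resolved against the default filesystem:
// hdfs://... when HADOOP_CONF_DIR points to a Hadoop configuration, file:// otherwise
val fromDefault = sc.textFile("/data/input.txt")

println(Seq(fromHdfs, fromLocal, fromDefault).map(_.count()).mkString(", "))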