Shared-user Hadoop clusters – using fair and other schedulers

When a user submits a job to Hadoop, the job needs to be assigned resources (computers/hosts) before execution. This process is called scheduling, and a scheduler decides when resources are assigned to a given job.

Hadoop is configured by default with a First In First Out (FIFO) scheduler, which executes jobs in the order they arrive. However, for a deployment that runs many MapReduce jobs and is shared by many users, more complex scheduling policies are needed.

The good news is that the Hadoop scheduler is pluggable, and Hadoop comes with two other schedulers. If required, you can also write your own scheduler.

  • Fair scheduler: This defines pools, and over time each pool receives roughly the same amount of resources.
  • Capacity scheduler: This defines queues, and each queue has a guaranteed capacity. The capacity scheduler shares the compute resources allocated to a queue with other queues if those resources are not in use.

This recipe describes how to change the scheduler in Hadoop.

Getting ready

For this recipe, you need a working Hadoop deployment. Set up Hadoop using the Setting Hadoop in a distributed cluster environment recipe from Chapter 1, Getting Hadoop Up and Running in a Cluster.

How to do it...

  1. Shut down the Hadoop cluster.
  2. You need hadoop-fairscheduler-1.0.0.jar in HADOOP_HOME/lib. In Hadoop 1.0.0 and later releases, this JAR file is already in the right place in the Hadoop distribution.
  3. Add the following code to HADOOP_HOME/conf/mapred-site.xml (a sketch of the complete file follows these steps):
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>
  4. Restart Hadoop.
  5. Verify that the new scheduler has been applied by going to http://<job-tracker-host>:50030/scheduler in your installation. If the scheduler has been properly applied, the page will have the heading "Fair Scheduler Administration".
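The following is a minimal sketch of how the property from step 3 fits into HADOOP_HOME/conf/mapred-site.xml as a whole. The mapred.job.tracker entry and its value are only placeholders for whatever your deployment already contains; the new property is simply added alongside the existing ones inside the <configuration> element:

    <?xml version="1.0"?>
    <configuration>
      <!-- Existing settings from your deployment (placeholder value) -->
      <property>
        <name>mapred.job.tracker</name>
        <value>master-node:9001</value>
      </property>
      <!-- Replaces the default FIFO scheduler with the fair scheduler -->
      <property>
        <name>mapred.jobtracker.taskScheduler</name>
        <value>org.apache.hadoop.mapred.FairScheduler</value>
      </property>
    </configuration>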

How it works...

When you follow the preceding steps, Hadoop loads the new scheduler settings when it starts. The fair scheduler shares resources equally between users unless it has been configured otherwise.

The fair scheduler can be configured in two ways. Several parameters of the form mapred.fairscheduler.* can be set in HADOOP_HOME/conf/mapred-site.xml, and additional parameters, such as per-pool allocations, can be configured via HADOOP_HOME/conf/fair-scheduler.xml. More details about the fair scheduler can be found in HADOOP_HOME/docs/fair_scheduler.html.
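As an illustration, the following is a minimal fair-scheduler.xml sketch based on the Hadoop 1.x fair scheduler allocation file format; the pool name, user name, and numbers are hypothetical and should be adapted to your cluster:

    <?xml version="1.0"?>
    <allocations>
      <!-- A hypothetical "research" pool with guaranteed minimum map and
           reduce slots, and twice the default weight for its fair share -->
      <pool name="research">
        <minMaps>10</minMaps>
        <minReduces>5</minReduces>
        <weight>2.0</weight>
      </pool>
      <!-- Limit a hypothetical user to five concurrently running jobs -->
      <user name="alice">
        <maxRunningJobs>5</maxRunningJobs>
      </user>
      <!-- Default limit on concurrently running jobs for all other users -->
      <userMaxJobsDefault>10</userMaxJobsDefault>
    </allocations>

By default, the fair scheduler places each job in a pool named after the submitting user; the mapred.fairscheduler.poolnameproperty parameter in mapred-site.xml can be used to choose a different job property as the pool name.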

There's more...

Hadoop also includes another scheduler called the capacity scheduler, which provides more fine-grained control than the fair scheduler. More details about the capacity scheduler can be found in HADOOP_HOME/docs/capacity_scheduler.html.
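To give a rough idea of what switching to the capacity scheduler involves, the following is a sketch based on the Hadoop 1.x configuration names; the queue names and capacity percentages are hypothetical:

    <!-- In HADOOP_HOME/conf/mapred-site.xml: plug in the capacity scheduler
         and declare the job queues it will manage -->
    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
    </property>
    <property>
      <name>mapred.queue.names</name>
      <value>default,production</value>
    </property>

    <!-- In HADOOP_HOME/conf/capacity-scheduler.xml: give each queue a
         guaranteed percentage of the cluster's capacity -->
    <property>
      <name>mapred.capacity-scheduler.queue.default.capacity</name>
      <value>30</value>
    </property>
    <property>
      <name>mapred.capacity-scheduler.queue.production.capacity</name>
      <value>70</value>
    </property>

A job can then be submitted to a particular queue by setting mapred.job.queue.name, for example with -Dmapred.job.queue.name=production on the command line.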
