Tuesday 26 March 2013

GenericOptionsParser, Tool, and ToolRunner for running Hadoop Job

Hadoop comes with a few helper classes for making it easier to run jobs from the command line. GenericOptionsParser is a class that interprets common Hadoop command-line options and sets them on a Configuration object for your application to use as desired. You don’t usually use GenericOptionsParser directly, as it’s more convenient to implement the Tool interface and run your application with the ToolRunner, which uses GenericOptionsParser internally:
public interface Tool extends Configurable {
int run(String [] args) throws Exception;
}

Below example shows a very simple implementation of Tool, for running the Hadoop Map Reduce Job.
public class WordCountConfigured extends Configured implements Tool {
@Override
public int run(String[] args) throws Exception {
Configuration conf = getConf();

return 0;
}
}
public static void main(String[] args) throws Exception {
int exitCode = ToolRunner.run(new WordCountConfigured(), args);
System.exit(exitCode);
}

We make WordCountConfigured a subclass of Configured, which is an implementation of the Configurable interface. All implementations of Tool need to implement Configurable (since Tool extends it), and subclassing Configured is often the easiest way to achieve this. The run() method obtains the Configuration using Configurable’s getConf() method, and then iterates over it, printing each property to standard output.

WordCountConfigured’s main() method does not invoke its own run() method directly. Instead, we call ToolRunner’s static run() method, which takes care of creating a Configuration object for the Tool, before calling its run() method. ToolRunner also uses a GenericOptionsParser to pick up any standard options specified on the command line, and set them on the Configuration instance. We can see the effect of picking up the properties specified in conf/hadoop-localhost.xml by running the following command:
hadoop WordCountConfigured -conf conf/hadoop-localhost.xml -D mapred.job.tracker=localhost:10011 -D mapred.reduce.tasks=n

Options specified with -D take priority over properties from the configuration files. This is very useful: you can put defaults into configuration files, and then override them with the -D option as needed. A common example of this is setting the number of reducers for a MapReduce job via -D mapred.reduce.tasks=n. This will override the number of reducers set on the cluster, or if set in any client-side configuration files. The other options that GenericOptionsParser and ToolRunner support are listed in Table.

GenericOptionsParser and ToolRunner option Description



































Property>Description
-D property=valueSets the given Hadoop configuration property to the given value. Overrides any default or site properties in the configuration, and any properties set via the -conf option.
-conf filename ...Adds the given files to the list of resources in the configuration. This is a convenient way to set site properties, or to set a number of properties at once.
-fs uriSets the default filesystem to the given URI. Shortcut for -D fs.default.name=uri
-jt host:portSets the jobtracker to the given host and port. Shortcut for -D mapred.job.tracker=host:port
-files file1,file2,...Copies the specified files from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by the jobtracker (usually HDFS) and makes them available to MapReduce programs in the task’s working directory.
-archives archive1,archive2,...Copies the specified archives from the local filesystem (or any filesystem if a scheme is
specified) to the shared filesystem used by the jobtracker (usually HDFS), unarchives them, and makes them available to MapReduce programs in the task’s working directory.
-libjars jar1,jar2,...Copies the specified JAR files from the local filesystem (or any filesystem if a scheme is specified) to the shared filesystem used by the jobtracker (usually HDFS), and adds them to the MapReduce task’s classpath. This option is a useful way of shipping JAR files that a job is dependent on

1 comment: