Hadoop 0.23.0 was released November 11, 2011. Being the future of the Hadoop platform, it's worth checking out even though it is an alpha release.

Note: Many of the instructions in this article came from trial and error, and there are lots of alternative (and possibly better ways) to configure the systems. Please feel free to suggest improvements in the comments. Also, all commands were only tested on Mac OS X.

Download

To get started, download the hadoop-0.23.0.tar.gz file from one of the mirrors here: http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-0.23.0.

Once downloaded, decompress the file. The bundled documentation is available in share/doc/hadoop/index.html.

Notes for Users of Previous Versions of Hadoop

The directory layout of the Hadoop distribution changed in 0.23.0 and 0.20.204 compared to previous versions. In particular, there are now sbin, libexec, and etc directories in the root of the distribution tarball.

Scripts and Executables

In Hadoop 0.23.0, a number of commonly used scripts in the bin directory have been removed or drastically changed. Specifically, the following scripts were removed (vs. 0.20.205.0):

  • hadoop-config.sh
  • hadoop-daemon(s).sh
  • start-balancer.sh and stop-balancer.sh
  • start-dfs.sh and stop-dfs.sh
  • start-jobhistoryserver.sh and stop-jobhistoryserver.sh
  • start-mapred.sh and stop-mapred.sh
  • task-controller

The start/stop mapred-related scripts have been replaced by "MapReduce 2.0" scripts called yarn-*.  The start-all.sh and stop-all.sh scripts no longer start or stop HDFS; they now start and stop the yarn daemons.  Finally, bin/hadoop has been deprecated; users should use bin/hdfs and bin/mapred instead.

Hadoop distributions now also include scripts in a sbin directory. The scripts include start-all.sh, start-dfs.sh, and start-balancer.sh (and the stop versions of those scripts).

Configuration Directories and Files

The conf directory that comes with Hadoop is no longer the default configuration directory.  Rather, Hadoop looks in etc/hadoop for configuration files.  The libexec directory contains the scripts hadoop-config.sh and hdfs-config.sh, which control where Hadoop pulls configuration information from, and it's possible to override the location of the configuration directory in the following ways:

  • hdfs-config.sh calls hadoop-config.sh, which it looks for in $HADOOP_COMMON_HOME/libexec and $HADOOP_HOME/libexec
  • hadoop-config.sh accepts a --config option for specifying a config directory, or the directory can be specified using $HADOOP_CONF_DIR (see the example after this list).
    • This script also accepts a --hosts parameter to specify the hosts/slaves file.
    • This script uses variables typically set in hadoop-env.sh, such as $JAVA_HOME, $HADOOP_HEAPSIZE, $HADOOP_CLASSPATH, $HADOOP_LOG_DIR, $HADOOP_LOGFILE, and more.  See the file for a full list of variables.
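For example, either of the following should point bin/hdfs at an alternate configuration directory (the path here is just a placeholder):

# via the environment:
$ export HADOOP_CONF_DIR=/path/to/alt/conf
$ bin/hdfs dfs -ls /

# or equivalently, via the --config option:
$ bin/hdfs --config /path/to/alt/conf dfs -ls /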

Configure HDFS

To start HDFS, we will use the scripts in sbin, which pull configuration from etc/hadoop by default. We'll be putting configuration files in that directory, starting with core-site.xml.  In core-site.xml, we must specify a fs.default.name:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
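(Hadoop 0.23 logs a deprecation warning for fs.default.name; the newer name for this key is fs.defaultFS, but the old one still works.)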

Next, we want to override the locations where the NameNode and DataNode store data, so that the data lives in a non-transient location. The two relevant parameters, dfs.namenode.name.dir and dfs.datanode.data.dir, go in etc/hadoop/hdfs-site.xml.  We also set dfs.replication to 1, since we're running a single DataNode.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/Users/joecrow/Code/hadoop-0.23.0/data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/Users/joecrow/Code/hadoop-0.23.0/data/hdfs/datanode</value>
  </property>
</configuration>
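It doesn't hurt to create these directories up front (the paths below match the values above; adjust them for your machine):

$ mkdir -p data/hdfs/namenode data/hdfs/datanode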

Notes:

  • as of HDFS-456 and HDFS-873, the namenode and datanode dirs should be specified with a full URI.
  • by default, hadoop starts up with 1000 megabytes of RAM allocated to each daemon. You can change this by adding a hadoop-env.sh to etc/hadoop (a minimal sketch follows this list). There's a template that can be added with: $ cp ./share/hadoop/common/templates/conf/hadoop-env.sh etc/hadoop
    • The template sets up a bogus value for HADOOP_LOG_DIR.
    • HADOOP_PID_DIR defaults to /tmp, so you might want to change that variable, too.
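For reference, a minimal hadoop-env.sh sketch for a single-machine Mac OS X setup might look like this (the values are illustrative, not recommendations):

# etc/hadoop/hadoop-env.sh
export JAVA_HOME=$(/usr/libexec/java_home)                    # Mac OS X helper; point at a JDK directly on other platforms
export HADOOP_HEAPSIZE=512                                    # megabytes per daemon (default is 1000)
export HADOOP_LOG_DIR=/Users/joecrow/Code/hadoop-0.23.0/logs  # replace the template's bogus value
export HADOOP_PID_DIR=/Users/joecrow/Code/hadoop-0.23.0/pids  # pid files default to /tmp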

Start HDFS

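If this is a brand-new install, format the NameNode first; the rest of this walkthrough assumes that has been done. Formatting initializes the dfs.namenode.name.dir configured above:

$ bin/hdfs namenode -format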
Start the NameNode:

sbin/hadoop-daemon.sh start namenode

Start a DataNode:

sbin/hadoop-daemon.sh start datanode

(Optionally) start the SecondaryNameNode; it isn't required for local development, but it definitely is for production:

sbin/hadoop-daemon.sh start secondarynamenode

To confirm that the processes are running, issue jps and look for lines for NameNode, DataNode and SecondaryNameNode:

$ jps
55036 Jps
55000 SecondaryNameNode
54807 NameNode
54928 DataNode
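For a further smoke test, issue a couple of filesystem commands against the fs.default.name configured earlier:

$ bin/hdfs dfs -mkdir /tmp
$ bin/hdfs dfs -ls /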

Notes:

  • the hadoop daemons log to the "logs" dir. Stdout goes to a file ending in ".out", and the log file ends in ".log". If a daemon doesn't start up, check the files that include the daemon name (e.g. logs/hadoop-joecrow-datanode-jcmba.local.out).
  • the commands might say "Unable to load realm info from SCDynamicStore" (at least on Mac OS X). This appears to be harmless output, see HADOOP-7489 for details.

Stopping HDFS

Eventually you'll want to stop HDFS. Here are the commands to execute, in the given order:

sbin/hadoop-daemon.sh stop secondarynamenode
sbin/hadoop-daemon.sh stop datanode
sbin/hadoop-daemon.sh stop namenode
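Alternatively, sbin/stop-dfs.sh (from the script list above) should stop all three daemons in one shot:

$ sbin/stop-dfs.sh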

Use jps to confirm that the daemons are no longer running.

Running an example MR Job

This section just gives the commands for configuring and starting the Resource Manager, Node Manager, and Job History Server; it doesn't explain them in detail. Please refer to the References and Links section for more.

The Yarn daemons use the conf directory in the distribution for configuration by default. Since we used etc/hadoop as the configuration directory for HDFS, it would be nice to use that as the config directory for mapreduce, too.  As a result, we update the following files:

In conf/yarn-env.sh, add the following lines under the definition of YARN_CONF_DIR:

export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-$YARN_HOME/etc/hadoop}"
export HADOOP_COMMON_HOME="${HADOOP_COMMON_HOME:-$YARN_HOME}"
export HADOOP_HDFS_HOME="${HADOOP_HDFS_HOME:-$YARN_HOME}"

In conf/yarn-site.xml, update the contents to:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

Set the contents of etc/hadoop/mapred-site.xml to:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Now, start up the yarn daemons:

 $ bin/yarn-daemon.sh start resourcemanager
 $ bin/yarn-daemon.sh start nodemanager
 $ bin/yarn-daemon.sh start historyserver
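As with HDFS, use jps to confirm the daemons came up; look for ResourceManager, NodeManager, and JobHistoryServer lines in the output:

$ jps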

A bunch of example jobs are available via the hadoop-mapreduce-examples jar. For example, to run the program that estimates pi:

$ bin/hadoop jar hadoop-mapreduce-examples-0.23.0.jar pi \
-Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory \
-libjars modules/hadoop-mapreduce-client-jobclient-0.23.0.jar 16 10000

The command produces a lot of output, but towards the end you'll see:

Job Finished in 67.705 seconds
Estimated value of Pi is 3.14127500000000000000

Notes

  • By default, the resource manager uses a number of IPC ports, including 8025, 8030, 8040, and 8141.  The web UI is exposed on port 8088.
  • By default, the JobHistoryServer uses port 19888 for a web UI and port 10020 for IPC.
  • By default, the node manager uses port 9999 for a web UI and port 4344 for IPC. Port 8080 was also open, as was a seemingly random high port (65176 in one run); I haven't tracked down what those are for.
  • The resource manager has a "proxy" url that it uses to link-through to the JobHistoryServer UI. e.g.:
    $ curl -I http://0.0.0.0:8088/proxy/application_1322622103371_0001/jobhistory/job/job_1322622103371_11
    HTTP/1.1 302 Found
    Content-Type: text/plain; charset=utf-8
    Location: http://192.168.1.12:19888/jobhistory/job/job_1322622103371_11/jobhistory/job/job_1322622103371_11
    Content-Length: 0
    Server: Jetty(6.1.26)

Conclusion

While Hadoop 0.23 is an alpha release, getting it up and running in pseudo-distributed mode isn't too difficult.  The new architecture will take some getting used to for users of previous Hadoop releases, but it's an exciting step forward.

Observations and Notes

There are a few bugs and gotchas, discovered or verified along the way, that are worth keeping an eye on as you work through these steps.  These include:

  • HADOOP-7837 log4j isn't setup correctly when using sbin/start-dfs.sh
  • HDFS-2574 Deprecated parameters appear in the hdfs-site.xml templates.
  • HDFS-2595 misleading message when fs.default.name not set and running sbin/start-dfs.sh
  • HDFS-2553 BlockPoolScanner spinning in a loop (causes DataNode to peg one cpu to 100%).
  • HDFS-2608 NameNode webui references missing hadoop.css

References and Links