Getting Started with Apache Hadoop 0.23.0

Hadoop 0.23.0 was released November 11, 2011. Being the future of the Hadoop platform, it’s worth checking out even though it is an alpha release.

Note: Many of the instructions in this article came from trial and error, and there are lots of alternative (and possibly better) ways to configure the system. Please feel free to suggest improvements in the comments. Also, all commands were only tested on Mac OS X.

Download

To get started, download the hadoop-0.23.0.tar.gz file from one of the mirrors here: http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-0.23.0.

Once downloaded, decompress the file. The bundled documentation is available in share/doc/hadoop/index.html.
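
For example, from the directory where you downloaded the tarball:

tar xzf hadoop-0.23.0.tar.gz
cd hadoop-0.23.0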

Notes for Users of Previous Versions of Hadoop

The directory layout of the Hadoop distribution changed in 0.23.0 and 0.20.204 compared to previous versions. In particular, there are now sbin, libexec, and etc directories in the root of the distribution tarball.

scripts and executables

In Hadoop 0.23.0, a number of commonly used scripts in the bin directory have been removed or changed drastically. Specifically, the following scripts were removed (compared to 0.20.205.0):

  • hadoop-config.sh
  • hadoop-daemon(s).sh
  • start-balancer.sh and stop-balancer.sh
  • start-dfs.sh and stop-dfs.sh
  • start-jobhistoryserver.sh and stop-jobhistoryserver.sh
  • start-mapred.sh and stop-mapred.sh
  • task-controller

The start/stop mapred-related scripts have been replaced by “MapReduce 2.0″ scripts named yarn-*.  The start-all.sh and stop-all.sh scripts no longer start or stop HDFS; instead, they start and stop the YARN daemons.  Finally, bin/hadoop has been deprecated. Instead, users should use bin/hdfs and bin/mapred.

The Hadoop distribution now also includes scripts in an sbin directory. These include start-all.sh, start-dfs.sh, and start-balancer.sh (and the corresponding stop scripts).

configuration directories and files

The conf directory that comes with Hadoop is no longer the default configuration directory.  Rather, Hadoop looks in etc/hadoop for configuration files.  The libexec directory contains the scripts hadoop-config.sh and hdfs-config.sh, which control where Hadoop pulls configuration information from, and it’s possible to override the location of the configuration directory in the following ways:

  • hdfs-config.sh calls hadoop-config.sh in $HADOOP_COMMON_HOME/libexec and $HADOOP_HOME/libexec
  • hadoop-config.sh accepts a --config option for specifying a config directory, or the directory can be specified using $HADOOP_CONF_DIR (see the example after this list).
    • The script also accepts a --hosts parameter to specify the hosts/slaves file.
    • This script uses variables typically set in hadoop-env.sh, such as $JAVA_HOME, $HADOOP_HEAPSIZE, $HADOOP_CLASSPATH, $HADOOP_LOG_DIR, $HADOOP_LOGFILE and more.  See the file for a full list of variables.
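
For example, to use a configuration directory other than etc/hadoop, you can export $HADOOP_CONF_DIR (the /path/to/my-conf directory here is just a placeholder):

export HADOOP_CONF_DIR=/path/to/my-conf

or pass --config to an individual script, e.g. the hadoop-daemon.sh script used below:

sbin/hadoop-daemon.sh --config /path/to/my-conf start namenode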

Configure HDFS

To start HDFS, we will use sbin/start-dfs.sh, which pulls configuration from etc/hadoop by default. We’ll be putting configuration files in that directory, starting with core-site.xml.  In core-site.xml, we must specify fs.default.name:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

Next, we want to override the locations where the NameNode and DataNode store data so that they’re in a non-transient location. The two relevant parameters are dfs.namenode.name.dir and dfs.datanode.data.dir, and they go in etc/hadoop/hdfs-site.xml.  We also set replication to 1, since we’re running a single DataNode.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/Users/joecrow/Code/hadoop-0.23.0/data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/Users/joecrow/Code/hadoop-0.23.0/data/hdfs/datanode</value>
  </property>
</configuration>
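
The name and data directories above are specific to my machine, so adjust them for yours. Hadoop will generally create them on format/startup, but I pre-create them so that path typos show up early (run from the root of the extracted hadoop-0.23.0 directory):

mkdir -p data/hdfs/namenode data/hdfs/datanode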

Notes:

  • as of HDFS-456 and HDFS-873, the namenode and datanode dirs should be specified with a full URI.
  • by default, hadoop starts up with 1000 megabytes of RAM allocated to each daemon. You can change this by adding a hadoop-env.sh to etc/hadoop (a sketch of typical overrides follows this list). There’s a template that can be copied into place with: $ cp ./share/hadoop/common/templates/conf/hadoop-env.sh etc/hadoop
    • The template sets up a bogus value for HADOOP_LOG_DIR, so adjust it.
    • HADOOP_PID_DIR defaults to /tmp, so you might want to change that variable, too.
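
Here’s that sketch of etc/hadoop/hadoop-env.sh; the JAVA_HOME value, heap size, and directories are examples only, so adjust them for your setup:

export JAVA_HOME=$(/usr/libexec/java_home)
export HADOOP_HEAPSIZE=512
export HADOOP_LOG_DIR=/Users/joecrow/Code/hadoop-0.23.0/logs
export HADOOP_PID_DIR=/Users/joecrow/Code/hadoop-0.23.0/pids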

Start HDFS
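
As a couple of commenters point out below, the first time you bring up HDFS you need to format the NameNode before starting it (bin/hadoop namenode -format also works, but bin/hadoop is deprecated in favor of bin/hdfs):

bin/hdfs namenode -format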

Start the NameNode:

sbin/hadoop-daemon.sh start namenode

Start a DataNode:

sbin/hadoop-daemon.sh start datanode

Optionally, start the SecondaryNameNode (this is not required for local development, but it definitely is for production):

sbin/hadoop-daemon.sh start secondarynamenode

To confirm that the processes are running, issue jps and look for lines for NameNode, DataNode and SecondaryNameNode:

$ jps
55036 Jps
55000 SecondaryNameNode
54807 NameNode
54928 DataNode

Notes:

  • the hadoop daemons log to the “logs” dir.  Stdout goes to a file ending in “.out” and the logfile ends in “.log”. If a daemon doesn’t start up, check the files that include that daemon’s name (e.g. logs/hadoop-joecrow-datanode-jcmba.local.out); see the example after this list.
  • the commands might say “Unable to load realm info from SCDynamicStore” (at least on Mac OS X). This appears to be harmless output, see HADOOP-7489 for details.
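
For example, if the DataNode doesn’t come up, tailing its log usually shows why (the exact file name depends on your username and hostname):

tail -n 50 logs/hadoop-joecrow-datanode-jcmba.local.log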

Stopping HDFS

Eventually you’ll want to stop HDFS. Here are the commands to execute, in the given order:

sbin/hadoop-daemon.sh stop secondarynamenode
sbin/hadoop-daemon.sh stop datanode
sbin/hadoop-daemon.sh stop namenode

Use jps to confirm that the daemons are no longer running.

Running an example MR Job

This section just gives the commands for configuring and starting the Resource Manager, Node Manager, and Job History Server, but it doesn’t explain the details of those. Please refer to the References and Links section for more details.

The YARN daemons use the conf directory in the distribution for configuration by default. Since we used etc/hadoop as the configuration directory for HDFS, it would be nice to use it as the config directory for MapReduce, too.  To that end, we update the following files:

In conf/yarn-env.sh, add the following lines under the definition of YARN_CONF_DIR:

export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-$YARN_HOME/etc/hadoop}"
export HADOOP_COMMON_HOME="${HADOOP_COMMON_HOME:-$YARN_HOME}"
export HADOOP_HDFS_HOME="${HADOOP_HDFS_HOME:-$YARN_HOME}"

In conf/yarn-site.xml, update the contents to:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce.shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>

Set the contents of etc/hadoop/mapred-site.xml to:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Now, start up the yarn daemons:

$ bin/yarn-daemon.sh start resourcemanager
$ bin/yarn-daemon.sh start nodemanager
$ bin/yarn-daemon.sh start historyserver
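
As with HDFS, jps gives a quick sanity check; in addition to the HDFS processes from before, you should now see lines like these (the PIDs are just examples):

$ jps
...
56001 ResourceManager
56044 NodeManager
56088 JobHistoryServer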

A bunch of example jobs are available via the hadoop-mapreduce-examples jar. For example, to run the program that estimates pi:

$ bin/hadoop jar hadoop-mapreduce-examples-0.23.0.jar pi \
-Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory \
-libjars modules/hadoop-mapreduce-client-jobclient-0.23.0.jar 16 10000

The command produces a lot of output, but toward the end you’ll see:

Job Finished in 67.705 seconds
Estimated value of Pi is 3.14127500000000000000

Notes

  • By default, the resource manager uses a number of IPC ports, including 8025, 8030, 8040, and 8141.  The web UI is exposed on port 8088.
  • By default, the JobHistoryServer uses port 19888 for a web UI and port 10020 for IPC.
  • By default, the node manager uses port 9999 for a web UI and port 4344 for IPC. It also listened on port 8080 and on a seemingly random high port (65176 in my run); I haven’t tracked down what those are for.
  • The resource manager has a “proxy” URL that it uses to link through to the JobHistoryServer UI, e.g.:
    $ curl -I http://0.0.0.0:8088/proxy/application_1322622103371_0001/jobhistory/job/job_1322622103371_1_1
    HTTP/1.1 302 Found
    Content-Type: text/plain; charset=utf-8
    Location: http://192.168.1.12:19888/jobhistory/job/job_1322622103371_1_1/jobhistory/job/job_1322622103371_1_1
    Content-Length: 0
    Server: Jetty(6.1.26)

Conclusion

While Hadoop 0.23 is an alpha release, getting it up and running in pseudo-distributed mode isn’t too difficult.  The new architecture will take some getting used to for users of previous releases of Hadoop, but it’s an exciting step forward.

Observations and Notes

There are a few bugs and gotchas that I discovered or verified and that you should keep an eye on as you go through these steps.  These include:

  • HADOOP-7837 log4j isn’t set up correctly when using sbin/start-dfs.sh
  • HDFS-2574 Deprecated parameters appear in the hdfs-site.xml templates.
  • HDFS-2595 misleading message when fs.default.name is not set and sbin/start-dfs.sh is run
  • HDFS-2553 BlockPoolScanner spinning in a loop (causes the DataNode to peg one CPU at 100%).
  • HDFS-2608 NameNode web UI references a missing hadoop.css

References and Links


19 Responses to Getting Started with Apache Hadoop 0.23.0

  1. Jie Li says:

    Good job! This is the best instruction so far!

    One more step: before starting the namenode, we need to format it by
    “bin/hadoop namenode -format”

    The other steps are all easy to follow. Thanks a lot!

  2. Nourl says:

    Thanks for your hard work and good post !
    I have run my hadoop program under guide of your blog.
    Thank you again!

  3. MRK says:

    HI,

    In the hadoop 0.23.0 release there is no conf/masters file, which is used to specify the secondarynamenode host address. Could you please let me know how the secondary namenode starts and where it will start. In this tutorial I have seen three commands to start HDFS.
    sbin/hadoop-daemon.sh stop secondarynamenode
    sbin/hadoop-daemon.sh stop datanode
    sbin/hadoop-daemon.sh stop namenode

    Datanode starts on the nodes mentioned in the conf/slaves file. let me know where secondary name node starts and how to configure the same.

  4. Pingback: Quora

  5. Praveen says:

    Hadoop 0.23 requires protoc 2.4.1+, Ubuntu 11.10 has 2.4.0. So the protoc source has to be downloaded, built and installed.

    • joecrow says:

      Praveen, I wasn’t compiling the source at all in this example. If you download the distro, you should be able to run as is.

  6. srikanth says:

    Hi,
    Nice notes.. but I am not able to start the resource manager and node manager. Did you face any problem like this?

  7. Pingback: Mongo-Hadoop Streaming – Bukan Tutorial | robee di sini!

  8. Krish says:

    Joe, If it hadn’t been for this blog post, I wouldn’t have CDH4B2 running, thanks for the great job. I am trying to run the included PI sample and am running into this weird issue, it complains about the output directory not existing. I thought hadoop created that automatically.

    hadoop jar /Users/hadoop/hadoop-0.23.1-cdh4.0.0b2/share/hadoop/mapreduce/hadoop-mapreduce-examples-0.23.1-cdh4.0.0b2.jar pi -Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory -libjars /Users/hadoop/hadoop-0.23.1-cdh4.0.0b2/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-0.23.1-cdh4.0.0b2.jar 16 10000

    12/05/29 16:35:49 INFO mapreduce.Job: map 0% reduce 0%
    12/05/29 16:35:49 INFO mapreduce.Job: Job job_1338334106848_0004 failed with state FAILED due to: Application application_1338334106848_0004 failed 1 times due to AM Container for appattempt_1338334106848_0004_000001 exited with exitCode: 127 due to:
    .Failing this attempt.. Failing the application.
    12/05/29 16:35:49 INFO mapreduce.Job: Counters: 0
    Job Finished in 3.599 seconds
    java.io.FileNotFoundException: File does not exist: hdfs://localhost:9000/user/hadoop/QuasiMonteCarlo_TMP_3_141592654/out/reduce-out

  9. Pingback: Hadoop – Installation (on Ubuntu) | Daniel Adeniji's – Learning in the Open

  10. rashmi says:

    Hi,

    For hadoop-2.0.0 installation on two linux machines, what should be values of fs.defaultFS and dfs.name.dir and dfs.data.dir properties on both name nodes????

    one machine hostname is rsi-nod-nsn1 and another one is rsi-nod-nsn2…

    i want to make both federated namenodes.. and both should be used as datanodes too..

    what should be configuration changes for the same? i am not finding masters, mapred-site.xml, and hadoop-env.sh files in hadoopHome/etc/hadoop folder… how do i make changes for these files?

  11. dheeren@yahoo.com says:

    To start history server use
    $ sbin/mr-jobhistory-daemon.sh start

    With cdh401 I was unable to start history server using
    $ sbin/yarn-daemon.sh start historyserver

    starting historyserver, logging to /tmp/yarn-hhhhuser-historyserver-rd1-nn1-1-sfm.ops.sfdc.net.out
    Exception in thread “main” java.lang.NoClassDefFoundError: historyserver
    Caused by: java.lang.ClassNotFoundException: historyserver
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    Could not find the main class: historyserver. Program will exit.

  12. Hardik says:

    I get the same FileNotFoundException running “pi” example, anyone with some idea pleas help

    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-0.23.1.jar pi -Dmapreduce.clientfactory.class.name=org.apache.hadoop.mapred.YarnClientFactory -libjars share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-0.23.1.jar 16 10000
    12/09/06 19:56:10 WARN conf.Configuration: mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
    Number of Maps = 16
    Samples per Map = 10000
    Wrote input for Map #0
    Wrote input for Map #1
    Wrote input for Map #2
    Wrote input for Map #3
    Wrote input for Map #4
    Wrote input for Map #5
    Wrote input for Map #6
    Wrote input for Map #7
    Wrote input for Map #8
    Wrote input for Map #9
    Wrote input for Map #10
    Wrote input for Map #11
    Wrote input for Map #12
    Wrote input for Map #13
    Wrote input for Map #14
    Wrote input for Map #15
    Starting Job
    12/09/06 19:56:22 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
    12/09/06 19:56:22 INFO input.FileInputFormat: Total input paths to process : 16
    12/09/06 19:56:23 INFO mapreduce.JobSubmitter: number of splits:16
    12/09/06 19:56:25 INFO mapred.ResourceMgrDelegate: Submitted application application_1346972652940_0002 to ResourceManager at /0.0.0.0:8040
    12/09/06 19:56:27 INFO mapreduce.Job: The url to track the job: http://10.215.12.11:8088/proxy/application_1346972652940_0002/
    12/09/06 19:56:27 INFO mapreduce.Job: Running job: job_1346972652940_0002
    12/09/06 19:56:55 INFO mapreduce.Job: Job job_1346972652940_0002 running in uber mode : false
    12/09/06 19:56:55 INFO mapreduce.Job: map 0% reduce 0%
    12/09/06 19:56:56 INFO mapreduce.Job: Job job_1346972652940_0002 failed with state FAILED due to: Application application_1346972652940_0002 failed 1 times due to AM Container for appattempt_1346972652940_0002_000001 exited with exitCode: 1 due to:
    .Failing this attempt.. Failing the application.
    12/09/06 19:56:56 INFO mapreduce.Job: Counters: 0
    Job Finished in 35.048 seconds
    java.io.FileNotFoundException: File does not exist: hdfs://localhost:9000/user/hardikpandya/QuasiMonteCarlo_TMP_3_141592654/out/reduce-out
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:729)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1685)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1709)
    at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:314)
    at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:351)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
    at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:360)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:72)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:144)
    at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:68)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:200)

  13. Keith Wiley says:

    Gah! Even though a comment pointed out that you forgot to mention formatting the namenode, and even though you replied that you would update the article…the omission is still there. I spent quite a while trying to figure out why the datanode would start but not the namenode (I figured it out by investigating the namenode error log in logs/).

    You should really put that update in the article. :-)

    Cheers!

  14. kasi says:

    Hi all,
    I’m using cygwin in Windows,
    When i type the command
    user@user-PC ~/hadoop-0.23.7
    $ bin/hadoop namenode -format
    I’m getting the following error, can any one please help me in resolving the issue.
    Thanks in Advance
    Kasi

    cygpath: can’t convert empty path
    DEPRECATED: Use of this script to execute hdfs command is deprecated.
    Instead use the hdfs command for it.

    which: no hdfs in (./D:\cygwin\home\user\hadoop-0.23.7/bin)
    dirname: missing operand
    Try `dirname --help' for more information.
    D:\cygwin\home\user\hadoop-0.23.7/bin/hdfs: line 24: /home/user/hadoop-0.23.7/../libexec/hdfs-config.sh: No such file or directory
    cygpath: can’t convert empty path
    D:\cygwin\home\user\hadoop-0.23.7/bin/hdfs: line 142: exec: : not found

  15. Deepak says:

    While formatting name node i am getting the following error:

    DEPRECATED: Use of this script to execute hdfs command is deprecated.
    Instead use the hdfs command for it.

    Error: Could not find or load main class org.apache.hadoop.hdfs.server.namenode.NameNode

    Please help me on this

    Deepak

  16. Harneet says:

    While configuring hadoop in cygwin on windows , when i run the command
    /bin/hadoop namenode
    it gives me the following error.
    /usr/local/hadoop-0.20.0/bin/../conf/hadoop-env.sh: line2: $’\r’ :command not found
    /usr/local/hadoop-0.20.0/bin/../conf/hadoop-env.sh: line7: $’\r’ :command not found
    /usr/local/hadoop-0.20.0/bin/../conf/hadoop-env.sh: line10: $’\r’ :command not found
    /usr/local/hadoop-0.20.0/bin/../conf/hadoop-env.sh: line13: $’\r’ :command not found
    /usr/local/hadoop-0.20.0/bin/../conf/hadoop-env.sh: line16: $’\r’ :command not found
    /usr/local/hadoop-0.20.0/bin/../conf/hadoop-env.sh: line19: $’\r’ :command not found
    /usr/local/hadoop-0.20.0/bin/../conf/hadoop-env.sh: line29: $’\r’ :command not found
    /usr/local/hadoop-0.20.0/bin/../conf/hadoop-env.sh: line32: $’\r’ :command not found
    bin/hadoop: line 258: /cygdrive/C/Program: No such file or directory
    /bin/java: No such file or directoryogram Files/Java/jdk1.7.0_03
    /bin/java: cannot execute: No such file or directorys/Java/jdk1.7.0_03

    Please help me to solve this error.
