Tech Tots: Setting up Hadoop in clustered mode in Ubuntu

Thursday, December 8, 2011

Overview

This entry details the steps I took to set up Hadoop in clustered mode on Ubuntu 11.10, using Hadoop version 0.20.205.0. The Hadoop cluster consists of 3 servers/nodes:

node616 ==> namenode, tasktracker, datanode, jobtracker, secondarynamenode
node617 ==> datanode, tasktracker
node618 ==> datanode, tasktracker

In an actual production setup, the namenode shouldn't also act as datanode, jobtracker and secondarynamenode. But for the purpose of this setup, things will be simplified :)

Server setup

Ensure that the /etc/hosts file on every server is updated properly. All my servers have the following entries:

192.168.1.1    node616
192.168.1.2    node617
192.168.1.3    node618

This is to ensure that the configuration files stay the same in all servers.
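A quick way to confirm that a hosts file really lists every node is a small check like the one below. It is only a sketch: check_hosts is a name I made up for illustration, and it assumes the address-then-names layout shown above.

```shell
# check_hosts is a hypothetical helper, not part of Hadoop or Ubuntu.
# It reports every host name that has no entry in the given hosts file.
# Usage: check_hosts FILE HOST...
check_hosts() {
    hosts_file=$1
    shift
    missing=0
    for host in "$@"; do
        # Skip comment lines; look for the name in any column after the address.
        if ! awk -v h="$host" '
            /^[ \t]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == h) found = 1 }
            END { exit !found }
        ' "$hosts_file"; then
            echo "missing: $host"
            missing=1
        fi
    done
    return $missing
}
```

On each server it could be run as: check_hosts /etc/hosts node616 node617 node618. It prints nothing and returns 0 when all three names are present.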

The following directories must be created beforehand to store Hadoop related data:

/opt/hdfs/cache ==> HDFS cache storage
/opt/hdfs/data ==> HDFS data node storage
/opt/hdfs/name ==> HDFS name node storage
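Creating these by hand on three servers gets repetitive, so here is a minimal sketch of a helper. The function name create_hdfs_dirs and the optional owner argument are my own assumptions, not anything Hadoop provides.

```shell
# create_hdfs_dirs is a hypothetical helper for illustration only.
# It creates the cache, data and name directories under a base directory
# (default /opt/hdfs) and optionally hands them to the user running Hadoop.
# Usage: create_hdfs_dirs [BASE_DIR] [OWNER]
create_hdfs_dirs() {
    base=${1:-/opt/hdfs}
    owner=${2:-}
    for sub in cache data name; do
        mkdir -p "$base/$sub" || return 1
    done
    # Changing ownership normally needs root, so it only happens on request.
    if [ -n "$owner" ]; then
        chown -R "$owner" "$base" || return 1
    fi
}
```

Run as root, create_hdfs_dirs /opt/hdfs hadoop would create the three directories and give them to a user called hadoop. Substitute whichever user actually runs your Hadoop processes.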

SSH setup

Before we proceed to the actual setup, the user running Hadoop must be able to ssh to the servers without being prompted for a password or passphrase. Test this out by issuing the following command:

$ ssh node616

If it prompts for a password, execute the following commands:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

The public key needs to be copied to all data nodes/slaves once they're set up at a later stage.
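When that time comes, the copy can be done with ssh-copy-id, which ships with the OpenSSH client. The wrapper below is just a sketch: copy_key_to_slaves, DRY_RUN and KEY are names I invented, and the default key path matches the ssh-keygen command above.

```shell
# copy_key_to_slaves is a hypothetical wrapper around ssh-copy-id.
# KEY selects the public key (default: the DSA key generated above).
# With DRY_RUN=1 the commands are printed instead of executed.
# Usage: copy_key_to_slaves HOST...
copy_key_to_slaves() {
    key=${KEY:-$HOME/.ssh/id_dsa.pub}
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "ssh-copy-id -i $key $host"
        else
            # Prompts for the remote password once per host.
            ssh-copy-id -i "$key" "$host" || return 1
        fi
    done
}
```

For this cluster the call would be copy_key_to_slaves node617 node618, after which ssh node617 should log in without a password prompt.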

Namenode setup

Obtain the Hadoop binary distribution from the main site (http://hadoop.apache.org) and extract it to a location on the server. I've used /opt/hadoop for all installations.

The extracted directory contents should look like the output below:

node616:/opt/hadoop# ls -l
total 7144
drwxr-xr-x 2 root root    4096 2011-11-25 16:58 bin
-rw-rw-r-- 1 root root 112062 2011-10-07 14:19 build.xml
drwxr-xr-x 4 root root    4096 2011-10-07 14:24 c++
-rw-rw-r-- 1 root root 433928 2011-10-07 14:19 CHANGES.txt
drwxr-xr-x 2 root root    4096 2011-11-30 12:23 conf
drwxr-xr-x 11 root root    4096 2011-10-07 14:19 contrib
drwxr-xr-x 3 root root    4096 2011-10-07 14:20 etc
-rw-rw-r-- 1 root root    6839 2011-10-07 14:19 hadoop-ant-0.20.205.0.jar
-rw-rw-r-- 1 root root 3700955 2011-10-07 14:24 hadoop-core-0.20.205.0.jar
-rw-rw-r-- 1 root root 142465 2011-10-07 14:19 hadoop-examples-0.20.205.0.jar
-rw-rw-r-- 1 root root 2487116 2011-10-07 14:24 hadoop-test-0.20.205.0.jar
-rw-rw-r-- 1 root root 287776 2011-10-07 14:19 hadoop-tools-0.20.205.0.jar
drwxr-xr-x 3 root root    4096 2011-10-07 14:20 include
drwxr-xr-x 2 root root    4096 2011-11-22 14:28 ivy
-rw-rw-r-- 1 root root   10389 2011-10-07 14:19 ivy.xml
drwxr-xr-x 6 root root    4096 2011-11-22 14:28 lib
drwxr-xr-x 2 root root    4096 2011-11-22 14:28 libexec
-rw-rw-r-- 1 root root   13366 2011-10-07 14:19 LICENSE.txt
drwxr-xr-x 4 root root    4096 2011-12-07 12:10 logs
-rw-rw-r-- 1 root root     101 2011-10-07 14:19 NOTICE.txt
drwxr-xr-x 4 root root    4096 2011-11-29 10:36 out
-rw-rw-r-- 1 root root    1366 2011-10-07 14:19 README.txt
drwxr-xr-x 2 root root    4096 2011-11-22 14:28 sbin
drwxr-xr-x 4 root root    4096 2011-10-07 14:20 share
drwxr-xr-x 9 root root    4096 2011-10-07 14:19 webapps

Navigate to the conf directory and edit the core-site.xml file. The default file should look like the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
</configuration>

Now we'll have to add 2 properties to make this a clustered setup:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://node616:9000</value>
     </property>

     <property>
         <name>hadoop.tmp.dir</name>
         <value>/opt/hdfs/cache</value>
     </property>
</configuration>

fs.default.name ==> Sets the default file system name. Since we're setting up a clustered environment, we'll point this to the namenode's hostname and port, which in this case is the current machine.

hadoop.tmp.dir ==> A base for other temporary directories. It defaults to /tmp/hadoop-${user.name}, which gave me trouble because the Linux /tmp mount point is usually very small. The following exception was thrown when I did not explicitly set this property:

java.io.IOException: File /user/root/testfile could only be replicated to 0 nodes, instead of 1

For more properties, please consult the following URL: http://hadoop.apache.org/common/docs/current/core-default.html
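Since the same configuration files end up on every server, it helps to be able to read a value back from a file and compare. The helper below is a rough sketch rather than a real XML parser: get_property is a made-up name, and it assumes each name and value sit on their own lines, in that order, as in the file above.

```shell
# get_property is a hypothetical, line-oriented helper (not an XML parser).
# It prints the <value> that follows a given <name> in a *-site.xml file.
# Usage: get_property FILE PROPERTY_NAME
get_property() {
    awk -v name="$2" '
        # Remember that the wanted <name> line has been seen.
        index($0, "<name>" name "</name>") { want = 1; next }
        # The next <value> line belongs to it: strip the tags and print.
        want && /<value>/ {
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}
```

With the core-site.xml above, get_property /opt/hadoop/conf/core-site.xml fs.default.name should print hdfs://node616:9000.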

Next comes the hdfs-site.xml file, which we'll customize as follows:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
     <property>
         <name>dfs.name.dir</name>
         <value>/opt/hdfs/name</value>
     </property>
     <property>
         <name>dfs.data.dir</name>
         <value>/opt/hdfs/data</value>
     </property>
</configuration>

dfs.replication ==> Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time. Since we only have one node, we'll set it to 1 for the time being.

dfs.name.dir ==> Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.

dfs.data.dir ==> Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.

More configuration parameters here: http://hadoop.apache.org/common/docs/current/hdfs-default.html
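Because directories that do not exist are silently ignored for dfs.data.dir, a typo in the list is easy to miss. Here is a small sketch that checks a comma-delimited list before the services start. The name check_dir_list is my own invention, and it does not trim spaces around the commas.

```shell
# check_dir_list is a hypothetical helper for illustration only.
# It takes a comma-delimited directory list (the format used by
# dfs.name.dir and dfs.data.dir) and reports entries that do not exist.
# Usage: check_dir_list DIR1,DIR2,...
check_dir_list() {
    rc=0
    old_ifs=$IFS
    IFS=,
    # Unquoted on purpose: the shell splits the list on commas here.
    for dir in $1; do
        if [ ! -d "$dir" ]; then
            echo "not a directory: $dir"
            rc=1
        fi
    done
    IFS=$old_ifs
    return $rc
}
```

For the configuration above, check_dir_list /opt/hdfs/data and check_dir_list /opt/hdfs/name should both print nothing.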

Lastly, we come to the MapReduce site configuration file, mapred-site.xml. The updated version is shown below:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>



<configuration>
     <property>
         <name>mapred.job.tracker</name>
         <value>node616:9001</value>
     </property>
</configuration>

mapred.job.tracker ==> The host and port that the MapReduce job tracker runs at.

Edit the masters file and change the localhost value to node616, then do the same for the slaves file. Both files default to localhost, but since we're using proper host names it's better to update the entries so that the same files work on every master and slave node.
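Both files are plain lists with one host name per line, so they can also be written from a script. This is a sketch under that assumption, and write_cluster_files is a name I made up.

```shell
# write_cluster_files is a hypothetical helper for illustration only.
# It overwrites conf/masters and conf/slaves with one host name per line.
# Usage: write_cluster_files CONF_DIR MASTER SLAVE...
write_cluster_files() {
    conf_dir=$1
    master=$2
    shift 2
    printf '%s\n' "$master" > "$conf_dir/masters" || return 1
    # The remaining arguments are the slave host names.
    printf '%s\n' "$@" > "$conf_dir/slaves"
}
```

At this stage of the setup the call would be write_cluster_files /opt/hadoop/conf node616 node616, since node616 is the only slave so far.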

One last thing before starting up the service is to initialize the HDFS namenode directory. Execute the following command:

node616:/opt/hadoop/bin$ ./hadoop namenode -format

Everything should now be configured correctly :) We can start Hadoop from the bin directory:

node616:/opt/hadoop/bin$ ./start-all.sh
starting namenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-namenode-node616.out
Warning: $HADOOP_HOME is deprecated.
node616: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
node616: starting secondarynamenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-secondarynamenode-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
starting jobtracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-jobtracker-node616.out
Warning: $HADOOP_HOME is deprecated.
node616: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:

A quick check via ps:

hadoop     29004 28217 0 09:31 pts/0    00:00:07 /usr/bin/java -Dproc_jar -Xmx256m -Dhadoop.log.dir=/opt/hadoop/libexec/../logs -Dhadoop.log.file=hadoop.log -Dhadoop.hom
hadoop     30630     1 1 16:07 pts/0    00:00:02 /usr/bin/java -Dproc_namenode -Xmx1000m -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.ssl=false -Dc
hadoop     30743     1 3 16:07 ?        00:00:04 /usr/bin/java -Dproc_datanode -Xmx1000m -server -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.ssl=f
hadoop     30858     1 1 16:07 ?        00:00:01 /usr/bin/java -Dproc_secondarynamenode -Xmx1000m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.
hadoop     30940     1 2 16:07 pts/0    00:00:02 /usr/bin/java -Dproc_jobtracker -Xmx1000m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir
hadoop     31048     1 2 16:07 ?        00:00:03 /usr/bin/java -Dproc_tasktracker -Xmx1000m -Dhadoop.log.dir=/opt/hadoop/libexec/../logs -Dhadoop.log.file=hadoop-hadoop-ta
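Each daemon carries a -Dproc_<name> flag in the listing above, which makes it easy to boil the ps output down to a list of names. A small sketch follows. The function name list_hadoop_daemons is hypothetical, and it relies on grep -o, which GNU grep on Ubuntu supports.

```shell
# list_hadoop_daemons is a hypothetical helper for illustration only.
# It reads "ps -ef" style output on stdin and prints the daemon names
# found in -Dproc_<name> flags, one per line, sorted and de-duplicated.
list_hadoop_daemons() {
    # Requiring at least one letter after "Dproc_" keeps this pipeline's
    # own grep and sed processes out of the result.
    grep -o 'Dproc_[a-z][a-z]*' | sed 's/^Dproc_//' | sort -u
}
```

Used as ps -ef | list_hadoop_daemons, this should list namenode, datanode, secondarynamenode, jobtracker and tasktracker on node616 when everything is up.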

Now that we can see all processes are running, go ahead and visit the following URLs:

http://node616:50030 ==> Map/Reduce admin
http://node616:50070 ==> NameNode admin

Let's try copying some files over to the HDFS:

node616:/opt/hadoop/bin$ ./hadoop fs -copyFromLocal 20m.log .

And let's see if it's there:

node616:~$ hadoop fs -ls
Found 1 items
-rw-r--r-- 3 hadoop supergroup 5840878894 2011-11-29 09:21 /user/hadoop/20m.log

So far so good :)

Once you're done, shut down the Hadoop processes by executing stop-all.sh:

node616:/opt/hadoop/bin# ./stop-all.sh
stopping jobtracker
node616: stopping tasktracker
stopping namenode
node616: stopping datanode
node616: stopping secondarynamenode

Data nodes/slaves

Now that the namenode is up, we can proceed to set up our slaves.

Since we know that we'll have an additional two servers, we can add those entries to the conf/slaves file:

node616
node617
node618

If there's a need to add more in the future, slave nodes can be added dynamically.

Edit the hdfs-site.xml file and change the dfs.replication value from 1 to 3. This ensures that the data blocks are replicated to 3 nodes (which is actually the default value).
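If you would rather script that edit, sed can do it. The sketch below assumes the layout of the hdfs-site.xml shown earlier, with the value on the line directly after the name. The function name set_replication is made up.

```shell
# set_replication is a hypothetical helper for illustration only.
# It rewrites the <value> on the line right after the dfs.replication
# <name> line and leaves every other property untouched.
# Usage: set_replication HDFS_SITE_XML COUNT
set_replication() {
    file=$1
    count=$2
    # "n" prints the <name> line and loads the next line, which the
    # substitution then edits.
    sed "/<name>dfs.replication<\/name>/{
        n
        s|<value>[0-9]*</value>|<value>$count</value>|
    }" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}
```

Here that would be set_replication /opt/hadoop/conf/hdfs-site.xml 3. Keep a backup of the file if you try this, since the result replaces the original.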

Next, tar the entire Hadoop directory on the namenode by executing the following command:

node616:/opt$ tar czvf hadoop.tar.gz hadoop
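The distribution step that comes next can be scripted with scp and ssh once passwordless login works. This is only a sketch: distribute_hadoop and DRY_RUN are names I invented, and it assumes the tarball should land in /opt and be unpacked there on every slave.

```shell
# distribute_hadoop is a hypothetical helper for illustration only.
# It copies a tarball to /opt on each host and unpacks it there.
# With DRY_RUN=1 the commands are printed instead of executed.
# Usage: distribute_hadoop TARBALL HOST...
distribute_hadoop() {
    tarball=$1
    shift
    for host in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "scp $tarball $host:/opt/"
            echo "ssh $host tar xzf /opt/${tarball##*/} -C /opt"
        else
            scp "$tarball" "$host:/opt/" || return 1
            # ${tarball##*/} is the file name without its directory part.
            ssh "$host" "tar xzf /opt/${tarball##*/} -C /opt" || return 1
        fi
    done
}
```

For this cluster the call would be distribute_hadoop /opt/hadoop.tar.gz node617 node618. Running it with DRY_RUN=1 first shows exactly what would be executed.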

Transfer the tarball to the other servers (i.e. node617 and node618) and untar it. Make sure the /opt/hdfs directories have been created. Once the package has been extracted, go back to the namenode (node616) and execute the start-all.sh script. It should output the following:

node616:/opt/hadoop/bin# ./start-all.sh
starting namenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-namenode-node616.out
Warning: $HADOOP_HOME is deprecated.
node617: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node617.out
node616: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node616.out
node618: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node618.out
node617: Warning: $HADOOP_HOME is deprecated.
node617:
node616: Warning: $HADOOP_HOME is deprecated.
node616:
node618: Warning: $HADOOP_HOME is deprecated.
node618:
node616: starting secondarynamenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-secondarynamenode-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
starting jobtracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-jobtracker-node616.out
Warning: $HADOOP_HOME is deprecated.
node618: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node618.out
node617: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node617.out
node616: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node616.out
node618: Warning: $HADOOP_HOME is deprecated.
node618:
node617: Warning: $HADOOP_HOME is deprecated.
node617:
node616: Warning: $HADOOP_HOME is deprecated.
node616:

Notice that the script remotely starts the datanode and tasktracker services on the slave nodes. Visit the NameNode admin at http://node616:50070 to confirm the number of live nodes in the cluster.

Stopping/Starting Services in a node

To stop or start a specific service on just one node, use the bin/hadoop-daemon.sh script. As an example, to stop the datanode and the tasktracker processes on node618, I'll do: