Overview
This entry details the steps I took to set up Hadoop in a clustered configuration on Ubuntu 11.10. Hadoop version 0.20.205.0 was used to set up the environment. The Hadoop cluster consists of 3 servers/nodes:
- node616 ==> namenode, tasktracker, datanode, jobtracker, secondarynamenode
- node617 ==> datanode, tasktracker
- node618 ==> datanode, tasktracker
In an actual production setup, the namenode shouldn't also act as a datanode, jobtracker and secondarynamenode. But for the purpose of this setup, things will be simplified :)
Server setup
Ensure that the /etc/hosts file on all servers is updated properly. All my servers have the following entries:
192.168.1.1 node616
192.168.1.2 node617
192.168.1.3 node618
This is to ensure that the configuration files stay the same in all servers.
The following directories must be created beforehand to store Hadoop-related data (a sketch of the commands follows the list):
- /opt/hdfs/cache ==> HDFS cache storage
- /opt/hdfs/data ==> HDFS data node storage
- /opt/hdfs/name ==> HDFS name node storage
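On each server, something like this should do it (a sketch; the chown assumes Hadoop will run as the hadoop user, as in the ps output further down):
$ sudo mkdir -p /opt/hdfs/cache /opt/hdfs/data /opt/hdfs/name
$ sudo chown -R hadoop:hadoop /opt/hdfs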
SSH setup
Before we proceed to the actual setup, the user running Hadoop must be able to ssh to the servers without a passphrase. Test this out by issuing the following command:
$ ssh node616
If it prompts for a password, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
The public key needs to be copied to all data nodes/slaves once they're set up in a later stage.
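Once the slave servers exist, the key can be pushed over with ssh-copy-id (shipped with Ubuntu's OpenSSH client tools); node617 and node618 here are the slaves configured later:
$ ssh-copy-id -i ~/.ssh/id_dsa.pub node617
$ ssh-copy-id -i ~/.ssh/id_dsa.pub node618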
Namenode setup
Obtain the Hadoop binary distribution from the main site (http://hadoop.apache.org) and place it in a location on the server. I've used /opt/hadoop for all installations.
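As a rough sketch of the download and extraction (the archive.apache.org URL below is an assumption; any mirror carrying 0.20.205.0 will do):
node616:/opt# wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.205.0/hadoop-0.20.205.0.tar.gz
node616:/opt# tar xzf hadoop-0.20.205.0.tar.gz
node616:/opt# mv hadoop-0.20.205.0 hadoop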
The extracted directory contents should look like the output below:
node616:/opt/hadoop# ls -l
total 7144
drwxr-xr-x 2 root root 4096 2011-11-25 16:58 bin
-rw-rw-r-- 1 root root 112062 2011-10-07 14:19 build.xml
drwxr-xr-x 4 root root 4096 2011-10-07 14:24 c++
-rw-rw-r-- 1 root root 433928 2011-10-07 14:19 CHANGES.txt
drwxr-xr-x 2 root root 4096 2011-11-30 12:23 conf
drwxr-xr-x 11 root root 4096 2011-10-07 14:19 contrib
drwxr-xr-x 3 root root 4096 2011-10-07 14:20 etc
-rw-rw-r-- 1 root root 6839 2011-10-07 14:19 hadoop-ant-0.20.205.0.jar
-rw-rw-r-- 1 root root 3700955 2011-10-07 14:24 hadoop-core-0.20.205.0.jar
-rw-rw-r-- 1 root root 142465 2011-10-07 14:19 hadoop-examples-0.20.205.0.jar
-rw-rw-r-- 1 root root 2487116 2011-10-07 14:24 hadoop-test-0.20.205.0.jar
-rw-rw-r-- 1 root root 287776 2011-10-07 14:19 hadoop-tools-0.20.205.0.jar
drwxr-xr-x 3 root root 4096 2011-10-07 14:20 include
drwxr-xr-x 2 root root 4096 2011-11-22 14:28 ivy
-rw-rw-r-- 1 root root 10389 2011-10-07 14:19 ivy.xml
drwxr-xr-x 6 root root 4096 2011-11-22 14:28 lib
drwxr-xr-x 2 root root 4096 2011-11-22 14:28 libexec
-rw-rw-r-- 1 root root 13366 2011-10-07 14:19 LICENSE.txt
drwxr-xr-x 4 root root 4096 2011-12-07 12:10 logs
-rw-rw-r-- 1 root root 101 2011-10-07 14:19 NOTICE.txt
drwxr-xr-x 4 root root 4096 2011-11-29 10:36 out
-rw-rw-r-- 1 root root 1366 2011-10-07 14:19 README.txt
drwxr-xr-x 2 root root 4096 2011-11-22 14:28 sbin
drwxr-xr-x 4 root root 4096 2011-10-07 14:20 share
drwxr-xr-x 9 root root 4096 2011-10-07 14:19 webapps
Navigate to the conf directory and edit the core-site.xml file. The default file should look like the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
</configuration>
Now we'll have to add 2 properties to make this a clustered setup:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://node616:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hdfs/cache</value>
  </property>
</configuration>
- fs.default.name ==> Sets the default file system name. Since we're setting up a clustered environment, we'll set this to point to the namenode's hostname and port, which in this case is the current machine.
- hadoop.tmp.dir ==> A base for other temporary directories. It points to /tmp by default, but the Linux /tmp mount point is usually very small and caused problems for me. The following exception was thrown when I did not explicitly set this property:
java.io.IOException: File /user/root/testfile could only be replicated to 0 nodes, instead of 1
For more properties, please consult the following URL: http://hadoop.apache.org/common/docs/current/core-default.html
Next comes the hdfs-site.xml file, which we'll customize as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/opt/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/opt/hdfs/data</value>
  </property>
</configuration>
- dfs.replication ==> Default block replication. The actual number of replicas can be specified when the file is created; the default is used if replication is not specified at create time. Since we only have one datanode for now, we'll set it to 1 for the time being.
- dfs.name.dir ==> Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.
- dfs.data.dir ==> Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
More configuration parameters here: http://hadoop.apache.org/common/docs/current/hdfs-default.html
Lastly, we come to the MapReduce site configuration file; mapred-site.xml. Output below shows the updated version:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>node616:9001</value>
  </property>
</configuration>
- mapred.job.tracker ==> The host and port that the MapReduce job tracker runs at.
Edit the masters file and change the localhost value to node616. Do the same for the slaves file. By default, the host is set to localhost in both files, but since we're using proper host names, it's better to update the entries so that all master and slave nodes can use the same configuration files.
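At this stage both files simply contain the namenode's hostname:
node616:/opt/hadoop/conf$ cat masters
node616
node616:/opt/hadoop/conf$ cat slaves
node616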
One last thing before starting up the service is to initialize the HDFS namenode directory. Execute the following command:
node616:/opt/hadoop/bin$ ./hadoop namenode -format
Everything should now be configured correctly :) We can start Hadoop from the bin directory:
node616:/opt/hadoop/bin$ ./start-all.sh
starting namenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-namenode-node616.out
Warning: $HADOOP_HOME is deprecated.
node616: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
node616: starting secondarynamenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-secondarynamenode-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
starting jobtracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-jobtracker-node616.out
Warning: $HADOOP_HOME is deprecated.
node616: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
A quick check via ps:
hadoop 29004 28217 0 09:31 pts/0 00:00:07 /usr/bin/java -Dproc_jar -Xmx256m -Dhadoop.log.dir=/opt/hadoop/libexec/../logs -Dhadoop.log.file=hadoop.log -Dhadoop.hom
hadoop 30630 1 1 16:07 pts/0 00:00:02 /usr/bin/java -Dproc_namenode -Xmx1000m -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.ssl=false -Dc
hadoop 30743 1 3 16:07 ? 00:00:04 /usr/bin/java -Dproc_datanode -Xmx1000m -server -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.ssl=f
hadoop 30858 1 1 16:07 ? 00:00:01 /usr/bin/java -Dproc_secondarynamenode -Xmx1000m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.
hadoop 30940 1 2 16:07 pts/0 00:00:02 /usr/bin/java -Dproc_jobtracker -Xmx1000m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir
hadoop 31048 1 2 16:07 ? 00:00:03 /usr/bin/java -Dproc_tasktracker -Xmx1000m -Dhadoop.log.dir=/opt/hadoop/libexec/../logs -Dhadoop.log.file=hadoop-hadoop-ta
Now that we can see all processes are running, go ahead and visit the following URLs:
- http://node616:50030 ==> Map/Reduce admin
- http://node616:50070 ==> NameNode admin
Let's try copying some files over to the HDFS:
node616:/opt/hadoop/bin$ ./hadoop fs -copyFromLocal 20m.log .
And let's see if it's there:
node616:~$ hadoop fs -ls
Found 1 items
-rw-r--r-- 3 hadoop supergroup 5840878894 2011-11-29 09:21 /user/hadoop/20m.log
So far so good :)
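As an optional smoke test for the MapReduce side, the bundled examples jar can be run against the uploaded file (the output path below is just illustrative):
node616:/opt/hadoop/bin$ ./hadoop jar ../hadoop-examples-0.20.205.0.jar wordcount /user/hadoop/20m.log /user/hadoop/wordcount-out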
Once you're done, shut down the Hadoop processes by executing stop-all.sh:
node616:/opt/hadoop/bin# ./stop-all.sh
stopping jobtracker
node616: stopping tasktracker
stopping namenode
node616: stopping datanode
node616: stopping secondarynamenode
Data nodes/slaves
Now that the namenode is up, we can proceed to set up our slaves.
Since we know that we'll have an additional two servers, we can add those entries to the conf/slaves file:
node616
node617
node618
If there's a need to add more in the future, slave nodes can be added dynamically.
Edit the hdfs-site.xml file and change the dfs.replication value from 1 to 3. This ensures that data blocks are replicated to 3 nodes (which is actually the default value).
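The relevant property in hdfs-site.xml then looks like this:
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>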
Next, tar the entire Hadoop directory on the namenode by executing the following command:
node616:/opt$ tar czvf hadoop.tar.gz hadoop
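Transferring and extracting the tarball on the slaves can be scripted from the namenode along these lines (a sketch, assuming passwordless SSH as the same user and the same /opt layout on node617 and node618):
node616:/opt$ scp hadoop.tar.gz node617:/opt/
node616:/opt$ scp hadoop.tar.gz node618:/opt/
node616:/opt$ ssh node617 'cd /opt && tar xzf hadoop.tar.gz && mkdir -p /opt/hdfs/cache /opt/hdfs/data /opt/hdfs/name'
node616:/opt$ ssh node618 'cd /opt && tar xzf hadoop.tar.gz && mkdir -p /opt/hdfs/cache /opt/hdfs/data /opt/hdfs/name'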
Once the tarball has been transferred and extracted on node617 and node618, and the /opt/hdfs directories created there, go back to the namenode (node616) and execute the start-all.sh script. It should output the following:
node616:/opt/hadoop/bin# ./start-all.sh
starting namenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-namenode-node616.out
Warning: $HADOOP_HOME is deprecated.
node617: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node617.out
node616: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node616.out
node618: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node618.out
node617: Warning: $HADOOP_HOME is deprecated.
node617:
node616: Warning: $HADOOP_HOME is deprecated.
node616:
node618: Warning: $HADOOP_HOME is deprecated.
node618:
node616: starting secondarynamenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-secondarynamenode-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
starting jobtracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-jobtracker-node616.out
Warning: $HADOOP_HOME is deprecated.
node618: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node618.out
node617: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node617.out
node616: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node616.out
node618: Warning: $HADOOP_HOME is deprecated.
node618:
node617: Warning: $HADOOP_HOME is deprecated.
node617:
node616: Warning: $HADOOP_HOME is deprecated.
node616:
Notice that the script remotely starts the datanode and tasktracker services on the slave nodes. Visit the NameNode admin at http://node616:50070 to confirm the number of live nodes in the cluster.
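The same check can be done from the command line on the namenode with dfsadmin, which lists the datanodes and their capacity:
node616:/opt/hadoop/bin$ ./hadoop dfsadmin -report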
Stopping/Starting Services in a node
To stop or start a specific service on just one node, use the bin/hadoop-daemon.sh script. As an example, to stop the datanode and tasktracker processes on node618, I'll do:
/opt/hadoop/bin/hadoop-daemon.sh stop datanode
/opt/hadoop/bin/hadoop-daemon.sh stop tasktracker
To start them up, simply substitute "stop" with "start" in the commands above.
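That is, to bring them back up on node618:
/opt/hadoop/bin/hadoop-daemon.sh start datanode
/opt/hadoop/bin/hadoop-daemon.sh start tasktracker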
References