This entry details the steps I took to set up a Hadoop cluster on Ubuntu 11.10, using Hadoop version 0.20.205.0. The Hadoop cluster consists of 3 servers/nodes:
- node616 ==> namenode, tasktracker, datanode, jobtracker, secondarynamenode
- node617 ==> datanode, tasktracker
- node618 ==> datanode, tasktracker
Server setup
Ensure that the /etc/hosts file on every server is updated properly. All my servers have the following entries:
192.168.1.1 node616
192.168.1.2 node617
192.168.1.3 node618
This is to ensure that the configuration files stay the same on all servers.
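The hosts entries can be checked with a small script before going further. This is only a sketch: the function name hosts_missing is made up for illustration, and it simply looks for each node name in the file it is given.

```shell
# hosts_missing FILE NAME...: print each NAME that has no entry in FILE.
# (Hypothetical helper, not part of Hadoop or Ubuntu.)
hosts_missing() {
  file="$1"
  shift
  for name in "$@"; do
    # Skip comment lines, then look for NAME in any field after the address.
    if ! awk -v n="$name" '
        /^[[:space:]]*#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
        END { exit !found }' "$file"; then
      echo "$name"
    fi
  done
}
```

Running `hosts_missing /etc/hosts node616 node617 node618` on each server should print nothing when all three entries are in place.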
The following directories must be created beforehand to store Hadoop related data:
- /opt/hdfs/cache ==> HDFS cache storage
- /opt/hdfs/data ==> HDFS data node storage
- /opt/hdfs/name ==> HDFS name node storage
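Creating the three directories can be scripted so the same step is repeated identically on every node. A minimal sketch, assuming the /opt/hdfs base path used in this post; the function name make_hdfs_dirs is hypothetical.

```shell
# make_hdfs_dirs [BASE]: create the cache, data and name directories
# under BASE (defaults to /opt/hdfs, the location used in this post).
make_hdfs_dirs() {
  base="${1:-/opt/hdfs}"
  for sub in cache data name; do
    mkdir -p "$base/$sub" || return 1
  done
}
```

Call `make_hdfs_dirs` on each server. The directories must be writable by whichever user runs the Hadoop processes, so adjust ownership afterwards if they were created by a different user.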
SSH setup
Before we proceed to the actual setup, the user running Hadoop must be able to ssh to the servers without a passphrase. Test this by issuing the following command:
$ ssh node616
If it prompts for a password, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
The public key needs to be copied to all data nodes/slaves once they're set up in a later stage.
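Because the `cat >>` step appends every time it is run, repeating it leaves duplicate keys in authorized_keys. The sketch below is one way to make the step safe to repeat, for example on each slave after the public key has been copied over. The function name authorize_key is an assumption of mine, not an OpenSSH tool.

```shell
# authorize_key PUBKEY AUTHFILE: append PUBKEY to AUTHFILE unless the
# same key line is already present. (Hypothetical helper.)
authorize_key() {
  pubkey="$1"
  authfile="$2"
  mkdir -p "$(dirname "$authfile")"
  touch "$authfile"
  # -F: fixed strings, -x: whole-line match, -q: quiet, -f: patterns from file.
  if ! grep -qxF -f "$pubkey" "$authfile"; then
    cat "$pubkey" >> "$authfile"
  fi
  # sshd normally refuses an authorized_keys file that others can write to.
  chmod 600 "$authfile"
}
```

Usage on a node would look like `authorize_key ~/.ssh/id_dsa.pub ~/.ssh/authorized_keys`.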
Namenode setup
Obtain the Hadoop binary distribution from the main site (http://hadoop.apache.org) and extract it to a location on the server. I've used /opt/hadoop for all installations.
The extracted directory contents should look like the output below:
node616:/opt/hadoop# ls -l
total 7144
drwxr-xr-x 2 root root 4096 2011-11-25 16:58 bin
-rw-rw-r-- 1 root root 112062 2011-10-07 14:19 build.xml
drwxr-xr-x 4 root root 4096 2011-10-07 14:24 c++
-rw-rw-r-- 1 root root 433928 2011-10-07 14:19 CHANGES.txt
drwxr-xr-x 2 root root 4096 2011-11-30 12:23 conf
drwxr-xr-x 11 root root 4096 2011-10-07 14:19 contrib
drwxr-xr-x 3 root root 4096 2011-10-07 14:20 etc
-rw-rw-r-- 1 root root 6839 2011-10-07 14:19 hadoop-ant-0.20.205.0.jar
-rw-rw-r-- 1 root root 3700955 2011-10-07 14:24 hadoop-core-0.20.205.0.jar
-rw-rw-r-- 1 root root 142465 2011-10-07 14:19 hadoop-examples-0.20.205.0.jar
-rw-rw-r-- 1 root root 2487116 2011-10-07 14:24 hadoop-test-0.20.205.0.jar
-rw-rw-r-- 1 root root 287776 2011-10-07 14:19 hadoop-tools-0.20.205.0.jar
drwxr-xr-x 3 root root 4096 2011-10-07 14:20 include
drwxr-xr-x 2 root root 4096 2011-11-22 14:28 ivy
-rw-rw-r-- 1 root root 10389 2011-10-07 14:19 ivy.xml
drwxr-xr-x 6 root root 4096 2011-11-22 14:28 lib
drwxr-xr-x 2 root root 4096 2011-11-22 14:28 libexec
-rw-rw-r-- 1 root root 13366 2011-10-07 14:19 LICENSE.txt
drwxr-xr-x 4 root root 4096 2011-12-07 12:10 logs
-rw-rw-r-- 1 root root 101 2011-10-07 14:19 NOTICE.txt
drwxr-xr-x 4 root root 4096 2011-11-29 10:36 out
-rw-rw-r-- 1 root root 1366 2011-10-07 14:19 README.txt
drwxr-xr-x 2 root root 4096 2011-11-22 14:28 sbin
drwxr-xr-x 4 root root 4096 2011-10-07 14:20 share
drwxr-xr-x 9 root root 4096 2011-10-07 14:19 webapps
Navigate to the conf directory and edit the core-site.xml file. The default file should look like the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
</configuration>
Now we'll have to add 2 properties to make this a clustered setup:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://node616:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hdfs/cache</value>
</property>
</configuration>
- fs.default.name ==> Sets the default file system name. Since we're setting up a clustered environment, we point this at the namenode's hostname and port, which in this case is the current machine.
- hadoop.tmp.dir ==> A base for other temporary directories. It defaults to a directory under /tmp (/tmp/hadoop-${user.name}), which caused problems for me because the Linux /tmp mount point is usually very small. The following exception was thrown when I did not set this property explicitly:
java.io.IOException: File /user/root/testfile could only be replicated to 0 nodes, instead of 1
For more properties, please consult the following URL: http://hadoop.apache.org/common/docs/current/core-default.html
Next comes the hdfs-site.xml file, which we'll customize as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.name.dir</name>
<value>/opt/hdfs/name</value>
</property>
<property>
<name>dfs.data.dir</name>
<value>/opt/hdfs/data</value>
</property>
</configuration>
- dfs.replication ==> Default block replication. The actual number of replicas can be specified when the file is created; the default is used if replication is not specified at create time. Since we only have one node, we'll set it to 1 for the time being.
- dfs.name.dir ==> Determines where on the local filesystem the DFS name node should store the name table (fsimage). If this is a comma-delimited list of directories, the name table is replicated in all of the directories for redundancy.
- dfs.data.dir ==> Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.
Lastly, we come to the MapReduce site configuration file, mapred-site.xml. The updated version is shown below:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>node616:9001</value>
</property>
</configuration>
- mapred.job.tracker ==> The host and port that the JobTracker process runs on.
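Before starting anything, it can help to read the values back out of the three site files and confirm they are what you expect. The sketch below is a rough text match, not a real XML parser: it assumes each property is written as a name element followed directly by its value element, as in the files above, and the function name get_prop is hypothetical.

```shell
# get_prop FILE NAME: print the <value> that follows <name>NAME</name>
# in a Hadoop site file. Prints nothing if the property is absent.
# Assumption: simple name-then-value layout; a property inside an XML
# comment would be matched as well.
get_prop() {
  tr -d '\n' < "$1" |
    sed -n "s:.*<name>$2</name>[[:space:]]*<value>\([^<]*\)</value>.*:\1:p"
}
```

For example, `get_prop /opt/hadoop/conf/core-site.xml fs.default.name` should print the namenode URL configured earlier.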
One last thing before starting up the service is to initialize the HDFS namenode directory. Execute the following command:
node616:/opt/hadoop/bin$ ./hadoop namenode -format
Everything should be configured correctly :) We can run Hadoop by going into the bin directory:
node616:/opt/hadoop/bin$ ./start-all.sh
starting namenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-namenode-node616.out
Warning: $HADOOP_HOME is deprecated.
node616: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
node616: starting secondarynamenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-secondarynamenode-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
starting jobtracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-jobtracker-node616.out
Warning: $HADOOP_HOME is deprecated.
node616: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
A quick check via ps:
hadoop 29004 28217 0 09:31 pts/0 00:00:07 /usr/bin/java -Dproc_jar -Xmx256m -Dhadoop.log.dir=/opt/hadoop/libexec/../logs -Dhadoop.log.file=hadoop.log -Dhadoop.hom
hadoop 30630 1 1 16:07 pts/0 00:00:02 /usr/bin/java -Dproc_namenode -Xmx1000m -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.ssl=false -Dc
hadoop 30743 1 3 16:07 ? 00:00:04 /usr/bin/java -Dproc_datanode -Xmx1000m -server -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.ssl=f
hadoop 30858 1 1 16:07 ? 00:00:01 /usr/bin/java -Dproc_secondarynamenode -Xmx1000m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.
hadoop 30940 1 2 16:07 pts/0 00:00:02 /usr/bin/java -Dproc_jobtracker -Xmx1000m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir
hadoop 31048 1 2 16:07 ? 00:00:03 /usr/bin/java -Dproc_tasktracker -Xmx1000m -Dhadoop.log.dir=/opt/hadoop/libexec/../logs -Dhadoop.log.file=hadoop-hadoop-ta
Now that we can see all processes are running, go ahead and visit the following URLs:
- http://node616:50030 ==> Map/Reduce admin
- http://node616:50070 ==> NameNode admin
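The ps check above can also be scripted. Each Hadoop process in that listing carries a -Dproc_<name> marker on its java command line, so a rough sketch is to look for the five markers expected on node616. The function name missing_daemons is hypothetical, and relying on these markers is an assumption based only on the ps output shown here.

```shell
# missing_daemons: read `ps -ef` style output on stdin and print every
# expected daemon whose -Dproc_<name> marker does not appear in it.
missing_daemons() {
  listing="$(cat)"
  for daemon in namenode datanode secondarynamenode jobtracker tasktracker; do
    case "$listing" in
      *"-Dproc_$daemon "*) ;;      # marker found, daemon is running
      *) echo "$daemon" ;;         # marker absent, report it
    esac
  done
}
```

On node616, `ps -ef | missing_daemons` should print nothing when everything is up. On a slave it would report the daemons that are not meant to run there, so the list would need trimming for that case.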
To test the setup, copy a file into HDFS:
node616:/opt/hadoop/bin$ ./hadoop fs -copyFromLocal 20m.log .
And let's see if it's there:
node616:~$ hadoop fs -ls
Found 1 items
-rw-r--r-- 3 hadoop supergroup 5840878894 2011-11-29 09:21 /user/hadoop/20m.log
So far so good :)
Once you're done, shut down the Hadoop processes by executing stop-all.sh:
node616:/opt/hadoop/bin# ./stop-all.sh
stopping jobtracker
node616: stopping tasktracker
stopping namenode
node616: stopping datanode
node616: stopping secondarynamenode
Data nodes/slaves
Now that the namenode is up, we can proceed to set up our slaves.
Since we know that we'll have two additional servers, we can add those entries to the conf/slaves file:
node616
node617
node618
If there's a need to add more in the future, slave nodes can be added dynamically.
Edit the hdfs-site.xml file and change the dfs.replication value from 1 to 3. This ensures that each data block is replicated to 3 nodes (3 is also the default value).
Next, tar the entire Hadoop directory on the namenode by executing the following command:
node616:/opt$ tar czvf hadoop.tar.gz hadoop
Transfer the tarball to the other servers (i.e. node617 and node618) and untar it. Make sure the /opt/hdfs directories have been created. Once the package has been extracted, go back to the namenode (node616) and execute the start-all.sh script. It should output the following:
node616:/opt/hadoop/bin# ./start-all.sh
starting namenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-namenode-node616.out
Warning: $HADOOP_HOME is deprecated.
node617: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node617.out
node616: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node616.out
node618: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node618.out
node617: Warning: $HADOOP_HOME is deprecated.
node617:
node616: Warning: $HADOOP_HOME is deprecated.
node616:
node618: Warning: $HADOOP_HOME is deprecated.
node618:
node616: starting secondarynamenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-secondarynamenode-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
starting jobtracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-jobtracker-node616.out
Warning: $HADOOP_HOME is deprecated.
node618: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node618.out
node617: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node617.out
node616: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node616.out
node618: Warning: $HADOOP_HOME is deprecated.
node618:
node617: Warning: $HADOOP_HOME is deprecated.
node617:
node616: Warning: $HADOOP_HOME is deprecated.
node616:
Notice that the script will remotely start the datanode and tasktracker services on the slave nodes. Visit the NameNode admin at http://node616:50070 to confirm the number of live nodes in the cluster.
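The tarball transfer described above can be driven from the conf/slaves file rather than typed once per server. The sketch below only prints the scp and ssh commands it would run (a dry run), so they can be reviewed first. The function name plan_distribution and the choice of scp are my assumptions; any other way of copying the file works just as well.

```shell
# plan_distribution SLAVES_FILE TARBALL SELF: print one scp and one ssh
# command per host listed in SLAVES_FILE, skipping SELF and blank lines.
# Nothing is executed here; the output is plain text.
plan_distribution() {
  slaves_file="$1"
  tarball="$2"
  self="$3"
  # The "|| [ -n ... ]" part keeps a final line that lacks a newline.
  while read -r host || [ -n "$host" ]; do
    [ -z "$host" ] && continue
    [ "$host" = "$self" ] && continue
    echo "scp $tarball $host:/opt/"
    echo "ssh $host 'cd /opt && tar xzf $(basename "$tarball")'"
  done < "$slaves_file"
}
```

For this cluster, `plan_distribution /opt/hadoop/conf/slaves /opt/hadoop.tar.gz node616` lists the commands for node617 and node618. Once they look right, the same output can be piped to `sh` to run them.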
Stopping/Starting Services in a node
To stop or start a specific service on just one node, use the bin/hadoop-daemon.sh script. As an example, to stop the datanode and the tasktracker processes on node618, I'll run the following on that node:
/opt/hadoop/bin/hadoop-daemon.sh stop datanode
/opt/hadoop/bin/hadoop-daemon.sh stop tasktracker
To start them up, simply substitute "stop" with "start" in the commands above.
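Since the two commands always go together on a slave, they can be wrapped in one small function. This is a sketch only: worker_daemons and the HADOOP_BIN variable are names I made up, and /opt/hadoop/bin is the install location used in this post.

```shell
# Assumption: HADOOP_BIN is the directory holding hadoop-daemon.sh.
HADOOP_BIN="${HADOOP_BIN:-/opt/hadoop/bin}"

# worker_daemons ACTION: run hadoop-daemon.sh ACTION for the datanode
# and the tasktracker, in that order. ACTION must be start or stop.
worker_daemons() {
  action="$1"
  case "$action" in
    start|stop) ;;
    *) echo "usage: worker_daemons start|stop" >&2; return 2 ;;
  esac
  for daemon in datanode tasktracker; do
    "$HADOOP_BIN/hadoop-daemon.sh" "$action" "$daemon" || return 1
  done
}
```

On node618, `worker_daemons stop` followed later by `worker_daemons start` takes the node out of the cluster and brings it back.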