Thursday, December 8, 2011

Setting up Hadoop in clustered mode in Ubuntu

Overview

This entry details the steps I took to set up Hadoop in clustered mode on Ubuntu 11.10.  Hadoop version 0.20.205.0 was used for the environment.  The Hadoop cluster consists of 3 servers/nodes:
  • node616 ==> namenode, tasktracker, datanode, jobtracker, secondarynamenode
  • node617 ==> datanode, tasktracker
  • node618 ==> datanode, tasktracker
In an actual production setup, the namenode shouldn't also act as a datanode, jobtracker and secondarynamenode.  But for the purpose of this setup, things will be simplified :)


Server setup

Ensure that the /etc/hosts file on all servers is updated properly.  All my servers have the following entries:
192.168.1.1    node616
192.168.1.2    node617
192.168.1.3    node618
This is to ensure that the configuration files stay the same in all servers.

The following directories must be created beforehand to store Hadoop-related data:

  • /opt/hdfs/cache    ==> HDFS cache storage
  • /opt/hdfs/data    ==> HDFS data node storage
  • /opt/hdfs/name    ==> HDFS name node storage
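A minimal sketch of creating them on each node (this assumes the daemons run as a dedicated hadoop user; adjust the ownership to whatever user actually runs Hadoop):
$ sudo mkdir -p /opt/hdfs/cache /opt/hdfs/data /opt/hdfs/name
$ sudo chown -R hadoop:hadoop /opt/hdfs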


SSH setup

Before we proceed to the actual setup, the user running Hadoop must be able to ssh to the servers without a passphrase.  Test this by issuing the following command:
$ ssh node616
If it prompts for a password, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
The public key also needs to be copied to all data nodes/slaves once they're set up in a later stage, as sketched below.
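A minimal sketch of pushing the key out once node617 and node618 are reachable (assuming the same user account exists on the slaves; ssh-copy-id ships with OpenSSH on Ubuntu):
$ ssh-copy-id -i ~/.ssh/id_dsa.pub node617
$ ssh-copy-id -i ~/.ssh/id_dsa.pub node618
$ ssh node617 hostname    # should return node617 without prompting for a password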


Namenode setup

Obtain the Hadoop binary distribution from the main site (http://hadoop.apache.org) and extract it to a location on the server.  I've used /opt/hadoop for all installations.

The extracted directory contents should look like the output below:
node616:/opt/hadoop# ls -l
total 7144
drwxr-xr-x  2 root root    4096 2011-11-25 16:58 bin
-rw-rw-r--  1 root root  112062 2011-10-07 14:19 build.xml
drwxr-xr-x  4 root root    4096 2011-10-07 14:24 c++
-rw-rw-r--  1 root root  433928 2011-10-07 14:19 CHANGES.txt
drwxr-xr-x  2 root root    4096 2011-11-30 12:23 conf
drwxr-xr-x 11 root root    4096 2011-10-07 14:19 contrib
drwxr-xr-x  3 root root    4096 2011-10-07 14:20 etc
-rw-rw-r--  1 root root    6839 2011-10-07 14:19 hadoop-ant-0.20.205.0.jar
-rw-rw-r--  1 root root 3700955 2011-10-07 14:24 hadoop-core-0.20.205.0.jar
-rw-rw-r--  1 root root  142465 2011-10-07 14:19 hadoop-examples-0.20.205.0.jar
-rw-rw-r--  1 root root 2487116 2011-10-07 14:24 hadoop-test-0.20.205.0.jar
-rw-rw-r--  1 root root  287776 2011-10-07 14:19 hadoop-tools-0.20.205.0.jar
drwxr-xr-x  3 root root    4096 2011-10-07 14:20 include
drwxr-xr-x  2 root root    4096 2011-11-22 14:28 ivy
-rw-rw-r--  1 root root   10389 2011-10-07 14:19 ivy.xml
drwxr-xr-x  6 root root    4096 2011-11-22 14:28 lib
drwxr-xr-x  2 root root    4096 2011-11-22 14:28 libexec
-rw-rw-r--  1 root root   13366 2011-10-07 14:19 LICENSE.txt
drwxr-xr-x  4 root root    4096 2011-12-07 12:10 logs
-rw-rw-r--  1 root root     101 2011-10-07 14:19 NOTICE.txt
drwxr-xr-x  4 root root    4096 2011-11-29 10:36 out
-rw-rw-r--  1 root root    1366 2011-10-07 14:19 README.txt
drwxr-xr-x  2 root root    4096 2011-11-22 14:28 sbin
drwxr-xr-x  4 root root    4096 2011-10-07 14:20 share
drwxr-xr-x  9 root root    4096 2011-10-07 14:19 webapps
Navigate to the conf directory and edit the core-site.xml file.  The default file should look like the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
</configuration>
Now we'll have to add 2 properties to make this a clustered setup: 
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://node616:9000</value>
     </property>

     <property>
         <name>hadoop.tmp.dir</name>
         <value>/opt/hdfs/cache</value>
     </property>
</configuration>
  • fs.default.name ==> Sets the default file system name.  Since we're setting up a clustered environment, we'll set this to point to the namenode hostname and port, which in this case is the current machine.
  • hadoop.tmp.dir ==> A base for other temporary directories.  This defaults to a location under /tmp, but the Linux /tmp mount point is usually quite small and caused problems for me (see the quick check after this list).  The following exception was thrown if I did not explicitly set this property:
java.io.IOException: File /user/root/testfile could only be replicated to 0 nodes, instead of 1
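If you're tempted to keep the default location, it's worth checking how much free space the /tmp mount actually has first (a quick check, not part of the original walkthrough):
$ df -h /tmp /opt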
For more properties, please consult the following URL: http://hadoop.apache.org/common/docs/current/core-default.html
   
Next comes the hdfs-site.xml file, which we'll customize as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>

     <property>
         <name>dfs.name.dir</name>
         <value>/opt/hdfs/name</value>
     </property>

     <property>
         <name>dfs.data.dir</name>
         <value>/opt/hdfs/data</value>
     </property>
</configuration>
  • dfs.replication ==> Default block replication.  The actual number of replications can be specified when a file is created; the default is used if replication is not specified at create time.  Since we only have one node for now, we'll set it to 1 for the time being.
  • dfs.name.dir ==> Determines where on the local filesystem the DFS name node should store the name table (fsimage).  If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.
  • dfs.data.dir ==> Determines where on the local filesystem a DFS data node should store its blocks.  If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices.  Directories that do not exist are ignored.
More configuration parameters here: http://hadoop.apache.org/common/docs/current/hdfs-default.html

Lastly, we come to the MapReduce site configuration file, mapred-site.xml.  The output below shows the updated version:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
     <property>
         <name>mapred.job.tracker</name>
         <value>node616:9001</value>
     </property>
</configuration>
  • mapred.job.tracker ==> The host and port that the MapReduce JobTracker runs at.
Edit the masters file and change the localhost value to node616, and do the same for the slaves file.  By default, the host is set to localhost in both files; since we're using proper host names, it's better to update the entries so that the same configuration works across the master and slave nodes.
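At this stage (before the extra slaves are added later on), both files should contain just the one hostname:
node616:/opt/hadoop/conf$ cat masters slaves
node616
node616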

One last thing before starting up the services is to format the HDFS namenode.  Execute the following command:
node616:/opt/hadoop/bin$ ./hadoop namenode -format
Everything should be configured correctly :)  We can run Hadoop by going into the bin directory:
node616:/opt/hadoop/bin$ ./start-all.sh
starting namenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-namenode-node616.out
Warning: $HADOOP_HOME is deprecated.
node616: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
node616: starting secondarynamenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-secondarynamenode-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
starting jobtracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-jobtracker-node616.out
Warning: $HADOOP_HOME is deprecated.
node616: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
A quick check via ps:
hadoop     29004 28217  0 09:31 pts/0    00:00:07 /usr/bin/java -Dproc_jar -Xmx256m -Dhadoop.log.dir=/opt/hadoop/libexec/../logs -Dhadoop.log.file=hadoop.log -Dhadoop.hom
hadoop     30630     1  1 16:07 pts/0    00:00:02 /usr/bin/java -Dproc_namenode -Xmx1000m -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.ssl=false -Dc
hadoop     30743     1  3 16:07 ?        00:00:04 /usr/bin/java -Dproc_datanode -Xmx1000m -server -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.ssl=f
hadoop     30858     1  1 16:07 ?        00:00:01 /usr/bin/java -Dproc_secondarynamenode -Xmx1000m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.
hadoop     30940     1  2 16:07 pts/0    00:00:02 /usr/bin/java -Dproc_jobtracker -Xmx1000m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir
hadoop     31048     1  2 16:07 ?        00:00:03 /usr/bin/java -Dproc_tasktracker -Xmx1000m -Dhadoop.log.dir=/opt/hadoop/libexec/../logs -Dhadoop.log.file=hadoop-hadoop-ta
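As an alternative to grepping through ps output, the JDK's jps tool lists the Java daemons by class name (assuming a full JDK is installed):
node616:~$ jps    # should list NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker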
Now that we can see all processes are running, go ahead and visit the following URLs:
  • http://node616:50030 ==> Map/Reduce admin
  • http://node616:50070 ==> NameNode admin   
Let's try copying some files over to the HDFS:
node616:/opt/hadoop/bin$ ./hadoop fs -copyFromLocal 20m.log .
And let's see if it's there:
node616:~$ hadoop fs -ls
Found 1 items
-rw-r--r--   3 hadoop supergroup 5840878894 2011-11-29 09:21 /user/hadoop/20m.log
So far so good :)
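As an optional sanity check of the MapReduce side (not part of the original run; the output directory name is just an example), the bundled examples jar can run a wordcount job over the file we just copied in:
node616:/opt/hadoop/bin$ ./hadoop jar ../hadoop-examples-0.20.205.0.jar wordcount 20m.log 20m-wordcount
node616:/opt/hadoop/bin$ ./hadoop fs -ls 20m-wordcount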

Once you're done, shut down the Hadoop processes by executing stop-all.sh:
node616:/opt/hadoop/bin# ./stop-all.sh
stopping jobtracker
node616: stopping tasktracker
stopping namenode
node616: stopping datanode
node616: stopping secondarynamenode


Data nodes/slaves

Now that the namenode is up, we can proceed to set up our slaves.

Since we know that we'll have an additional two servers, we can add their entries to the conf/slaves file:
node616
node617
node618
If there's a need to add more in the future, slave nodes can also be added dynamically.

Edit the hdfs-site.xml file and change the dfs.replication value from 1 to 3.  This ensures that the data blocks are replicated to 3 nodes (which is actually the default value).
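The relevant property in hdfs-site.xml now reads:
     <property>
         <name>dfs.replication</name>
         <value>3</value>
     </property>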

Next, tar the entire Hadoop directory in the Namenode by executing the following command:
node616:/opt$ tar czvf hadoop.tar.gz hadoop
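Transfer the tarball to the other servers (i.e. node617 and node618) and untar it there.  A minimal sketch using the passwordless SSH set up earlier (assuming write access to /opt on the slaves):
node616:/opt$ scp hadoop.tar.gz node617:/opt/
node616:/opt$ scp hadoop.tar.gz node618:/opt/
node616:/opt$ ssh node617 'cd /opt && tar xzf hadoop.tar.gz'
node616:/opt$ ssh node618 'cd /opt && tar xzf hadoop.tar.gz'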
Make sure the /opt/hdfs directories have been created on the slaves as well.  Once the package has been extracted on both, go back to the namenode (node616) and execute the start-all.sh script.  It should output the following:
node616:/opt/hadoop/bin# ./start-all.sh
starting namenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-namenode-node616.out
Warning: $HADOOP_HOME is deprecated.
node617: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node617.out
node616: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node616.out
node618: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node618.out
node617: Warning: $HADOOP_HOME is deprecated.
node617:
node616: Warning: $HADOOP_HOME is deprecated.
node616:
node618: Warning: $HADOOP_HOME is deprecated.
node618:
node616: starting secondarynamenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-secondarynamenode-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
starting jobtracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-jobtracker-node616.out
Warning: $HADOOP_HOME is deprecated.
node618: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node618.out
node617: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node617.out
node616: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node616.out
node618: Warning: $HADOOP_HOME is deprecated.
node618:
node617: Warning: $HADOOP_HOME is deprecated.
node617:
node616: Warning: $HADOOP_HOME is deprecated.
node616:
Notice that the script remotely starts the datanode and tasktracker services on the slave nodes.  Visit the NameNode admin at http://node616:50070 to confirm the number of live nodes in the cluster, or check from the command line as shown below.
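The same information is available via dfsadmin (run from the namenode):
node616:/opt/hadoop/bin$ ./hadoop dfsadmin -report    # lists configured capacity and each live datanode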


Stopping/Starting Services in a node

To stop or start a specific service on just one node, use the bin/hadoop-daemon.sh script.  As an example, to stop the datanode and tasktracker processes on node618, I'd run the following on node618:

/opt/hadoop/bin/hadoop-daemon.sh stop datanode
/opt/hadoop/bin/hadoop-daemon.sh stop tasktracker
To start them up, simply substitute "stop" with "start" in the commands above.
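These per-node scripts can also be invoked remotely from the namenode over the passwordless SSH set up earlier (a small convenience, not from the original post):
node616:~$ ssh node618 '/opt/hadoop/bin/hadoop-daemon.sh start datanode'
node616:~$ ssh node618 '/opt/hadoop/bin/hadoop-daemon.sh start tasktracker'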

