Thursday, December 8, 2011

Setting up Hadoop in clustered mode in Ubuntu

Overview

This entry details the steps I took to set up Hadoop in clustered mode on Ubuntu 11.10.  Hadoop version 0.20.205.0 was used to set up the environment.  The Hadoop cluster consists of 3 servers/nodes:
  • node616 ==> namenode, secondarynamenode, jobtracker, datanode, tasktracker
  • node617 ==> datanode, tasktracker
  • node618 ==> datanode, tasktracker
In an actual production setup, the namenode shouldn't also act as a datanode, tasktracker and secondarynamenode.  But for the purpose of this setup, things will be simplified :)


Server setup

Ensure that the /etc/hosts file on all servers is updated properly.  All my servers have the following entries:
192.168.1.1    node616
192.168.1.2    node617
192.168.1.3    node618
This is to ensure that the configuration files stay the same in all servers.
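To confirm that each server resolves the names the same way, a quick check (getent is standard on Ubuntu):
$ getent hosts node617
192.168.1.2    node617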

The following directories must be created beforehand to store Hadoop related data:

  • /opt/hdfs/cache    ==> HDFS cache storage
  • /opt/hdfs/data    ==> HDFS data node storage
  • /opt/hdfs/name    ==> HDFS name node storage
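A quick way to create them (the hadoop user below matches the account the daemons run as later in this post; adjust to whatever account runs Hadoop in your setup):
$ sudo mkdir -p /opt/hdfs/cache /opt/hdfs/data /opt/hdfs/name
$ sudo chown -R hadoop:hadoop /opt/hdfs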


SSH setup

Before we proceed to the actual setup, the user running Hadoop must be able to ssh to all the servers without a passphrase.  Test this out by issuing the following command:
$ ssh node616
If it prompts for a password, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
The public key needs to be copied to all data nodes/slaves once they're set up in a later stage.
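When that stage comes, ssh-copy-id is the quickest way to push the key out (assuming it's installed; manually appending id_dsa.pub to the remote ~/.ssh/authorized_keys works just as well):
$ ssh-copy-id node617
$ ssh-copy-id node618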


Namenode setup

Obtain the Hadoop binary distribution from the main site (http://hadoop.apache.org) and place it in a location on the server.  I've used /opt/hadoop for all installations.
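From the command line, something like the following does the job (the archive URL is an assumption; verify it against the downloads page):
node616:/opt$ wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.205.0/hadoop-0.20.205.0.tar.gz
node616:/opt$ tar xzf hadoop-0.20.205.0.tar.gz
node616:/opt$ mv hadoop-0.20.205.0 hadoop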

The extracted directory contents should look like the output below:
node616:/opt/hadoop# ls -l
total 7144
drwxr-xr-x  2 root root    4096 2011-11-25 16:58 bin
-rw-rw-r--  1 root root  112062 2011-10-07 14:19 build.xml
drwxr-xr-x  4 root root    4096 2011-10-07 14:24 c++
-rw-rw-r--  1 root root  433928 2011-10-07 14:19 CHANGES.txt
drwxr-xr-x  2 root root    4096 2011-11-30 12:23 conf
drwxr-xr-x 11 root root    4096 2011-10-07 14:19 contrib
drwxr-xr-x  3 root root    4096 2011-10-07 14:20 etc
-rw-rw-r--  1 root root    6839 2011-10-07 14:19 hadoop-ant-0.20.205.0.jar
-rw-rw-r--  1 root root 3700955 2011-10-07 14:24 hadoop-core-0.20.205.0.jar
-rw-rw-r--  1 root root  142465 2011-10-07 14:19 hadoop-examples-0.20.205.0.jar
-rw-rw-r--  1 root root 2487116 2011-10-07 14:24 hadoop-test-0.20.205.0.jar
-rw-rw-r--  1 root root  287776 2011-10-07 14:19 hadoop-tools-0.20.205.0.jar
drwxr-xr-x  3 root root    4096 2011-10-07 14:20 include
drwxr-xr-x  2 root root    4096 2011-11-22 14:28 ivy
-rw-rw-r--  1 root root   10389 2011-10-07 14:19 ivy.xml
drwxr-xr-x  6 root root    4096 2011-11-22 14:28 lib
drwxr-xr-x  2 root root    4096 2011-11-22 14:28 libexec
-rw-rw-r--  1 root root   13366 2011-10-07 14:19 LICENSE.txt
drwxr-xr-x  4 root root    4096 2011-12-07 12:10 logs
-rw-rw-r--  1 root root     101 2011-10-07 14:19 NOTICE.txt
drwxr-xr-x  4 root root    4096 2011-11-29 10:36 out
-rw-rw-r--  1 root root    1366 2011-10-07 14:19 README.txt
drwxr-xr-x  2 root root    4096 2011-11-22 14:28 sbin
drwxr-xr-x  4 root root    4096 2011-10-07 14:20 share
drwxr-xr-x  9 root root    4096 2011-10-07 14:19 webapps
Navigate to the conf directory and edit the core-site.xml file.  The default file should look like the following:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
</configuration>
Now we'll have to add 2 properties to make this a clustered setup: 
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://node616:9000</value>
     </property>

     <property>
         <name>hadoop.tmp.dir</name>
         <value>/opt/hdfs/cache</value>
     </property>
</configuration>
  • fs.default.name ==> Sets the default file system name.  Since we're setting up a clustered environment, we'll set this to point to the namenode hostname and port, which in this case is the current machine.
  • hadoop.tmp.dir ==> A base for other temporary directories.  This points to /tmp by default, but the Linux /tmp mount point is usually very small, which caused problems for me.  The following exception was thrown when I did not explicitly set this property:
java.io.IOException: File /user/root/testfile could only be replicated to 0 nodes, instead of 1
For more properties, please consult the following URL: http://hadoop.apache.org/common/docs/current/core-default.html
   
Next comes the hdfs-site.xml file, which we'll customize as follows:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>

     <property>
         <name>dfs.name.dir</name>
         <value>/opt/hdfs/name</value>
     </property>

     <property>
         <name>dfs.data.dir</name>
         <value>/opt/hdfs/data</value>
     </property>
</configuration>
  • dfs.replication ==> Default block replication.  The actual number of replications can be specified when the file is created; the default is used if replication is not specified at create time.  Since we only have one node for now, we'll set it to 1.
  • dfs.name.dir ==> Determines where on the local filesystem the DFS name node should store the name table (fsimage).  If this is a comma-delimited list of directories, then the name table is replicated in all of the directories, for redundancy.
  • dfs.data.dir ==> Determines where on the local filesystem a DFS data node should store its blocks.  If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices (see the example below).  Directories that do not exist are ignored.
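As an example, a data node with a second disk mounted could spread its blocks across both (the paths here are purely illustrative):
<property>
    <name>dfs.data.dir</name>
    <value>/opt/hdfs/data,/mnt/disk2/hdfs/data</value>
</property>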
More configuration parameters here: http://hadoop.apache.org/common/docs/current/hdfs-default.html

Lastly, we come to the MapReduce site configuration file, mapred-site.xml.  The output below shows the updated version:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
     <property>
         <name>mapred.job.tracker</name>
         <value>node616:9001</value>
     </property>
</configuration>
  • mapred.job.tracker ==> The host and port that the MapReduce job tracker runs at.  In our case, the jobtracker runs on node616.
Edit the masters file and change the localhost value to node616; do the same for the slaves file.  By default, the host is set to localhost in both files.  However, since we're using proper host names, it's better to update the entries so that they remain valid on all master and slave nodes.
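At this stage both files contain just the single host:
node616:/opt/hadoop/conf$ cat masters
node616
node616:/opt/hadoop/conf$ cat slaves
node616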

One last thing before starting up the service is to initialize the HDFS namenode directory.  Execute the following command:
node616:/opt/hadoop/bin$ ./hadoop namenode -format
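If the format succeeded, the name directory should now be populated; in this release it looks roughly like this:
node616:/opt/hadoop/bin$ ls /opt/hdfs/name/current
edits  fsimage  fstime  VERSION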
Everything should be configured correctly :)  We can run Hadoop by going into the bin directory:
node616:/opt/hadoop/bin$ ./start-all.sh
starting namenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-namenode-node616.out
Warning: $HADOOP_HOME is deprecated.
node616: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
node616: starting secondarynamenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-secondarynamenode-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
starting jobtracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-jobtracker-node616.out
Warning: $HADOOP_HOME is deprecated.
node616: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
A quick check via ps:
hadoop     29004 28217  0 09:31 pts/0    00:00:07 /usr/bin/java -Dproc_jar -Xmx256m -Dhadoop.log.dir=/opt/hadoop/libexec/../logs -Dhadoop.log.file=hadoop.log -Dhadoop.hom
hadoop     30630     1  1 16:07 pts/0    00:00:02 /usr/bin/java -Dproc_namenode -Xmx1000m -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.ssl=false -Dc
hadoop     30743     1  3 16:07 ?        00:00:04 /usr/bin/java -Dproc_datanode -Xmx1000m -server -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.ssl=f
hadoop     30858     1  1 16:07 ?        00:00:01 /usr/bin/java -Dproc_secondarynamenode -Xmx1000m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.
hadoop     30940     1  2 16:07 pts/0    00:00:02 /usr/bin/java -Dproc_jobtracker -Xmx1000m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote -Dhadoop.log.dir
hadoop     31048     1  2 16:07 ?        00:00:03 /usr/bin/java -Dproc_tasktracker -Xmx1000m -Dhadoop.log.dir=/opt/hadoop/libexec/../logs -Dhadoop.log.file=hadoop-hadoop-ta
Now that we can see all processes are running, go ahead and visit the following URLs:
  • http://node616:50030 ==> Map/Reduce admin
  • http://node616:50070 ==> NameNode admin   
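If a browser isn't handy, a quick sanity check from the shell works too (wget ships with Ubuntu):
$ wget -qO- http://node616:50070/ | head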
Let's try copying some files over to the HDFS:
node616:/opt/hadoop/bin$ ./hadoop fs -copyFromLocal 20m.log .
And let's see if it's there:
node616:~$ hadoop fs -ls
Found 1 items
-rw-r--r--   3 hadoop supergroup 5840878894 2011-11-29 09:21 /user/hadoop/20m.log
So far so good :)
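A few other fs subcommands that come in handy while poking around:
node616:~$ hadoop fs -cat /user/hadoop/20m.log | head    # stream the file back out
node616:~$ hadoop fs -copyToLocal /user/hadoop/20m.log /tmp/20m.log
node616:~$ hadoop fs -rm /user/hadoop/20m.log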

Once you're done, shut down the Hadoop processes by executing stop-all.sh:
node616:/opt/hadoop/bin# ./stop-all.sh
stopping jobtracker
node616: stopping tasktracker
stopping namenode
node616: stopping datanode
node616: stopping secondarynamenode


Data nodes/slaves

Now that the namenode is up, we can proceed to set up our slaves.

Since we know that we'll have an additional two servers, we can add those entries into the conf/slaves file:
node616
node617
node618
If there's a need to add more in the future, slave nodes can be added dynamically.

Edit the hdfs-site.xml file and change the dfs.replication value from 1 to 3.  This ensures that the data blocks are replicated to 3 nodes (which is actually the default value).
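The relevant property in hdfs-site.xml after the change:
<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>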

Next, tar the entire Hadoop directory on the namenode by executing the following command:
node616:/opt$ tar czvf hadoop.tar.gz hadoop
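Transfer the tarball to the slaves and extract it there.  A minimal sketch, assuming passwordless SSH to node617 and node618 and write access to /opt:
node616:/opt$ for host in node617 node618; do
>     scp hadoop.tar.gz $host:/opt/
>     ssh $host 'cd /opt && tar xzf hadoop.tar.gz'
> done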
Make sure the /opt/hdfs directories have been created on node617 and node618 as well.  Once the package has been extracted on both slaves, go back to the namenode (node616) and execute the start-all.sh script.  It should output the following:
node616:/opt/hadoop/bin# ./start-all.sh
starting namenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-namenode-node616.out
Warning: $HADOOP_HOME is deprecated.
node617: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node617.out
node616: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node616.out
node618: starting datanode, logging to /opt/hadoop/libexec/../logs/hadoop-root-datanode-node618.out
node617: Warning: $HADOOP_HOME is deprecated.
node617:
node616: Warning: $HADOOP_HOME is deprecated.
node616:
node618: Warning: $HADOOP_HOME is deprecated.
node618:
node616: starting secondarynamenode, logging to /opt/hadoop/libexec/../logs/hadoop-root-secondarynamenode-node616.out
node616: Warning: $HADOOP_HOME is deprecated.
node616:
starting jobtracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-jobtracker-node616.out
Warning: $HADOOP_HOME is deprecated.
node618: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node618.out
node617: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node617.out
node616: starting tasktracker, logging to /opt/hadoop/libexec/../logs/hadoop-root-tasktracker-node616.out
node618: Warning: $HADOOP_HOME is deprecated.
node618:
node617: Warning: $HADOOP_HOME is deprecated.
node617:
node616: Warning: $HADOOP_HOME is deprecated.
node616:
Notice that the script remotely starts the datanode and tasktracker services on the slave nodes.  Visit the NameNode admin at http://node616:50070 to confirm the number of live nodes in the cluster.


Stopping/starting services on a single node

To stop or start a specific service on just one node, use the bin/hadoop-daemon.sh script.  As an example, to stop the datanode and tasktracker processes on node618, I'll do:

/opt/hadoop/bin/hadoop-daemon.sh stop datanode
/opt/hadoop/bin/hadoop-daemon.sh stop tasktracker
To start them up, simply substitute "stop" with "start" in the commands above.
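Since passwordless SSH is already in place, the same commands can also be fired off from the namenode without logging in to the slave:
node616:~$ ssh node618 '/opt/hadoop/bin/hadoop-daemon.sh stop datanode'
node616:~$ ssh node618 '/opt/hadoop/bin/hadoop-daemon.sh stop tasktracker'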

