Tuesday, November 17, 2015

How to Set Up Hadoop on CentOS/RHEL 6/7

Configuring a working Hadoop 2.6.0 environment on CentOS 7 is a bit of a struggle. Here are the steps we took to set everything up and end up with a working Hadoop cluster.
Basic OS setup
Let’s assume that we have a fresh CentOS/RHEL install. On each node:
1. Edit /etc/hosts:

[root@hmaster ~]# vi /etc/hosts

2. Add the following lines (change the IP addresses accordingly):

192.168.40.142   hmaster
192.168.40.143   hslave1
192.168.40.144   hslave2
192.168.40.145   hslave3
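
To check that name resolution works, it doesn't hurt to ping each node by hostname (a quick sanity check; the replies should come from the addresses above):

# ping -c 1 hslave1
# ping -c 1 hslave2
# ping -c 1 hslave3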

3. Create user hadoop:

[root@hmaster ~]# useradd hadoop
[root@hmaster ~]# passwd hadoop
4. Set up key-based (passwordless) SSH login:

# su - hadoop
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hmaster
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hslave1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hslave2
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hslave3
$ chmod 0600 ~/.ssh/authorized_keys

This will be needed later, when the master starts all the necessary Hadoop services on the slave nodes.
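
To verify that passwordless login works, run a command on one of the slaves over SSH; it should complete without a password prompt:

$ ssh hslave1 hostname
hslave1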

5. Install the Java SDK (note: yum will install OpenJDK here, not the Oracle JDK):

[root@hmaster ~]# yum install java
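
You can verify the installation and see where the JDK actually lives (on CentOS, /usr/lib/jvm/jre is a symlink maintained by the alternatives system, which is why we use it as JAVA_HOME below):

[root@hmaster ~]# java -version
[root@hmaster ~]# readlink -f /usr/bin/java
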
6. Set up environment variables for the “hadoop” user:

[root@hmaster ~]# vi /home/hadoop/.bashrc
7. Add the following lines to /home/hadoop/.bashrc on all the nodes (you can use scp to copy the file to the other nodes):

export JAVA_HOME=/usr/lib/jvm/jre
export HADOOP_PREFIX=/opt/hadoop
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX
export PATH=$PATH:$HADOOP_PREFIX/sbin:$HADOOP_PREFIX/bin
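
Reload the file and check that the variables are set (hadoop itself will only be on the PATH after step 8, once /opt/hadoop exists):

$ source ~/.bashrc
$ echo $HADOOP_HOME
/opt/hadoop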

Installing and configuring Hadoop 2.6.0
8. On master: download Hadoop and extract the package:

# cd /opt
# wget http://www.eu.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
# tar -zxf hadoop-2.6.0.tar.gz
# rm hadoop-2.6.0.tar.gz
# mv hadoop-2.6.0 hadoop
9. Propagate /opt/hadoop to slave nodes:

# scp -r hadoop hslave1:/opt
# scp -r hadoop hslave2:/opt
# scp -r hadoop hslave3:/opt
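
Equivalently, as a loop (handy if you add more slaves later):

# for node in hslave1 hslave2 hslave3; do scp -r hadoop $node:/opt; done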
10. Edit /opt/hadoop/etc/hadoop/core-site.xml – set up the NameNode URI on every node:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hmaster:9000/</value>
    </property>
</configuration>
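
Once the environment from step 7 is in place, Hadoop can echo this setting back to you (run as user hadoop):

$ hdfs getconf -confKey fs.defaultFS
hdfs://hmaster:9000/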


11. Create HDFS DataNode data dirs on every node and change ownership of /opt/hadoop:

# chown -R hadoop:hadoop /opt/hadoop/
# mkdir /home/hadoop/datanode
# chown -R hadoop:hadoop /home/hadoop/datanode/
12. Edit /opt/hadoop/etc/hadoop/hdfs-site.xml – set up DataNodes:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>3</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/hadoop/datanode</value>
    </property>
</configuration>


13. Create HDFS NameNode data dirs on master:

# mkdir /home/hadoop/namenode
# chown -R hadoop:hadoop /home/hadoop/namenode/
14. Edit /opt/hadoop/etc/hadoop/hdfs-site.xml on master and add a further property inside the existing <configuration> element:

    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/home/hadoop/namenode</value>
    </property>

15. Edit /opt/hadoop/etc/hadoop/mapred-site.xml on master (if the file does not exist yet, copy mapred-site.xml.template to mapred-site.xml first):

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

16. Edit /opt/hadoop/etc/hadoop/yarn-site.xml – set up the ResourceManager and NodeManagers (note: yarn.nodemanager.hostname as set below is only correct on the master; on the slaves, omit it or set it to each slave’s own hostname):

<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hmaster</value>
    </property>
    <property>
        <name>yarn.nodemanager.hostname</name>
        <value>hmaster</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>


17. Edit /opt/hadoop/etc/hadoop/slaves on master (so that the master can start all the necessary services on the slaves automagically):

hmaster
hslave1
hslave2
hslave3
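
After editing the configuration files on the master, you can also push them to the slaves in one go instead of editing each node by hand (a sketch assuming the same /opt/hadoop layout everywhere; the extra NameNode property from step 14 is simply unused on the slaves):

# for node in hslave1 hslave2 hslave3; do scp /opt/hadoop/etc/hadoop/*-site.xml $node:/opt/hadoop/etc/hadoop/; done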
18. Now the important step: disable the firewall and IPv6 (Hadoop does not support IPv6 and has problems listening on all interfaces via 0.0.0.0):

On CentOS/RHEL 6:

# service iptables stop
# chkconfig iptables off

On CentOS/RHEL 7 (which uses firewalld instead of the iptables service):

# systemctl stop firewalld
# systemctl disable firewalld
19. Add the following lines to /etc/sysctl.conf:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
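
Apply the settings without a reboot and verify (a value of 1 means IPv6 is disabled):

# sysctl -p
# cat /proc/sys/net/ipv6/conf/all/disable_ipv6
1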
20. Format the NameNode:

# su - hadoop
$ hdfs namenode -format
21. Start HDFS (as user hadoop):

$ start-dfs.sh
Check with jps whether a DataNode is running on each slave, and whether a DataNode, NameNode, and SecondaryNameNode are running on the master. Also try accessing http://hmaster:50070/
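
On the master, the output of jps should look roughly like this (the PIDs will of course differ):

$ jps
2688 NameNode
2778 DataNode
2896 SecondaryNameNode
3082 Jps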

22. Start YARN on master:

$ start-yarn.sh
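
After this, jps on the master should additionally show a ResourceManager and a NodeManager, and each slave should show a NodeManager next to its DataNode:

$ jps
2688 NameNode
2778 DataNode
2896 SecondaryNameNode
3131 ResourceManager
3431 NodeManager
3612 Jps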

23. Test the Hadoop services in a browser

The Hadoop NameNode web UI listens on port 50070 by default; open http://hmaster:50070/ in your favorite web browser. The YARN ResourceManager UI is available on port 8088 (http://hmaster:8088/).
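
As a final smoke test, run one of the MapReduce examples bundled with the distribution, e.g. the pi estimator (as user hadoop; the small 2/5 arguments just keep the job short):

$ hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 2 5

If the job completes and prints an estimate of pi, HDFS and YARN are working end to end.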