The objective of this tutorial is to describe, step by step, how to install Hadoop on a cluster of nodes. The example uses one master node and four slave nodes.
Platform
- Operating System (OS). You can use Ubuntu 18.04.4 LTS or a later version; other Linux distributions such as Red Hat or CentOS will also work.
- Hadoop. We have used Apache Hadoop 3.1.2; you can use the Cloudera distribution or another distribution as well.
Download Software
- VMware Player for Windows: https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/7_0
- Ubuntu: http://releases.ubuntu.com/18.04.4/ubuntu-18.04.4-desktop-amd64
- Eclipse for Windows: https://www.eclipse.org/downloads/
- PuTTY for Windows: http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
- WinSCP for Windows: http://winscp.net/eng/download.php
- Hadoop: https://archive.apache.org/dist/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz
We will install Hadoop 3.1.2 on the cluster of nodes described below: one master and four slaves.
Installation of Hadoop on Master Node
Let’s install Hadoop on the master node (HadoopMasternode: 185.150.1.20).
Step 1. Edit the hosts file on the master node and add the hostname entries for the master and all slave nodes.
cloudduggu@ubuntu:~$ sudo nano /etc/hosts
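As an illustration, the entries might look like the following. Only the master's name and address come from this tutorial; the slave hostnames and IP addresses are placeholders, so substitute the values for your own nodes:

```
185.150.1.20    HadoopMasternode
185.150.1.21    HadoopSlave1    # placeholder name/IP
185.150.1.22    HadoopSlave2    # placeholder name/IP
185.150.1.23    HadoopSlave3    # placeholder name/IP
185.150.1.24    HadoopSlave4    # placeholder name/IP
```

The same entries will later be needed on every slave node so that all machines can resolve each other by hostname.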
Step 2. Verify whether Java is installed on the master node by running the java -version command. If it is installed, the Java version details will be displayed.
Otherwise, you can install OpenJDK 8 using the below command.
cloudduggu@ubuntu:~$ sudo apt-get install openjdk-8-jdk
Step 3. Once Java is installed, update the package source list using the below command.
cloudduggu@ubuntu:~$ sudo apt-get update
Step 4. Now install SSH on the master node using the below command.
cloudduggu@ubuntu:~$ sudo apt-get install openssh-server openssh-client
Step 5. Once SSH is installed, generate a key pair for passwordless SSH from the master to the slaves.
cloudduggu@ubuntu:~$ ssh-keygen -t rsa -P ""
Step 6. Now copy the content of .ssh/id_rsa.pub from the master node to all slaves in .ssh/authorized_keys.
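One convenient way to copy the public key is ssh-copy-id, run once per slave. The slave hostnames below are the placeholder names from the example hosts entries; use your own:

```
cloudduggu@ubuntu:~$ ssh-copy-id cloudduggu@HadoopSlave1
cloudduggu@ubuntu:~$ ssh-copy-id cloudduggu@HadoopSlave2
cloudduggu@ubuntu:~$ ssh-copy-id cloudduggu@HadoopSlave3
cloudduggu@ubuntu:~$ ssh-copy-id cloudduggu@HadoopSlave4
```

ssh-copy-id appends the master's .ssh/id_rsa.pub to .ssh/authorized_keys on each slave, which is exactly what this step asks for; it prompts for the slave user's password one last time.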
Step 7. Once the key is copied, verify passwordless login from the master node to each slave node.
The master can now connect to a slave machine simply by running ssh followed by the slave node's name.
Step 8. Now we are ready to install Hadoop on the master node. Download the software from the below link.
https://archive.apache.org/dist/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz
In our case, it is present at the below location.
/home/cloudduggu/hadoop-3.1.2.tar.gz
Step 9. Now let us untar the file.
cloudduggu@ubuntu:~$ tar xzf hadoop-3.1.2.tar.gz
Hadoop Configuration Files Setup
Step 10. Open the .bashrc file in the user's home directory and add the below parameters so the environment records the location of Hadoop.
$ nano .bashrc
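The lines to add would look like the following. This assumes the extracted directory has been renamed (or symlinked) to /home/cloudduggu/hadoop, matching the paths used later in this tutorial; adjust the path if yours differs:

```
export HADOOP_HOME=/home/cloudduggu/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```

Run `source .bashrc` (or open a new shell) afterwards so the variables take effect.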
Step 11. Now we will set JAVA_HOME in the hadoop-env.sh file.
hadoop-env.sh file location: /home/cloudduggu/hadoop/etc/hadoop/
Java installation location: /usr/lib/jvm/java-8-openjdk-i386/
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-i386/
Step 12. Now open the core-site.xml file, located under "/hadoop/etc/hadoop", and add the below parameter.
cloudduggu@ubuntu:~/hadoop/etc/hadoop$ nano core-site.xml
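A minimal core-site.xml for this cluster might look like the following. Pointing fs.defaultFS at the master's hostname on port 9000 is the conventional choice, not a value taken from the original screenshots:

```xml
<configuration>
  <!-- Default filesystem: the NameNode running on the master node -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://HadoopMasternode:9000</value>
  </property>
</configuration>
```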
Step 13. Open the hdfs-site.xml file, located under "/hadoop/etc/hadoop", and add the below parameters.
cloudduggu@ubuntu:~/hadoop/etc/hadoop$ nano hdfs-site.xml
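A typical hdfs-site.xml for a cluster with four DataNodes might look like the following. The storage directories shown are assumed paths under the Hadoop home used in this tutorial; create them (or choose your own) before starting HDFS:

```xml
<configuration>
  <!-- Number of block replicas; 3 is the HDFS default and fits four DataNodes -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <!-- Where the NameNode keeps its metadata (master node) -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/cloudduggu/hadoop/data/namenode</value>
  </property>
  <!-- Where each DataNode stores its blocks (slave nodes) -->
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/cloudduggu/hadoop/data/datanode</value>
  </property>
</configuration>
```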
Step 14. Open the mapred-site.xml file, located under "/hadoop/etc/hadoop", and add the below parameters.
cloudduggu@ubuntu:~/hadoop/etc/hadoop$ nano mapred-site.xml
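A sketch of mapred-site.xml for running MapReduce on YARN is shown below. The HADOOP_MAPRED_HOME path assumes the Hadoop home used in this tutorial; Hadoop 3.x jobs need it exported into the task environment:

```xml
<configuration>
  <!-- Run MapReduce jobs on YARN rather than locally -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/home/cloudduggu/hadoop</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/home/cloudduggu/hadoop</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/home/cloudduggu/hadoop</value>
  </property>
</configuration>
```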
Step 15. Open the yarn-site.xml file, located under "/hadoop/etc/hadoop", and add the below parameters.
cloudduggu@ubuntu:~/hadoop/etc/hadoop$ nano yarn-site.xml
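A minimal yarn-site.xml for this topology might look like the following, with the ResourceManager on the master node and the shuffle service enabled on every NodeManager:

```xml
<configuration>
  <!-- ResourceManager runs on the master node -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>HadoopMasternode</value>
  </property>
  <!-- Auxiliary shuffle service required by MapReduce -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```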
Step 16. Now configure the master and slave node lists under "/hadoop/etc/hadoop".
Configuring Master Node
Configuring Slave Node
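In Hadoop 3.x the list of slave (worker) nodes lives in the workers file under /hadoop/etc/hadoop (earlier releases called this file slaves). Using the placeholder hostnames from the example hosts entries, it would contain one hostname per line:

```
HadoopSlave1
HadoopSlave2
HadoopSlave3
HadoopSlave4
```

The start-dfs.sh and start-yarn.sh scripts read this file to decide where to launch DataNode and NodeManager daemons.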
We have now set up Hadoop successfully on the master node. Let’s configure the slave nodes.
Step 17. Update the hosts file on each slave node with the same entries that were added on the master node.
Step 18. Verify whether Java is installed on all slave nodes by running the java -version command. If it is installed, the Java version details will be displayed.
Otherwise, you can install OpenJDK 8 using the below command.
$ sudo apt-get install openjdk-8-jdk
Step 19. Now copy the configured Hadoop files from the master node to all slave nodes using the below commands.
Step 20. Now untar that file on all slave nodes.
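Steps 19 and 20 can be sketched as the loop below, run from the master node. The archive name hadoop-configured.tar.gz and the slave hostnames are illustrative choices, not values from the original; the idea is to pack the already-configured Hadoop directory, ship it, and unpack it on each slave:

```
# Pack the configured Hadoop directory on the master.
cloudduggu@ubuntu:~$ tar czf hadoop-configured.tar.gz hadoop

# Copy and unpack it on every slave (passwordless SSH set up earlier).
cloudduggu@ubuntu:~$ for node in HadoopSlave1 HadoopSlave2 HadoopSlave3 HadoopSlave4; do
    scp hadoop-configured.tar.gz cloudduggu@$node:/home/cloudduggu/
    ssh cloudduggu@$node "tar xzf /home/cloudduggu/hadoop-configured.tar.gz"
done
```

Copying the configured directory, rather than the original download, ensures every slave receives the edited configuration files from Steps 11 through 16.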
Step 21. Installation is completed on all slave nodes; now format HDFS on the master node using the below command. (Perform this activity only once, because reformatting erases all data in HDFS.)
$ bin/hdfs namenode -format
Step 22. Start HDFS and YARN from the master node using the below commands.
To start HDFS services run the below command.
$ sbin/start-dfs.sh
To start YARN services run the below command.
$ sbin/start-yarn.sh
Step 23. Verify that the services are running on the master and slave nodes.
On the master node run the below command.
On the slave nodes, run the below command.
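The standard check is the jps command that ships with the JDK, which lists the running Java daemons. On a healthy cluster of this shape you would typically see daemons like the following (process IDs will vary):

```
# On the master node:
cloudduggu@ubuntu:~$ jps
<pid> NameNode
<pid> SecondaryNameNode
<pid> ResourceManager
<pid> Jps

# On each slave node:
cloudduggu@ubuntu:~$ jps
<pid> DataNode
<pid> NodeManager
<pid> Jps
```

If a daemon is missing, check its log file under the Hadoop logs directory on that node.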
With that, we have completed the Hadoop installation on a multi-node cluster.