Install Hadoop 3.2: Setting up a Single Node Hadoop Cluster
A Hadoop cluster is a collection of independent commodity machines connected through a dedicated network (LAN) that work together as a single, centralized data-processing resource. You can configure a Hadoop cluster in two modes: pseudo-distributed mode and fully-distributed mode.
Pseudo-distributed mode is also known as a single-node cluster: the NameNode and DataNode run on the same machine, HDFS is used for storage, and all the Hadoop daemons are configured on a single node. Fully-distributed mode is the production deployment of Hadoop, where the NameNode and DataNodes are configured on different machines and data is distributed across the DataNodes.
In this article, we'll walk through step-by-step instructions to install Hadoop in pseudo-distributed mode on CentOS 7.
Step 1: Create a Hadoop User
Create a new user with root (sudo) privileges; this user will perform the Hadoop administration tasks. Start by logging in to your CentOS server as the root user.
Use the adduser command to add a new user to your system.
$ adduser hduser
Use the passwd command to update the new user’s password.
$ passwd hduser
By default, on CentOS, members of the wheel group have sudo privileges.
$ usermod -aG wheel hduser
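Optionally, you can verify that the new user has sudo access by switching to it and running a harmless privileged command, which should print root:
$ su - hduser
$ sudo whoami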
Step 2: Installation of Java
Download and install the Oracle Java 8 JDK using the commands below:
$ curl -L -b "oraclelicense=a" -O https://download.oracle.com/otn/java/jdk/8u212-b10/59066701cf1a433da9770636fbc4c9aa/jdk-8u212-linux-x64.rpm?AuthParam=1556006078_87220ee9f4a8e59beeeb3ff97c646447
$ sudo yum localinstall jdk-8u212-linux-x64.rpm
(Or)
Download the Java SE Development Kit 8u212 file from the Oracle website: jdk-8u212-linux-x64.rpm
https://www.oracle.com/technetwork/java/javaee/downloads/jdk8-downloads-2133151.html
$ sudo yum localinstall jdk-8u212-linux-x64.rpm
Setting up JAVA Environment Variables
Execute the command below to edit the ~/.bashrc file:
$ sudo gedit ~/.bashrc
Add the variables below to the file and save it:
# Java Environment Variables
export JAVA_HOME=/usr/java/jdk1.8.0_212-amd64/
export JRE_HOME=/usr/java/jdk1.8.0_212-amd64/jre
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
export PATH
Use the source command to force Linux to reload the .bashrc file.
$ source ~/.bashrc
If you have multiple Java versions installed on the server you can change the default version using the alternatives system utility:
$ sudo alternatives --config java
To change the default Java version, just enter the number corresponding to jdk1.8.0_212 when prompted and press Enter.
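You can confirm the active Java version and the JAVA_HOME value from ~/.bashrc afterwards; the first command should report version 1.8.0_212:
$ java -version
$ echo $JAVA_HOME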
Step 3: Setup SSH
Install OpenSSH Server:
Hadoop requires SSH access to all the nodes configured in the cluster. For the single-node setup of Hadoop, you need to configure SSH access to the localhost.
To install the server and client on CentOS:
$ sudo yum -y install openssh-server openssh-clients
On Debian/Ubuntu systems, the equivalent is:
$ sudo apt-get install openssh-client openssh-server
Start the service:
$ systemctl enable sshd.service
$ systemctl start sshd.service
Make sure port 22 is opened:
$ sudo yum -y install net-tools
$ netstat -tulpn | grep :22
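Port 22 is normally open on a fresh CentOS 7 install, but if firewalld is running and blocking SSH you can allow the service explicitly; this optional example uses the standard firewall-cmd tool:
$ sudo firewall-cmd --permanent --add-service=ssh
$ sudo firewall-cmd --reload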
OpenSSH Server Configuration:
Edit /etc/ssh/sshd_config
$ sudo gedit /etc/ssh/sshd_config
To enable root logins, edit or add as follows:
PermitRootLogin yes
Save and close the file. Restart sshd:
$ systemctl restart sshd.service
Set up password-less SSH:
Generating a new SSH public and private key pair on your local computer is the first step towards authenticating with a remote server without a password:
$ ssh-keygen -t rsa -P ''
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
Now check that you can SSH to the localhost without a passphrase:
$ ssh localhost
Step 4: Download and Configure Hadoop
Download the Hadoop 3.2.0 tar file from the Apache website:
$ wget -c -O hadoop.tar.gz http://www-eu.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
$ mkdir /usr/local/hadoop
$ chmod -R 755 /usr/local/hadoop
$ tar -xzvf /root/hadoop.tar.gz
$ mv /root/hadoop-3.2.0/* /usr/local/hadoop
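As a quick sanity check before setting up the environment variables in the next step, you can call the hadoop binary by its full path; it should report version 3.2.0:
$ /usr/local/hadoop/bin/hadoop version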
Step 5: Configure XML & Environment Files
Add a HADOOP_HOME environment variable pointing to your Hadoop installation, and add its bin and sbin directories to PATH so that you can run Hadoop commands from anywhere.
Edit $HOME/.bashrc file by adding the Hadoop environment variables.
$ sudo gedit ~/.bashrc
# Hadoop Environment Variables
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
$ source ~/.bashrc
In order to run Hadoop, it needs to know where Java is installed on your system.
Add the Java and Hadoop environment variables to the hadoop-env.sh file.
$ cd $HADOOP_HOME/etc/hadoop
$ gedit hadoop-env.sh
# JAVA
export JAVA_HOME=/usr/java/jdk1.8.0_212-amd64/
export JRE_HOME=/usr/java/jdk1.8.0_212-amd64/jre
# Hadoop Environment Variables
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
You need to edit the XML files located in the etc/hadoop directory of your Hadoop installation. The files to change, and the changes required, are listed below.
Create a directory for NameNode and DataNode:
$ mkdir -p /usr/local/hadoop/hadoop_store/tmp
$ chmod -R 755 /usr/local/hadoop/hadoop_store/tmp
$ mkdir -p /usr/local/hadoop/hadoop_store/namenode
$ mkdir -p /usr/local/hadoop/hadoop_store/datanode
$ chmod -R 755 /usr/local/hadoop/hadoop_store/namenode
$ chmod -R 755 /usr/local/hadoop/hadoop_store/datanode
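If you plan to run the Hadoop daemons as the hduser account created in Step 1 rather than as root, you will also want the installation and storage directories owned by that user; for example:
$ sudo chown -R hduser:hduser /usr/local/hadoop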
You can override the default settings used to start Hadoop by changing these files.
1. core-site.xml: Configuration settings for Hadoop Core, such as I/O settings common to HDFS and MapReduce.
$ gedit $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/hadoop_store/tmp</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
2. yarn-site.xml: Configuration settings for the ResourceManager and NodeManager.
$ gedit $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
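Depending on your environment, MapReduce containers may also need a few environment variables passed through by the NodeManager. The following optional property, placed inside the same <configuration> block of yarn-site.xml, is a sketch of the whitelist commonly used with Hadoop 3:
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>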
3. hdfs-site.xml: Configuration settings for the HDFS daemons, the NameNode, the Secondary NameNode, and the DataNodes.
$ gedit $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/hadoop_store/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/hadoop_store/datanode</value>
</property>
</configuration>
4. mapred-site.xml: Configuration settings for MapReduce applications.
$ gedit $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
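On Hadoop 3.x, MapReduce jobs submitted to YARN can fail with missing-class errors unless the framework knows where the MapReduce libraries live. One common, optional way to handle this is to set HADOOP_MAPRED_HOME inside the same <configuration> block of mapred-site.xml; the values below assume the /usr/local/hadoop install path used in this article:
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>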
Configure HDFS workers
Edit the workers file to include localhost as a DataNode as well.
$ gedit $HADOOP_HOME/etc/hadoop/workers
localhost
Step 6: Start Hadoop Daemons
Change the directory to /usr/local/hadoop/sbin
$ cd /usr/local/hadoop/sbin
Format the NameNode:
$ hdfs namenode -format
Start NameNode daemon and DataNode daemon.
$ start-dfs.sh
Start the YARN daemons.
$ start-yarn.sh
(Or) start all the daemons (HDFS and YARN) at once:
$ start-all.sh
Start the MapReduce JobHistory server.
$ mapred --daemon start historyserver
Use the jps command to verify that all the daemons are running.
$ jps
6168 ResourceManager
6648 Jps
5997 SecondaryNameNode
5758 DataNode
5631 NameNode
6294 NodeManager
Your single-node Hadoop cluster is ready!
NameNode – http://localhost:9870/
ResourceManager – http://localhost:8088/
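The JobHistory server UI is served on port 19888 by default (http://localhost:19888/). As a quick smoke test, you can create a home directory in HDFS and run one of the bundled example jobs; the jar name below matches the 3.2.0 release, so adjust it if your version differs:
$ hdfs dfs -mkdir -p /user/hduser
$ hdfs dfs -ls /user
$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar pi 2 5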