Install Hadoop 3.2: Setting up a Single Node Hadoop Cluster
A Hadoop cluster is a collection of independent commodity machines connected through a dedicated network (LAN) that work together as a single, centralized data-processing resource. You can configure a Hadoop cluster in two modes: pseudo-distributed mode and fully-distributed mode.
Pseudo-distributed mode is also known as a single-node cluster: the NameNode and DataNode run on the same machine, HDFS is used for storage, and all the Hadoop daemons are configured on a single node. Fully-distributed mode is the production deployment of Hadoop, where the NameNode and DataNodes are configured on different machines and data is distributed across the DataNodes.
In this article, we'll walk through step-by-step instructions to install Hadoop in pseudo-distributed mode on CentOS 7.
Step 1: Create a Hadoop User
Create a new user with root (sudo) privileges; this user will perform the Hadoop administration tasks. Start by logging in to your CentOS server as the root user.
Use the adduser command to add a new user to your system.
$ adduser hduser
Use the passwd command to update the new user’s password.
$ passwd hduser
By default, on CentOS, members of the wheel group have sudo privileges.
$ usermod -aG wheel hduser
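Optionally, you can verify that the new user has sudo access by switching to it and running a harmless privileged command, which should print root:
$ su - hduser
$ sudo whoami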
Step 2: Installation of Java
Download and install the Oracle Java 8 JDK using the commands below:
$ curl -L -b "oraclelicense=a" -O https://download.oracle.com/otn/java/jdk/8u212-b10/59066701cf1a433da9770636fbc4c9aa/jdk-8u212-linux-x64.rpm?AuthParam=1556006078_87220ee9f4a8e59beeeb3ff97c646447
$ sudo yum localinstall jdk-8u212-linux-x64.rpm
(Or)
Download the Java SE Development Kit 8u212 file from the Oracle website: jdk-8u212-linux-x64.rpm
https://www.oracle.com/technetwork/java/javaee/downloads/jdk8-downloads-2133151.html
$ sudo yum localinstall jdk-8u212-linux-x64.rpm
Setting up JAVA Environment Variables
Execute the command below to edit the ~/.bashrc file:
$ sudo gedit ~/.bashrc
Add the variables below to the file and save it:
# Java Environment Variables
export JAVA_HOME=/usr/java/jdk1.8.0_212-amd64/
export JRE_HOME=/usr/java/jdk1.8.0_212-amd64/jre
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
export PATH
Use the source command to force Linux to reload the .bashrc file.
$ source ~/.bashrc
If you have multiple Java versions installed on the server you can change the default version using the alternatives system utility:
$ sudo alternatives --config java
To change the default Java version, just enter the number corresponding to jdk1.8.0_212 when prompted and press Enter.
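You can confirm the active Java version and the JAVA_HOME value from ~/.bashrc afterwards; the first command should report version 1.8.0_212:
$ java -version
$ echo $JAVA_HOME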
Step 3: Setup SSH
Install OpenSSH Server:
Hadoop requires SSH access to all the nodes configured in the cluster. For the single-node setup of Hadoop, you need to configure SSH access to the localhost.
To install the server and client on CentOS:
$ sudo yum -y install openssh-server openssh-clients
On Debian/Ubuntu systems, the equivalent is:
$ sudo apt-get install openssh-client openssh-server
Start the service:
$ systemctl enable sshd.service
$ systemctl start sshd.service
Make sure port 22 is opened:
$ sudo yum -y install net-tools
$ netstat -tulpn | grep :22
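Port 22 is normally open on a fresh CentOS 7 install, but if firewalld is running and blocking SSH you can allow the service explicitly; this optional example uses the standard firewall-cmd tool:
$ sudo firewall-cmd --permanent --add-service=ssh
$ sudo firewall-cmd --reload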
OpenSSH Server Configuration:
Edit /etc/ssh/sshd_config
$ sudo gedit /etc/ssh/sshd_config
To enable root logins, edit or add as follows:
PermitRootLogin yes
Save and close the file. Restart sshd:
$ systemctl restart sshd.service
Set up password-less SSH:
Generating a new SSH public and private key pair on your local computer is the first step towards authenticating with a remote server without a password:
$ ssh-keygen -t rsa -P ''
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys
Now check that you can SSH to the localhost without a passphrase:
$ ssh localhost
Step 4: Download and Configure Hadoop
Download the Hadoop 3.2.0 tar file from the Apache website:
$ wget -c -O hadoop.tar.gz http://www-eu.apache.org/dist/hadoop/common/hadoop-3.2.0/hadoop-3.2.0.tar.gz
$ mkdir /usr/local/hadoop
$ chmod -R 755 /usr/local/hadoop
$ tar -xzvf /root/hadoop.tar.gz
$ mv /root/hadoop-3.2.0/* /usr/local/hadoop
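As a quick sanity check before setting up the environment variables in the next step, you can call the hadoop binary by its full path; it should report version 3.2.0:
$ /usr/local/hadoop/bin/hadoop version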
Step 5: Configure XML & Environment Files
Add a HADOOP_HOME environment variable pointing to your Hadoop installation, and add its bin and sbin directories to PATH so that you can run Hadoop commands from anywhere.
Edit $HOME/.bashrc file by adding the Hadoop environment variables.
$ sudo gedit ~/.bashrc
# Hadoop Environment Variables
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
$ source ~/.bashrc
In order to run Hadoop, it needs to know where Java is installed on your system.
Add the Java and Hadoop environment variables to the hadoop-env.sh file.
$ cd $HADOOP_HOME/etc/hadoop
$ gedit hadoop-env.sh
# JAVA
export JAVA_HOME=/usr/java/jdk1.8.0_212-amd64/
export JRE_HOME=/usr/java/jdk1.8.0_212-amd64/jre
# Hadoop Environment Variables
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_LOG_DIR=$HADOOP_HOME/logs
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
You need to edit the XML files located in the etc/hadoop directory of your Hadoop installation. The files to change, and the changes required, are listed below.
Create a directory for NameNode and DataNode:
$ mkdir -p /usr/local/hadoop/hadoop_store/tmp
$ chmod -R 755 /usr/local/hadoop/hadoop_store/tmp
$ mkdir -p /usr/local/hadoop/hadoop_store/namenode
$ mkdir -p /usr/local/hadoop/hadoop_store/datanode
$ chmod -R 755 /usr/local/hadoop/hadoop_store/namenode
$ chmod -R 755 /usr/local/hadoop/hadoop_store/datanode
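If you plan to run the Hadoop daemons as the hduser account created in Step 1 rather than as root, you will also want the installation and storage directories owned by that user; for example:
$ sudo chown -R hduser:hduser /usr/local/hadoop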
You can override the default settings used to start Hadoop by changing these files.
1. core-site.xml: Configuration settings for Hadoop Core, such as I/O settings common to HDFS and MapReduce.
$ gedit $HADOOP_HOME/etc/hadoop/core-site.xml
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/hadoop_store/tmp</value>
</property>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
2. yarn-site.xml: Configuration settings for the ResourceManager and NodeManager.
$ gedit $HADOOP_HOME/etc/hadoop/yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
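Depending on your environment, MapReduce containers may also need a few environment variables passed through by the NodeManager. The following optional property, placed inside the same <configuration> block of yarn-site.xml, is a sketch of the whitelist commonly used with Hadoop 3:
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>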
3. hdfs-site.xml: Configuration settings for the HDFS daemons, the NameNode, the Secondary NameNode, and the DataNodes.
$ gedit $HADOOP_HOME/etc/hadoop/hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>/usr/local/hadoop/hadoop_store/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/usr/local/hadoop/hadoop_store/datanode</value>
</property>
</configuration>
4. mapred-site.xml: Configuration settings for MapReduce applications.
$ gedit $HADOOP_HOME/etc/hadoop/mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
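On Hadoop 3.x, MapReduce jobs submitted to YARN can fail with missing-class errors unless the framework knows where the MapReduce libraries live. One common, optional way to handle this is to set HADOOP_MAPRED_HOME inside the same <configuration> block of mapred-site.xml; the values below assume the /usr/local/hadoop install path used in this article:
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
</property>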
Configure HDFS workers
Edit the workers file to include localhost as a DataNode as well.
$ gedit $HADOOP_HOME/etc/hadoop/workers
localhost
Step 6: Start Hadoop Daemons
Change the directory to /usr/local/hadoop/sbin
$ cd /usr/local/hadoop/sbin
Format the NameNode:
$ hdfs namenode -format
Start NameNode daemon and DataNode daemon.
$ start-dfs.sh
Start the YARN daemons.
$ start-yarn.sh
(Or) start all the daemons (HDFS and YARN) at once:
$ start-all.sh
Start the MapReduce JobHistory server.
$ mapred --daemon start historyserver
Use the jps command to verify that all the daemons are running.
$ jps
6168 ResourceManager
6648 Jps
5997 SecondaryNameNode
5758 DataNode
5631 NameNode
6294 NodeManager
Your single-node Hadoop cluster is ready!
NameNode – http://localhost:9870/
ResourceManager – http://localhost:8088/
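The JobHistory server UI is served on port 19888 by default (http://localhost:19888/). As a quick smoke test, you can create a home directory in HDFS and run one of the bundled example jobs; the jar name below matches the 3.2.0 release, so adjust it if your version differs:
$ hdfs dfs -mkdir -p /user/hduser
$ hdfs dfs -ls /user
$ yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.0.jar pi 2 5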