The NameNode is the single most important Hadoop service. It maintains the locations of all of the data blocks in the cluster and the state of the entire distributed filesystem. When a NameNode fails, it is possible to recover from a previous checkpoint generated by the Secondary NameNode. It is important to note that the Secondary NameNode is not a backup for the NameNode; it only performs a periodic checkpoint process. The data will almost certainly be stale when recovering from a Secondary NameNode checkpoint. However, recovering from a NameNode failure using an old filesystem state is better than not being able to recover at all.
This recipe assumes that the machine hosting the NameNode service has failed, and that the Secondary NameNode is running on a separate machine. In addition, the fs.checkpoint.dir property should have been set in the core-site.xml file (core-default.xml only holds the read-only defaults and should not be edited). This property tells the Secondary NameNode where to save its checkpoints on the local filesystem.
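For reference, the property stanza looks like the following; the path shown is only a placeholder and should be replaced with a real local directory:

```xml
<property>
  <name>fs.checkpoint.dir</name>
  <value>/path/to/hadoop/cache/hadoop/dfs/namesecondary</value>
  <description>Where the Secondary NameNode stores checkpoint
  images and edits on the local filesystem.</description>
</property>
```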
Carry out the following steps to recover from a NameNode failure:
1. Stop the Secondary NameNode:

   $ cd /path/to/hadoop
   $ bin/hadoop-daemon.sh stop secondarynamenode

2. Set up a new machine to act as the NameNode, configured in the exact manner the failed NameNode was set up: ssh password-less login should be configured, and it should have the same IP address and hostname as the previous NameNode.

3. Copy all of the data in the fs.checkpoint.dir folder on the Secondary NameNode to the dfs.name.dir folder on the new NameNode machine.

4. Start the new NameNode:

   $ bin/hadoop-daemon.sh start namenode

5. Restart the Secondary NameNode:

   $ bin/hadoop-daemon.sh start secondarynamenode

6. Verify that the NameNode started successfully by visiting its web UI at http://head:50070/.

We first logged into the Secondary NameNode and stopped the service. Next, we set up a new machine in the exact manner we set up the failed NameNode. We then copied all of the checkpoint and edits files from the Secondary NameNode to the new NameNode. This allows us to recover the filesystem status, metadata, and edits as of the time of the last checkpoint. Finally, we restarted the new NameNode and the Secondary NameNode.
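The copy from the Secondary NameNode's checkpoint folder to the new NameNode's name folder is a plain recursive copy. The sketch below illustrates it with cp -r and placeholder local paths; in practice the source sits on the Secondary NameNode host, so scp or rsync would be used instead, and the directory contents here are simulated only so the copy can be demonstrated end to end:

```shell
# Placeholder paths; substitute the real fs.checkpoint.dir and dfs.name.dir values.
CHECKPOINT_DIR=/tmp/namesecondary   # fs.checkpoint.dir on the Secondary NameNode
NAME_DIR=/tmp/namedir               # dfs.name.dir on the new NameNode

# Simulate a checkpoint directory (fsimage and edits live under current/).
mkdir -p "$CHECKPOINT_DIR/current"
touch "$CHECKPOINT_DIR/current/fsimage" "$CHECKPOINT_DIR/current/edits"

# Copy the entire checkpoint tree into the (empty) name directory.
mkdir -p "$NAME_DIR"
cp -r "$CHECKPOINT_DIR/." "$NAME_DIR/"

ls "$NAME_DIR/current"
```

Once the files are in place, starting the NameNode against dfs.name.dir picks up the copied image and edits.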
Recovering using the old data is unacceptable for certain processing environments. An alternative is to set up some type of offsite storage to which the NameNode writes its image and edits files directly. This way, if there is a hardware failure of the NameNode, you can recover the latest filesystem state without resorting to restoring stale data from the Secondary NameNode checkpoint.
The first step is to designate a new machine to hold the NameNode image and edits file backups. Next, mount the backup machine's storage on the NameNode server. Finally, modify the hdfs-site.xml file on the server running the NameNode so that it writes to both the local filesystem and the backup machine's mount:
$ cd /path/to/hadoop
$ vi conf/hdfs-site.xml

<property>
  <name>dfs.name.dir</name>
  <value>/path/to/hadoop/cache/hadoop/dfs,/path/to/backup</value>
</property>
Now the NameNode will write all of the filesystem metadata to both the /path/to/hadoop/cache/hadoop/dfs folder and the mounted /path/to/backup folder.
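Because the two directories listed in dfs.name.dir should receive identical writes, a quick sanity check is to diff them recursively. The sketch below uses placeholder directories and simulates the NameNode's dual writes so the check can be demonstrated; substitute the real dfs.name.dir entries in practice:

```shell
# Placeholder directories standing in for the two dfs.name.dir entries.
PRIMARY=/tmp/dfs-primary
BACKUP=/tmp/dfs-backup

# Simulate the NameNode writing the same metadata to both locations.
mkdir -p "$PRIMARY/current" "$BACKUP/current"
echo "image-v1" > "$PRIMARY/current/fsimage"
echo "image-v1" > "$BACKUP/current/fsimage"

# An empty diff means the two copies are in sync.
diff -r "$PRIMARY" "$BACKUP" && echo "in sync"
```

Any divergence between the two trees (for example, a stale backup mount) shows up immediately in the diff output.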