Question
I have installed Hadoop in pseudo-distributed mode on my laptop; the OS is Ubuntu. I have changed the paths where Hadoop stores its data (by default Hadoop stores data in the /tmp folder).
My hdfs-site.xml file looks like this:
<property>
  <name>dfs.data.dir</name>
  <value>/HADOOP_CLUSTER_DATA/data</value>
</property>
Now whenever I restart the machine and try to start the Hadoop cluster using the start-all.sh script, the data node never starts. I confirmed that the data node does not start by checking the logs and by using the jps command.
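A healthy pseudo-distributed cluster shows a DataNode entry in the jps listing; in my case that line was missing. Illustrative output (the PIDs and exact daemon list will vary):

$ jps
4690 NameNode
4815 SecondaryNameNode
4990 JobTracker
5120 TaskTracker
5300 Jps
(no DataNode line appears)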
Then I:
- stopped the cluster using the stop-all.sh script,
- formatted HDFS using the hadoop namenode -format command,
- started the cluster using the start-all.sh script (the full sequence is shown below).
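In shell form:

$ stop-all.sh
$ hadoop namenode -format   # re-initializes the NameNode metadata, wiping existing HDFS contents
$ start-all.sh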
Now everything works fine, even if I stop and start the cluster again. The problem occurs only when I restart the machine and then try to start the cluster.
- Has anyone encountered a similar problem?
- Why is this happening?
- How can we solve it?
Answer 1:
By changing dfs.datanode.data.dir away from /tmp you indeed made the data (the blocks) survive a reboot. However, there is more to HDFS than just blocks. You need to make sure all the relevant dirs point away from /tmp, most notably dfs.namenode.name.dir (I can't tell what other dirs you have to change; it depends on your config, but the namenode dir is mandatory and may also be sufficient).
I would also recommend using a more recent Hadoop distribution. By the way, in Hadoop 1.1 the namenode dir setting is dfs.name.dir.
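For example, a minimal hdfs-site.xml addition might look like this (the /HADOOP_CLUSTER_DATA/name path is an assumption, chosen to sit alongside the existing data dir; on Hadoop 1.x the property name would be dfs.name.dir instead):

<property>
  <name>dfs.namenode.name.dir</name>
  <value>/HADOOP_CLUSTER_DATA/name</value>
</property>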
Answer 2:
For those who use Hadoop 2.0 or above, the configuration property names may be different.
As this answer points out, go to the /etc/hadoop directory of your Hadoop installation and open the file hdfs-site.xml. This user configuration overrides the default Hadoop configuration, which the Java classloader loads first.
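For instance (assuming HADOOP_HOME points at your installation root; the editor choice is arbitrary):

$ cd $HADOOP_HOME/etc/hadoop
$ nano hdfs-site.xml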
Add the dfs.namenode.name.dir property and set a new namenode dir (the default is file://${hadoop.tmp.dir}/dfs/name). Do the same for the dfs.datanode.data.dir property (the default is file://${hadoop.tmp.dir}/dfs/data).
For example:
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/Users/samuel/Documents/hadoop_data/name</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/Users/samuel/Documents/hadoop_data/data</value>
</property>
Another property where a tmp dir appears is dfs.namenode.checkpoint.dir. Its default value is file://${hadoop.tmp.dir}/dfs/namesecondary.
If you want, you can also add this property:
<property>
  <name>dfs.namenode.checkpoint.dir</name>
  <value>/Users/samuel/Documents/hadoop_data/namesecondary</value>
</property>
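After adding these properties, you typically need to create the directories and format the namenode once before starting the cluster. A sketch, assuming the example paths above:

$ mkdir -p /Users/samuel/Documents/hadoop_data/name
$ mkdir -p /Users/samuel/Documents/hadoop_data/data
$ mkdir -p /Users/samuel/Documents/hadoop_data/namesecondary
$ hdfs namenode -format   # one-time re-initialization; destroys any existing HDFS contents
$ start-dfs.sh            # or start-all.sh on older distributions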
Source: https://stackoverflow.com/questions/20142111/why-do-we-need-to-format-hdfs-after-every-time-we-restart-machine