I have a 32-core system. When I run a MapReduce job using Hadoop, I never see the java process use more than 150% CPU (according to top), and it usually stays around 100%.
I think you need to set "mapreduce.framework.name" to "yarn", because the default value is "local".
Put the following into your mapred-site.xml:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
There could be two issues, which I outline below. I'd also like to point out that this is a very common question and you should look at the previously asked Hadoop questions.
Your mapred.tasktracker.map.tasks.maximum could be set too low in conf/mapred-site.xml. This is the issue if, when you check the JobTracker, you see several pending tasks but only a few running tasks. Each task is a single thread, so you would hypothetically need a maximum of 32 slots on that node.
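As a rough sketch, the slot limit could be raised with a property like the following in conf/mapred-site.xml. The value 32 is an assumption chosen to match the 32 cores in the question (the Hadoop 1.x default is 2); you may want fewer map slots if the node also runs reduce tasks or other services, and the TaskTracker has to be restarted to pick up the change.

```xml
<!-- conf/mapred-site.xml: maximum number of map tasks run at once by one TaskTracker -->
<!-- 32 is an illustrative value matching the 32-core machine in the question -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>32</value>
</property>
```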
Otherwise, your data is likely not being split into enough chunks. Are you running over a small amount of data? It could be that your MapReduce job is running over only a few input splits and thus does not require more mappers. Try running your job over hundreds of MB of data instead and see if you still have the same issue. Hadoop automatically splits your files. The number of blocks a file is split into is the total size of the file divided by the block size, rounded up. For example, a 200 MB file with a 64 MB block size is stored as 4 blocks. By default, one map task is assigned to each block (not each file).
In your conf/hdfs-site.xml configuration file, there is a dfs.block.size parameter. Most people set this to 64 or 128 MB. However, if you are trying to do something tiny, you could set it lower so that the work is split into more pieces.
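For illustration, a smaller block size could be configured like this in conf/hdfs-site.xml. The 16 MB figure is an assumption picked only as an example, not a recommendation. The value is given in bytes, and it applies only to files written after the change, so existing input would need to be re-uploaded to HDFS.

```xml
<!-- conf/hdfs-site.xml: block size for newly written files, in bytes -->
<!-- 16777216 bytes = 16 MB, an illustrative value only -->
<property>
  <name>dfs.block.size</name>
  <value>16777216</value>
</property>
```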
You can also manually split your file into 32 chunks.