Question
I have a 6-node cluster: 5 DataNodes (DN) and 1 NameNode (NN). All have 32 GB RAM. All slaves have 8.7 TB HDD. The NN has 1.1 TB HDD. Here is the link to my core-site.xml, hdfs-site.xml, yarn-site.xml.
After running an MR job, I checked my RAM usage, which is shown below:
Namenode:
free -g
              total        used        free      shared  buff/cache   available
Mem:             31           7          15           0           8          22
Swap:            31           0          31
Datanode:
Slave1:
free -g
              total        used        free      shared  buff/cache   available
Mem:             31           6           6           0          18          24
Swap:            31           3          28
Slave2:
              total        used        free      shared  buff/cache   available
Mem:             31           2           4           0          24          28
Swap:            31           1          30
Likewise, the other slaves have similar RAM usage. Even when only a single job is running, any other submitted jobs enter the ACCEPTED state and wait for the first job to finish before they start.
Here is the ps output for the JAR that I submitted to run the MR job:
/opt/jdk1.8.0_77//bin/java -Dproc_jar -Xmx1000m
-Dhadoop.log.dir=/home/hduser/hadoop/logs -Dyarn.log.dir=/home/hduser/hadoop/logs
-Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log
-Dyarn.home.dir= -Dyarn.id.str= -Dhadoop.root.logger=INFO,console
-Dyarn.root.logger=INFO,console -Dyarn.policy.file=hadoop-policy.xml
-Dhadoop.log.dir=/home/hduser/hadoop/logs -Dyarn.log.dir=/home/hduser/hadoop/logs
-Dhadoop.log.file=yarn.log -Dyarn.log.file=yarn.log
-Dyarn.home.dir=/home/hduser/hadoop -Dhadoop.home.dir=/home/hduser/hadoop
-Dhadoop.root.logger=INFO,console -Dyarn.root.logger=INFO,console
-classpath --classpath of jars
org.apache.hadoop.util.RunJar abc.jar abc.mydriver2 /raw_data /mr_output/02
Are there any settings that I can change or add to allow multiple jobs to run simultaneously and speed up the current data processing? I am using Hadoop 2.5.2. The cluster is in a PROD environment and I cannot take it down to update the Hadoop version.
EDIT 1: I started a new MR job with 362 GB of data, and RAM usage is still around 8 GB, with 22 GB of RAM free. Here is my job submission command:
nohup yarn jar abc.jar def.mydriver1 /raw_data /mr_output/01 &
Here is some more information:
18/11/22 14:09:07 INFO input.FileInputFormat: Total input paths to process : 130363
18/11/22 14:09:10 INFO mapreduce.JobSubmitter: number of splits:130372
Are there any additional memory parameters that we can pass when submitting the job to use memory more efficiently?
Answer 1:
I believe you can override these in mapred-site.xml (mapred-default.xml only lists the default values).
The parameters you are looking for are:
- mapreduce.job.running.map.limit
- mapreduce.job.running.reduce.limit
A value of 0 (probably what it is set to at the moment) means unlimited.
Looking at your memory, 32 GB per machine seems too small.
What CPUs/cores do you have? I would expect at least quad CPU / 16 cores per machine.
Answer 2:
Based on your yarn-site.xml, your yarn.scheduler.minimum-allocation-mb setting of 10240 is too high. This effectively means you have at best 18 vcores available. That might be the right setting for a cluster with tons of memory, but for 32 GB it's way too large. Drop it to 1 or 2 GB.
Remember, each mapper typically consumes one HDFS block. So 1-2 GB of memory for 128 MB of data sounds more reasonable. The added benefit is that you could have up to 180 vcores available, which will process jobs 10x faster than 18 vcores.
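The arithmetic behind this can be sketched in a few lines of Python. This is only a rough model, assuming the scheduler rounds each container request up to a multiple of yarn.scheduler.minimum-allocation-mb and that containers are packed per node. The per-node memory figure (NODE_MB) and the 1024 MB task request are hypothetical values chosen for illustration, not taken from the asker's yarn-site.xml, so the totals differ from the 18 and 180 quoted above.

```python
import math

def normalized_mb(request_mb, min_alloc_mb):
    """Round a container request up to a multiple of the scheduler minimum."""
    multiples = max(1, math.ceil(request_mb / min_alloc_mb))
    return multiples * min_alloc_mb

def containers_per_node(node_mb, request_mb, min_alloc_mb):
    """How many containers of the normalized size fit into one node's YARN memory."""
    return node_mb // normalized_mb(request_mb, min_alloc_mb)

# Hypothetical values for illustration only.
NODE_MB = 24 * 1024   # assumed memory given to YARN on each worker
NODES = 5             # the five DataNodes from the question
REQUEST_MB = 1024     # assumed per-task memory request

for min_alloc in (10240, 2048, 1024):
    per_node = containers_per_node(NODE_MB, REQUEST_MB, min_alloc)
    print(f"min-allocation {min_alloc:>5} MB: "
          f"{per_node:>2} containers/node, {per_node * NODES:>3} in the cluster")
```

Under these assumptions, a 1 GB request is inflated to 10 GB when the minimum is 10240, so only a couple of tasks fit on each node no matter how small the tasks are.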
Answer 3:
To give you an idea, here is how a 4-node cluster with 32 cores and 128 GB RAM per node is set up:
For Tez: divide RAM by cores to get the maximum Tez container size. In my case: 128/32 = 4 GB.
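As a quick check of that rule of thumb, here is a minimal Python sketch. The function name is made up for illustration, and the conversion to megabytes is included only because memory settings are commonly expressed in MB.

```python
def max_container_gb(ram_gb_per_node, cores_per_node):
    """Rule of thumb from the answer: per-node RAM divided by per-node cores."""
    return ram_gb_per_node / cores_per_node

size_gb = max_container_gb(128, 32)   # the 128 GB / 32 core node described above
size_mb = int(size_gb * 1024)         # same value expressed in megabytes
print(size_gb, size_mb)
```

The same function can be applied to a 32 GB node once its core count is known.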
TEZ:
YARN:
Source: https://stackoverflow.com/questions/53406172/hadoop-multinode-cluster-too-slow-how-do-i-increase-speed-of-data-processing