Question
Hi, I have an application that reads records from HBase and writes them into text files. The application works as expected, but when I tested it on a large data set it took 1.20 hours to complete the job. Here are the details of my application:
- The size of the data in HBase is approximately 400 GB, about 2 billion records.
- I have created 400 regions in the HBase table, so there are 400 mappers.
- I use a custom Partitioner that puts records into 194 text files.
- I use LZO compression for the map output and gzip for the final output.
- I use MD5 hashing for my row key (see the sketch after this list).
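A minimal sketch of the MD5 row-key hashing mentioned above (the actual key-building code is not part of this question; MD5Hash and Bytes are HBase utility classes, and the class and method names here are only illustrative):

import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.MD5Hash;

public class RowKeys {
    // Hashing the natural key spreads rows evenly across the 400 regions.
    public static byte[] makeRowKey(String naturalKey) {
        return Bytes.toBytes(MD5Hash.getMD5AsHex(Bytes.toBytes(naturalKey)));
    }
}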
I use a custom partitioner for data segregation. There are 194 partitions and 194 reducers, and every reducer completes very quickly except the last two, which receive a very large number of records because of the partitioning condition.
I do not know how to handle this situation.
My condition is such that two partitions will always get a large number of records, and I cannot change that.
All the other reducers complete within 3 minutes, but because of those two the overall job takes 30 minutes.
Here is my implementation:
hbaseConf.set("mapreduce.map.output.compress", "true");
hbaseConf.set("mapreduce.map.output.compress.codec", "org.apache.hadoop.io.compress.SnappyCodec");
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
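A simplified sketch of the rest of the job wiring (only the compression settings above are copied from the real job; the driver and mapper class names, the table name, and the output path are placeholders, and the partitioner class name refers to the sketch further down):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

import java.io.IOException;

public class HBaseExportDriver {

    // Placeholder mapper: emits one Text record per HBase row.
    public static class ExportMapper extends TableMapper<Text, Text> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result result, Context context)
                throws IOException, InterruptedException {
            // The real job builds the "...|^|...|^|..." record here; this is a stand-in.
            context.write(new Text(rowKey.toString()), new Text(result.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration hbaseConf = HBaseConfiguration.create();

        // Compression settings quoted above
        hbaseConf.set("mapreduce.map.output.compress", "true");
        hbaseConf.set("mapreduce.map.output.compress.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");

        Job job = Job.getInstance(hbaseConf, "hbase-to-text-export");
        job.setJarByClass(HBaseExportDriver.class);

        // One mapper per region of the 400-region table ("my_table" is a placeholder)
        TableMapReduceUtil.initTableMapperJob(
                "my_table", new Scan(), ExportMapper.class,
                Text.class, Text.class, job);

        // The custom partitioner (quoted below) routes rows into 194 reducers
        job.setPartitionerClass(CountryYearPartitioner.class);
        job.setNumReduceTasks(194);

        job.setOutputFormatClass(TextOutputFormat.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}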
Here is my Partitioner logic:
if (str.contains("Japan|^|2017|^|" + strFileName + "")) {
return 0;
} else if (str.contains("Japan|^|2016|^|" + strFileName + "")) {
return 1;
} else if (str.contains("Japan|^|2015|^|" + strFileName + "")) {
return 2;
} else if (str.contains("Japan|^|2014|^|" + strFileName + "")) {
return 3;
} else if (str.contains("Japan|^|2013|^|" + strFileName + "")) {
return 4;
}
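A minimal sketch of how that fragment sits inside the Partitioner class (the class name, the fallback return, the use of the value rather than the key, and the way strFileName is obtained are illustrative; with 194 partitions the real branch chain runs up to return 193):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class CountryYearPartitioner extends Partitioner<Text, Text> {

    // Where strFileName really comes from is not shown above; it is
    // hardcoded here only so the sketch is self-contained.
    private static final String strFileName = "example_file";

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String str = value.toString();

        if (str.contains("Japan|^|2017|^|" + strFileName)) {
            return 0;
        } else if (str.contains("Japan|^|2016|^|" + strFileName)) {
            return 1;
        }
        // ... remaining country/year branches, ending at return 193 ...
        return 0;  // fallback for rows that match no branch
    }
}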
Source: https://stackoverflow.com/questions/43112385/one-reducer-in-custom-partitioner-makes-mapreduce-jobs-slower