A while back my classmate Chenxi (known in Dota circles as 利神) shared this problem: given a very large file containing both Chinese and English text, how do you find the words that occur most frequently? Let's break it down. To find high-frequency words we first need a tokenizer that supports Chinese (IK, Paoding, and so on); that is no big deal. After tokenization we count each word's occurrences, which is exactly the WordCount example, the hello world of MapReduce, so that is easy too. Finally we sort the results, since we want the most frequent words up front; MapReduce sorts by default, so no problem there either.
Solving this takes two jobs: a counting job and a sorting job.
The counting job's Mapper does the tokenization. We use the IKAnalyzer tokenizer here; IK can be hard to download from its official site, so I have prepared a copy for you: click here to download. After tokenizing, emit each word with a count of 1, exactly as in WordCount.
public static class AnalyzerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        breakupSentence(value.toString(), context);
    }

    /**
     * Tokenize a sentence with IK and emit each token with a count of 1.
     */
    private void breakupSentence(String sentence, Context context)
            throws IOException, InterruptedException {
        Analyzer analyzer = new IKAnalyzer(true); // true = smart (coarse-grained) segmentation
        TokenStream tokenStream = analyzer.tokenStream("content", new StringReader(sentence));
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset(); // mandatory before incrementToken() on Lucene 4.x; a no-op on 3.x
        while (tokenStream.incrementToken()) {
            word.set(charTermAttribute.toString());
            context.write(word, one);
        }
        tokenStream.close();
    }
}
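For reference, the imports this Mapper relies on look roughly like the following; the package names are from IK Analyzer 2012 with Lucene 4.x, so adjust them to your versions:
import java.io.IOException;
import java.io.StringReader;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;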
Don't forget to configure stop-word dictionaries for IK, to filter out modal particles, auxiliaries, conjunctions, measure words, and the like: 了, 呢, 啊, 的, "is", "and", "a", and so on. The dictionaries are plain UTF-8 text files with one word per line; IK reads IKAnalyzer.cfg.xml (and the dictionaries it lists) from the root of the classpath, so package them with the job jar.
IKAnalyzer.cfg.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Users may configure their own extension dictionary here
    <entry key="ext_dict">ext.dic;</entry>
    -->
    <!-- Users may configure their own extension stop-word dictionaries here -->
    <entry key="ext_stopwords">stopword.dic;chinese_stopword.dic</entry>
</properties>
chinese_stopword.dic
的
呢
吧
和
......
The counting job's Reducer sums each word's occurrences and is identical to the one in WordCount, so I won't belabor it. We can also register this Reducer as the job's CombinerClass: the combiner pre-aggregates map output on the map side before it is shipped to the reducers, cutting down the data transferred from Mappers to Reducers. Reusing the Reducer as a Combiner is safe here because summing is associative and commutative, and the Reducer's input and output types match.
public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all partial counts for this word.
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
Now for the sorting job. MapReduce sorts by key, but we need to sort by the number of occurrences, so we first swap key and value in the counting job's output, let the framework sort, and then swap them back at the end.
The sorting job's Mapper performs the key-value swap:
public static class SortMapper extends Mapper<Object, Text, IntWritable, Text> {

    private final IntWritable wordCount = new IntWritable();
    private Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Each input line is "word<TAB>count", as written by the counting job.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken().trim());
            wordCount.set(Integer.parseInt(tokenizer.nextToken().trim()));
            // Emit the count as the key so the framework sorts by it.
            context.write(wordCount, word);
        }
    }
}
The sorting job's Reducer simply swaps key and value back:
public static class SortReducer extends Reducer<IntWritable, Text, Text, IntWritable> {

    private Text result = new Text();

    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Several words may share the same count; emit each of them.
        for (Text val : values) {
            result.set(val.toString());
            context.write(result, key);
        }
    }
}
The default sort order for numeric keys is ascending, but we want the words with the most occurrences first, so we extend WritableComparator to invert the comparison (it is registered on the sorting job via setSortComparatorClass below):
public class DescWritableComparator extends WritableComparator {

    protected DescWritableComparator() {
        super(IntWritable.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Negate the natural (ascending) order to sort descending.
        return -super.compare(a, b);
    }
}
If there are multiple Reducer tasks, the default sort only orders the data sent to each individual Reducer. To achieve a global sort we have to write a Partitioner ourselves: a Partitioner routes each key to a Reducer according to a rule we define. Here, higher counts are routed to lower-numbered partitions, as the worked example after the code shows:
public static class SortPartitioner<K, V> extends Partitioner<K, V> {

    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // Assume no word we care about occurs maxValue or more times; tune this to your data.
        int maxValue = 50;
        int keySection = 0;
        // With a single reducer, or for counts >= maxValue, use partition 0.
        // (IntWritable.hashCode() returns the wrapped int, i.e. the count itself.)
        if (numReduceTasks > 1 && key.hashCode() < maxValue) {
            int sectionValue = maxValue / (numReduceTasks - 1);
            int count = 0;
            while ((key.hashCode() - sectionValue * count) > sectionValue) {
                count++;
            }
            // Larger counts get lower partition numbers, matching the descending sort.
            keySection = numReduceTasks - 1 - count;
        }
        return keySection;
    }
}
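To make the partition arithmetic concrete, here is a hypothetical walk-through (the counts, placeholder words, and numReduceTasks = 3 are made up; sectionValue works out to 50 / 2 = 25):
SortPartitioner<IntWritable, Text> p = new SortPartitioner<IntWritable, Text>();
// IntWritable.hashCode() returns the wrapped int, so the count itself drives the routing.
p.getPartition(new IntWritable(60), new Text("w1"), 3); // 60 >= 50, falls through -> partition 0
p.getPartition(new IntWritable(40), new Text("w2"), 3); // one loop iteration      -> partition 1
p.getPartition(new IntWritable(10), new Text("w3"), 3); // zero loop iterations    -> partition 2
Partition 0 thus ends up with the largest counts and partition 2 with the smallest, so concatenating the reducers' outputs in order (part-r-00000, part-r-00001, ...) yields one globally descending list.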
The last step is chaining the MapReduce jobs. There are two: the counting job runs first, then the sorting job, with the counting job's output directory used as the sorting job's input. (Friendly reminder: don't forget to set the Combiner on the counting job, and the Comparator and Partitioner on the sorting job.)
// Job 1: tokenize and count.
Job job1 = new Job(configuration, "key word analyzer");
job1.setJarByClass(JobDefiner.class);
job1.setMapperClass(AnalyzerMapper.class);
job1.setCombinerClass(CountReducer.class); // map-side pre-aggregation
job1.setReducerClass(CountReducer.class);
job1.setOutputKeyClass(Text.class);
job1.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job1, new Path(otherArgs[0]));
Path outPath1 = new Path(otherArgs[1]);
FileOutputFormat.setOutputPath(job1, outPath1);
if (!job1.waitForCompletion(true)) {
    System.exit(1); // don't run the sort if counting failed
}

// Job 2: sort by count, descending.
Job job2 = new Job(configuration, "result sort");
job2.setJarByClass(JobDefiner.class);
job2.setOutputKeyClass(IntWritable.class);
job2.setOutputValueClass(Text.class);
job2.setMapperClass(SortKeyWordHandler.SortMapper.class);
job2.setReducerClass(SortKeyWordHandler.SortReducer.class);
job2.setSortComparatorClass(DescWritableComparator.class); // keys in descending order
job2.setPartitionerClass(SortKeyWordHandler.SortPartitioner.class);
FileInputFormat.addInputPath(job2, outPath1); // job1's output is job2's input
FileOutputFormat.setOutputPath(job2, new Path(otherArgs[2]));
job2.waitForCompletion(true);
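The snippet above refers to configuration and otherArgs without defining them; the original code for the enclosing main method isn't shown, so here is a minimal sketch modeled on the stock WordCount driver (the usage string and argument check are my assumptions):
public static void main(String[] args) throws Exception {
    // Requires org.apache.hadoop.conf.Configuration and org.apache.hadoop.util.GenericOptionsParser.
    Configuration configuration = new Configuration();
    String[] otherArgs = new GenericOptionsParser(configuration, args).getRemainingArgs();
    if (otherArgs.length != 3) {
        // Arguments: counting input, counting output (= sorting input), sorting output.
        System.err.println("Usage: keyword <in> <counted> <sorted>");
        System.exit(2);
    }
    // ... the job1 / job2 code shown above goes here ...
}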
All done? Not so fast!! My post Eclipse远程调试Hadoop集群 only covered how to configure a local Eclipse to remotely debug a Hadoop cluster; here is a demonstration of actually running the job.
First, upload two news reports about Xi Jinping to HDFS:
bin/hadoop dfs -mkdir input
bin/hadoop dfs -put mupeng/files/test_chinese* input
Refresh the DFS Locations view in Eclipse and the files show up.
Find the class whose main method defines the jobs, right-click it, and choose Run As => Run Configurations...
After confirming that Project and Main class are correct, set the program arguments: the counting job's input path, the counting job's output path (which doubles as the sorting job's input path), and the sorting job's output path:
hdfs://192.168.248.149:9000/user/mupeng/input
hdfs://192.168.248.149:9000/user/mupeng/output1
hdfs://192.168.248.149:9000/user/mupeng/output2
With everything set, click Run. In the second output path we see the result (I have only one Reducer here):
引用 20
强调 16
习近平 14
斐济 13
我们 13
斐 12
中国 11
对 10
中 10
等 9
中方 9
为 9
方 9
......
Finally:
Source code for this post (GitHub): click to view
Related post: Eclipse远程调试Hadoop集群
Source: oschina
Link: https://my.oschina.net/u/76720/blog/358138