Big Data: Hadoop MapReduce
Part One: Getting Started with MapReduce
1.1 MapReduce
MapReduce is a programming framework for distributed computation; it is the core framework for developing data-analysis applications "based on Hadoop".
Its core function is to combine the user's business-logic code with the framework's built-in default components into a complete distributed program that runs concurrently on a Hadoop cluster.
1.2 Advantages and Disadvantages of MapReduce
1.2.1 Advantages
1) Easy to program. By implementing a few simple interfaces you get a distributed program that can run on a large number of inexpensive PCs; writing a MapReduce program feels much like writing an ordinary serial program.
2) Good scalability. When your computing resources can no longer satisfy demand, you can extend the computing capacity simply by adding machines.
3) High fault tolerance. MapReduce was designed from the start to run on cheap PC hardware, so it has to tolerate failures: if one machine goes down, its computation tasks are moved to another node so the job does not fail, and this is handled entirely inside Hadoop without manual intervention.
4) Suitable for offline processing of massive data sets at the PB scale and beyond.
1.2.2 Disadvantages
MapReduce is not well suited to real-time computation, streaming computation, or DAG (directed acyclic graph) computation:
1) Real-time computation: MapReduce cannot return results within milliseconds or seconds the way MySQL can.
2) Streaming computation: the input to a streaming computation is dynamic, whereas a MapReduce input data set must be static; MapReduce's design requires the data source to be fixed before the job runs.
3) DAG computation: when several jobs depend on one another, with each job's input coming from the previous job's output, MapReduce can technically be used, but every job writes its output to disk, causing heavy disk IO and very poor performance.
1.3 Core Ideas of MapReduce
1) A distributed computation program usually needs to be divided into at least two stages.
2) The concurrent map task instances of the first stage run fully in parallel and are independent of one another.
3) The concurrent reduce task instances of the second stage are also independent of one another, but their input depends on the output of all the map task instances from the previous stage.
4) The MapReduce programming model contains only one map stage and one reduce stage; if the business logic is very complex, the only option is to chain several MapReduce programs and run them serially.
1.4 MapReduce Processes
When a complete MapReduce program runs in distributed mode, there are three kinds of instance processes:
1) MrAppMaster: responsible for scheduling the whole job and coordinating its state.
2) MapTask: responsible for the entire data-processing flow of the map stage.
3) ReduceTask: responsible for the entire data-processing flow of the reduce stage.
1.5 MapReduce Programming Conventions
A user program is divided into three parts: the Mapper, the Reducer, and the Driver (the client that submits and runs the MR job).
1) Mapper stage
(1) A user-defined Mapper must extend the framework's Mapper parent class.
(2) The Mapper's input data comes as KV (key/value) pairs; the KV types can be customized.
(3) The Mapper's business logic is written in the map() method.
(4) The Mapper's output data is also KV pairs; the KV types can be customized.
(5) The map() method (run inside a map task) is called once for each input <K,V> pair.
2) Reducer stage
(1) A user-defined Reducer must extend the framework's Reducer parent class.
(2) The Reducer's input data types correspond to the Mapper's output data types, again KV pairs.
(3) The Reducer's business logic is written in the reduce() method.
(4) The reduce() method (run inside a reduce task) is called once for each group of <k,v> pairs that share the same key k.
3) Driver stage
The whole program needs a Driver to submit it; what gets submitted is a Job object that describes all the necessary information about the job. A minimal skeleton of these parts is sketched below.
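As a quick illustration of these conventions, here is a minimal sketch of the Mapper and Reducer shapes. The class names and KV types are placeholders chosen for illustration; the Driver, which builds and submits the Job object, is shown in full in the WordCount example later in these notes.

// Illustrative skeleton only: class names and KV types are placeholders.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>; map() runs once per input <K,V> pair.
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // business logic: turn one input record into zero or more output KV pairs
    }
}

// The Reducer's input KV types must match the Mapper's output KV types;
// reduce() runs once per group of values sharing the same key.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // business logic: aggregate all values for one key and write the result
    }
}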
Part Two: Hadoop Serialization
2.1
2.2
2.3 Why Not Use Java Serialization
Java's native serialization (Serializable) is a heavyweight framework: once an object is serialized it carries a lot of extra information (various checks, headers, the inheritance hierarchy, and so on), which makes it inefficient to transmit over the network. Hadoop therefore provides its own lightweight serialization mechanism, Writable.
2.4 Why Serialization Matters to Hadoop
When Hadoop nodes communicate across the cluster they make RPC calls, and every RPC call must serialize its data; that serialization has to be fast, compact, and economical with bandwidth, which is why understanding Hadoop's serialization mechanism matters.
Because inter-node communication in Hadoop is implemented through remote procedure calls (RPC), RPC serialization needs the following properties:
1) Compact: a compact format makes full use of network bandwidth, the scarcest resource in a data center.
2) Fast: inter-process communication forms the backbone of a distributed system, so the performance cost of serialization and deserialization must be kept as low as possible.
3) Extensible: protocols change to meet new requirements, so it must be possible to introduce new protocol messages between client and server while the existing serialization format still supports them.
4) Interoperable: it should allow clients and servers written in different languages to talk to each other.
2.5 Common Data Serialization Types
The commonly used Java data types and their corresponding Hadoop serialization types:
Java type | Hadoop Writable type |
boolean | BooleanWritable |
byte | ByteWritable |
int | IntWritable |
float | FloatWritable |
long | LongWritable |
double | DoubleWritable |
string | Text |
map | MapWritable |
array | ArrayWritable |
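When a key or value needs to carry more than one field, none of the built-in types above fit, and the usual approach is to write a custom bean that implements Hadoop's Writable interface. The sketch below is illustrative only; the FlowBean name and its two fields are assumptions made for this example, not part of the table above. The fields must be read back in readFields() in exactly the order they were written in write(), and an empty constructor is needed so the framework can create instances by reflection.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical bean used as a MapReduce value; implements Hadoop's Writable.
public class FlowBean implements Writable {

    private long upFlow;    // example field: upstream traffic
    private long downFlow;  // example field: downstream traffic

    // No-arg constructor is required so the framework can instantiate it by reflection.
    public FlowBean() {
    }

    public FlowBean(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
    }

    // Serialization: write the fields out.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
    }

    // Deserialization: read the fields back in the same order they were written.
    @Override
    public void readFields(DataInput in) throws IOException {
        upFlow = in.readLong();
        downFlow = in.readLong();
    }

    @Override
    public String toString() {
        return upFlow + "\t" + downFlow;
    }
}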
Today's case studies:
MapReduce in Practice
1.1 WordCount Case Study
1.1.1 Requirement 1: count the number of occurrences of each word in a set of files
0) Requirement: given a set of text files, count and output the total number of occurrences of each word.
1) Input data:
2) Analysis: following the MapReduce programming conventions, write a Mapper, a Reducer, and a Driver.
3) Write the program
(1) Write the Mapper class:
package com.itstar.mapreduce.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1 Get one line of input
        String line = value.toString();
        // 2 Split the line into words
        String[] words = line.split(" ");
        // 3 Emit each word with a count of 1
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}
(2) Write the Reducer class:
package com.itstar.mapreduce.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> value, Context context)
            throws IOException, InterruptedException {
        // 1 Sum up all the counts for this word
        int sum = 0;
        for (IntWritable count : value) {
            sum += count.get();
        }
        // 2 Write out the word and its total count
        context.write(key, new IntWritable(sum));
    }
}
(3) Write the Driver class:
package com.itstar.mapreduce.wordcount;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordcountDriver {

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        // 1 Get the configuration and create the job object
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);

        // 2 Set the jar to load by locating the driver class
        job.setJarByClass(WordcountDriver.class);

        // 3 Set the Mapper and Reducer classes
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);

        // 4 Set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5 Set the final (reduce) output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7 Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
4) Test on the cluster
(1) Package the program into a jar and copy it to the Hadoop cluster.
(2) Start the Hadoop cluster.
(3) Run the wordcount program; an example command is sketched below.
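A typical submission, assuming the program was exported as wordcount.jar and using illustrative HDFS paths (the jar name and both paths are placeholders, not from the original steps):

hadoop jar wordcount.jar com.itstar.mapreduce.wordcount.WordcountDriver /user/itstar/wc/input /user/itstar/wc/output

Note that the output directory must not already exist; FileOutputFormat refuses to start the job if it does.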
5) Local test
(1) On Windows, configure the HADOOP_HOME environment variable.
(2) Run the program in IDEA.
(3) If IDEA does not print any log output to the console, create a file named "log4j.properties" under the project's src directory and put the logging configuration in it, for example the line below (a fuller sketch follows it):
log4j.rootLogger=INFO, stdout
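On its own, the rootLogger line only sets the log level and names an appender; for the messages to actually appear in the console, the stdout appender also has to be defined. A minimal sketch, assuming the standard log4j 1.x ConsoleAppender and an illustrative output pattern:

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n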
1.1.2 Requirement 2: partition words by the parity of the ASCII code of their first letter (custom Partitioner)
0) Requirement: use a custom Partitioner to split the output into two partitions according to whether the ASCII code of the first letter of each word is even or odd.
1) Define a custom Partitioner class:
package com.itstar.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordCountPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // 1 Get the first letter of the word (the key)
        String firWord = key.toString().substring(0, 1);
        char[] charArray = firWord.toCharArray();
        int result = charArray[0];
        // int result = key.toString().charAt(0);
        // 2 Partition by whether the ASCII code is even or odd
        if (result % 2 == 0) {
            return 0;
        } else {
            return 1;
        }
    }
}
2) In the driver, specify the custom partitioner and set the number of reduce tasks:
job.setPartitionerClass(WordCountPartitioner.class);
job.setNumReduceTasks(2);
1.1.3 Requirement 3: perform local aggregation of each maptask's output (Combiner)
0) Requirement: during the WordCount process, locally aggregate the output of each maptask to reduce the amount of data sent over the network, i.e. use a Combiner.
1) Input data:
Option 1:
(1) Add a WordcountCombiner class that extends Reducer:
package com.itstar.mapreduce.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // 1 Local aggregation: sum the counts for this word
        int count = 0;
        for (IntWritable v : values) {
            count += v.get();
        }
        // 2 Write out the partial total
        context.write(key, new IntWritable(count));
    }
}
(2) Specify the combiner in the WordcountDriver driver class:
// 9 Specify that a combiner should be used and which class provides its logic
job.setCombinerClass(WordcountCombiner.class);
Option 2:
(1) Use WordcountReducer itself as the combiner by specifying it in the WordcountDriver driver class:
// Specify the combiner; here the reducer class itself serves as the combiner
job.setCombinerClass(WordcountReducer.class);
Run the program and observe the result.
1.1.4 Requirement 4: optimize splits for a large number of small input files (CombineTextInputFormat)
0) Requirement: combine the large number of small input files so they are processed in a single split instead of one split per file.
1) Input data: prepare 5 small files.
2) Implementation:
(1) Without any changes, run the wordcount program from requirement 1 and observe that the number of splits is 5.
(2) Add the following code to WordcountDriver, run the program again, and observe that the number of splits is now 1:
// If no InputFormat is set, the default is TextInputFormat.class
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);  // 4 MB
CombineTextInputFormat.setMinInputSplitSize(job, 2097152);  // 2 MB