Big Data: Hadoop MapReduce
Part One: Getting Started with MapReduce
1.1 MapReduce
MapReduce is a programming framework for distributed computation; it is the core framework for developing data-analysis applications "based on Hadoop".
Its core function is to combine the user's business-logic code with the framework's built-in default components into a complete distributed program that runs concurrently on a Hadoop cluster.
1.2 Advantages and Disadvantages of MapReduce
1.2.1 Advantages
1) Easy to program. By implementing a few simple interfaces you get a distributed program that can run on a large number of inexpensive PCs; writing a MapReduce program feels much like writing an ordinary serial program.
2) Good scalability. When your computing resources can no longer satisfy demand, you can extend the computing capacity simply by adding machines.
3) High fault tolerance. MapReduce was designed from the start to run on cheap PC hardware, so it has to tolerate failures: if one machine goes down, its computation tasks are moved to another node so the job does not fail, and this is handled entirely inside Hadoop without manual intervention.
4) Suitable for offline processing of massive data sets at the PB scale and beyond.
1.2.2 Disadvantages
MapReduce is not well suited to real-time computation, streaming computation, or DAG (directed acyclic graph) computation:
1) Real-time computation: MapReduce cannot return results within milliseconds or seconds the way MySQL can.
2) Streaming computation: the input to a streaming computation is dynamic, whereas a MapReduce input data set must be static; MapReduce's design requires the data source to be fixed before the job runs.
3) DAG computation: when several jobs depend on one another, with each job's input coming from the previous job's output, MapReduce can technically be used, but every job writes its output to disk, causing heavy disk IO and very poor performance.
1.3 Core Ideas of MapReduce
1) A distributed computation program usually needs to be divided into at least two stages.
2) The concurrent map task instances of the first stage run fully in parallel and are independent of one another.
3) The concurrent reduce task instances of the second stage are also independent of one another, but their input depends on the output of all the map task instances from the previous stage.
4) The MapReduce programming model contains only one map stage and one reduce stage; if the business logic is very complex, the only option is to chain several MapReduce programs and run them serially.
1.4 MapReduce Processes
When a complete MapReduce program runs in distributed mode, there are three kinds of instance processes:
1) MrAppMaster: responsible for scheduling the whole job and coordinating its state.
2) MapTask: responsible for the entire data-processing flow of the map stage.
3) ReduceTask: responsible for the entire data-processing flow of the reduce stage.
1.5 MapReduce Programming Conventions
A user program is divided into three parts: the Mapper, the Reducer, and the Driver (the client that submits and runs the MR job).
1) Mapper stage
(1) A user-defined Mapper must extend the framework's Mapper parent class.
(2) The Mapper's input data comes as KV (key/value) pairs; the KV types can be customized.
(3) The Mapper's business logic is written in the map() method.
(4) The Mapper's output data is also KV pairs; the KV types can be customized.
(5) The map() method (run inside a map task) is called once for each input <K,V> pair.
2) Reducer stage
(1) A user-defined Reducer must extend the framework's Reducer parent class.
(2) The Reducer's input data types correspond to the Mapper's output data types, again KV pairs.
(3) The Reducer's business logic is written in the reduce() method.
(4) The reduce() method (run inside a reduce task) is called once for each group of <k,v> pairs that share the same key k.
3) Driver stage
The whole program needs a Driver to submit it; what gets submitted is a Job object that describes all the necessary information about the job. A minimal skeleton of these parts is sketched below.
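As a quick illustration of these conventions, here is a minimal sketch of the Mapper and Reducer shapes. The class names and KV types are placeholders chosen for illustration; the Driver, which builds and submits the Job object, is shown in full in the WordCount example later in these notes.

// Illustrative skeleton only: class names and KV types are placeholders.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>; map() runs once per input <K,V> pair.
class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // business logic: turn one input record into zero or more output KV pairs
    }
}

// The Reducer's input KV types must match the Mapper's output KV types;
// reduce() runs once per group of values sharing the same key.
class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // business logic: aggregate all values for one key and write the result
    }
}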
Part Two: Hadoop Serialization
2.1
2.2
2.3 Why Not Use Java Serialization
Java's native serialization (Serializable) is a heavyweight framework: once an object is serialized it carries a lot of extra information (various checks, headers, the inheritance hierarchy, and so on), which makes it inefficient to transmit over the network. Hadoop therefore provides its own lightweight serialization mechanism, Writable.
2.4 Why Serialization Matters to Hadoop
When Hadoop nodes communicate across the cluster they make RPC calls, and every RPC call must serialize its data; that serialization has to be fast, compact, and economical with bandwidth, which is why understanding Hadoop's serialization mechanism matters.
Because inter-node communication in Hadoop is implemented through remote procedure calls (RPC), RPC serialization needs the following properties:
1) Compact: a compact format makes full use of network bandwidth, the scarcest resource in a data center.
2) Fast: inter-process communication forms the backbone of a distributed system, so the performance cost of serialization and deserialization must be kept as low as possible.
3) Extensible: protocols change to meet new requirements, so it must be possible to introduce new protocol messages between client and server while the existing serialization format still supports them.
4) Interoperable: it should allow clients and servers written in different languages to talk to each other.
2.5 Common Data Serialization Types
The commonly used Java data types and their corresponding Hadoop serialization types:
Java type | Hadoop Writable type |
boolean | BooleanWritable |
byte | ByteWritable |
int | IntWritable |
float | FloatWritable |
long | LongWritable |
double | DoubleWritable |
string | Text |
map | MapWritable |
array | ArrayWritable |
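When a key or value needs to carry more than one field, none of the built-in types above fit, and the usual approach is to write a custom bean that implements Hadoop's Writable interface. The sketch below is illustrative only; the FlowBean name and its two fields are assumptions made for this example, not part of the table above. The fields must be read back in readFields() in exactly the order they were written in write(), and an empty constructor is needed so the framework can create instances by reflection.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical bean used as a MapReduce value; implements Hadoop's Writable.
public class FlowBean implements Writable {

    private long upFlow;    // example field: upstream traffic
    private long downFlow;  // example field: downstream traffic

    // No-arg constructor is required so the framework can instantiate it by reflection.
    public FlowBean() {
    }

    public FlowBean(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
    }

    // Serialization: write the fields out.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
    }

    // Deserialization: read the fields back in the same order they were written.
    @Override
    public void readFields(DataInput in) throws IOException {
        upFlow = in.readLong();
        downFlow = in.readLong();
    }

    @Override
    public String toString() {
        return upFlow + "\t" + downFlow;
    }
}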
Today's case studies:
MapReduce in Practice
1.1 WordCount Case Study
1.1.1 Requirement 1: count the number of occurrences of each word in a set of files
0) Requirement: given a set of text files, count and output the total number of occurrences of each word.
1) Input data:
2) Analysis: following the MapReduce programming conventions, write a Mapper, a Reducer, and a Driver.
3) Write the program
(1) Write the Mapper class:
package com.itstar.mapreduce.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1 Get one line of input
        String line = value.toString();
        // 2 Split the line into words
        String[] words = line.split(" ");
        // 3 Emit each word with a count of 1
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}
(2) Write the Reducer class:
package com.itstar.mapreduce.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> value, Context context)
            throws IOException, InterruptedException {
        // 1 Sum up all the counts for this word
        int sum = 0;
        for (IntWritable count : value) {
            sum += count.get();
        }
        // 2 Write out the word and its total count
        context.write(key, new IntWritable(sum));
    }
}
(3) Write the Driver class:
package com.itstar.mapreduce.wordcount;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordcountDriver {

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        // 1 Get the configuration and create the job object
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);

        // 2 Set the jar to load by locating the driver class
        job.setJarByClass(WordcountDriver.class);

        // 3 Set the Mapper and Reducer classes
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);

        // 4 Set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5 Set the final (reduce) output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7 Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
4) Test on the cluster
(1) Package the program into a jar and copy it to the Hadoop cluster.
(2) Start the Hadoop cluster.
(3) Run the wordcount program; an example command is sketched below.
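A typical submission, assuming the program was exported as wordcount.jar and using illustrative HDFS paths (the jar name and both paths are placeholders, not from the original steps):

hadoop jar wordcount.jar com.itstar.mapreduce.wordcount.WordcountDriver /user/itstar/wc/input /user/itstar/wc/output

Note that the output directory must not already exist; FileOutputFormat refuses to start the job if it does.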
5) Local test
(1) On Windows, configure the HADOOP_HOME environment variable.
(2) Run the program in IDEA.
(3) If IDEA does not print any log output to the console, create a file named "log4j.properties" under the project's src directory and put the logging configuration in it, for example the line below (a fuller sketch follows it):
log4j.rootLogger=INFO, stdout
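On its own, the rootLogger line only sets the log level and names an appender; for the messages to actually appear in the console, the stdout appender also has to be defined. A minimal sketch, assuming the standard log4j 1.x ConsoleAppender and an illustrative output pattern:

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n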
1.1.2 Requirement 2: partition words by the parity of the ASCII code of their first letter (custom Partitioner)
0) Requirement: use a custom Partitioner to split the output into two partitions according to whether the ASCII code of the first letter of each word is even or odd.
1) Define a custom Partitioner class:
package com.itstar.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordCountPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // 1 Get the first letter of the word (the key)
        String firWord = key.toString().substring(0, 1);
        char[] charArray = firWord.toCharArray();
        int result = charArray[0];
        // int result = key.toString().charAt(0);
        // 2 Partition by whether the ASCII code is even or odd
        if (result % 2 == 0) {
            return 0;
        } else {
            return 1;
        }
    }
}
2) In the driver, specify the custom partitioner and set the number of reduce tasks:
job.setPartitionerClass(WordCountPartitioner.class);
job.setNumReduceTasks(2);
1.1.3 Requirement 3: perform local aggregation of each maptask's output (Combiner)
0) Requirement: during the WordCount process, locally aggregate the output of each maptask to reduce the amount of data sent over the network, i.e. use a Combiner.
1) Input data:
Option 1:
(1) Add a WordcountCombiner class that extends Reducer:
package com.itstar.mapreduce.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // 1 Local aggregation: sum the counts for this word
        int count = 0;
        for (IntWritable v : values) {
            count += v.get();
        }
        // 2 Write out the partial total
        context.write(key, new IntWritable(count));
    }
}
(2) Specify the combiner in the WordcountDriver driver class:
// 9 Specify that a combiner should be used and which class provides its logic
job.setCombinerClass(WordcountCombiner.class);
Option 2:
(1) Use WordcountReducer itself as the combiner by specifying it in the WordcountDriver driver class:
// Specify the combiner; here the reducer class itself serves as the combiner
job.setCombinerClass(WordcountReducer.class);
Run the program and observe the result.
1.1.4 Requirement 4: optimize splits for a large number of small input files (CombineTextInputFormat)
0) Requirement: combine the large number of small input files so they are processed in a single split instead of one split per file.
1) Input data: prepare 5 small files.
2) Implementation:
(1) Without any changes, run the wordcount program from requirement 1 and observe that the number of splits is 5.
(2) Add the following code to WordcountDriver, run the program again, and observe that the number of splits is now 1:
// If no InputFormat is set, the default is TextInputFormat.class
job.setInputFormatClass(CombineTextInputFormat.class);
CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);  // 4 MB
CombineTextInputFormat.setMinInputSplitSize(job, 2097152);  // 2 MB