MapReduce Basics


Getting Started with MapReduce

1.1 What Is MapReduce

MapReduce is a programming framework for distributed computation, and the core framework with which users develop "Hadoop-based data analysis applications".

Its core function is to combine the business-logic code written by the user with its own built-in default components into a complete distributed program that runs concurrently on a Hadoop cluster.

1.2 Advantages and Disadvantages of MapReduce

1.2.1 Advantages

1) MapReduce is easy to program. By implementing a few interfaces you obtain a distributed program that can run on a large number of inexpensive PC machines; writing it feels the same as writing a simple serial program. This is what made MapReduce programming so popular.

2) Good scalability. When your computing resources can no longer satisfy demand, you can extend computing capacity simply by adding machines.

3) High fault tolerance. MapReduce was designed to be deployed on cheap PC machines, which requires high fault tolerance: if one machine fails, its computing tasks are transferred to another node so the job does not fail, and Hadoop handles this entirely on its own, without human intervention.

4) Suitable for offline processing of massive data sets at the PB scale and above, with clusters of thousands of servers working concurrently to provide data-processing capacity.

1.2.2 Disadvantages

MapReduce is not good at real-time computation, streaming computation, or DAG (directed-graph) computation.

1) Real-time computation: MapReduce cannot return results within milliseconds or seconds the way MySQL can.

2) Streaming computation: the input data of streaming computation is dynamic, whereas a MapReduce input data set must be static; MapReduce's own design dictates that the data source cannot change while the job runs.

3) DAG (directed-graph) computation: when multiple applications depend on one another, with each program's input being the previous one's output, MapReduce can technically be used, but every MapReduce job writes its result to disk, producing heavy disk I/O and very poor performance.

1.3 MapReduce Core Ideas

1) A distributed computation program usually needs to be divided into at least two stages.

2) The concurrent MapTask instances of the first stage run fully in parallel and are independent of one another.

3) The concurrent ReduceTask instances of the second stage are also independent of one another, but their data depends on the output of all the MapTask instances of the previous stage.

4) The MapReduce programming model can contain only one map stage and one reduce stage; if the user's business logic is very complex, the only option is to run several MapReduce programs in series.

1.4 MapReduce Processes

A complete MapReduce program running in distributed mode consists of three kinds of instance processes:

1) MrAppMaster: responsible for process scheduling and state coordination of the whole program.

2) MapTask: responsible for the entire data-processing flow of the map stage.

3) ReduceTask: responsible for the entire data-processing flow of the reduce stage.

1.5 MapReduce Programming Conventions

A user-written program is divided into three parts: the Mapper, the Reducer, and the Driver (the client that submits the MR program for execution).

1) Mapper stage

(1) The user-defined Mapper must extend the framework's Mapper parent class.

(2) The Mapper's input data comes as KV pairs (the K and V types can be customized).

(3) The Mapper's business logic is written in the map() method.

(4) The Mapper's output data is also KV pairs (the K and V types can be customized).

(5) The map() method (in the MapTask process) is called once for every input <K,V> pair.

2) Reducer stage

(1) The user-defined Reducer must extend the framework's Reducer parent class.

(2) The Reducer's input data type corresponds to the Mapper's output data type, also KV pairs.

(3) The Reducer's business logic is written in the reduce() method.

(4) The ReduceTask process calls the reduce() method once for each group of <k,v> pairs that share the same k.

3) Driver stage

The Driver is effectively the YARN cluster's client: it submits the whole program to the YARN cluster, in the form of a Job object that encapsulates the MapReduce program's run parameters.

Hadoop Serialization

2.1 What Is Serialization

Serialization converts in-memory objects into a byte sequence (or another storage/transfer format) so they can be persisted to disk or sent over the network; deserialization is the reverse process.

2.2 Why Serialize

"Live" objects exist only inside a local process and cannot be sent directly to another computer; serialization makes it possible to store such objects and to ship them between machines.

2.3 Why Not Use Java Serialization

Java serialization (Serializable) is a heavyweight framework: a serialized object carries a lot of extra baggage (various checksum information, a header, the inheritance hierarchy, and so on), which makes efficient network transfer difficult. Hadoop therefore developed its own serialization mechanism, Writable, which is lean and efficient.
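To make Writable concrete, here is a minimal sketch of a custom type (the CountPair class and its fields are hypothetical, invented for illustration; the Writable interface itself is real): implement write() and readFields(), read the fields back in exactly the order they were written, and keep a no-argument constructor so the framework can instantiate the class by reflection.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class CountPair implements Writable {

    private long upCount;
    private long downCount;

    // the framework instantiates Writables by reflection, so keep a no-arg constructor
    public CountPair() {
    }

    public CountPair(long upCount, long downCount) {
        this.upCount = upCount;
        this.downCount = downCount;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // serialize the fields in a fixed order
        out.writeLong(upCount);
        out.writeLong(downCount);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // deserialize in exactly the same order as write()
        upCount = in.readLong();
        downCount = in.readLong();
    }

    @Override
    public String toString() {
        // TextOutputFormat uses toString() when writing values as text
        return upCount + "\t" + downCount;
    }
}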

2.4 Why Serialization Matters in Hadoop

Hadoop must serialize data whenever nodes in the cluster communicate or make RPC calls, and that serialization has to be fast, compact, and light on bandwidth, so understanding Hadoop's serialization mechanism is essential.

Serialization and deserialization come up constantly in distributed data processing, in inter-process communication and in permanent storage. In Hadoop, communication between processes on different nodes of the cluster is implemented through Remote Procedure Calls (RPC), and RPC serialization is expected to have the following properties:

1) Compact: a compact format makes full use of network bandwidth, the scarcest resource in a data center;

2) Fast: inter-process communication forms the backbone of a distributed system, so the performance overhead of serialization and deserialization must be kept to a minimum;

3) Extensible: protocols change to meet new requirements, so it must be possible to introduce new protocols between client and server while the existing serialization format still supports the new protocol messages;

4) Interoperable: clients and servers written in different languages should be able to interact.

2.5 Common Data Serialization Types

Commonly used Java data types and the Hadoop serialization types they correspond to:

Java type      Hadoop Writable type
boolean        BooleanWritable
byte           ByteWritable
int            IntWritable
float          FloatWritable
long           LongWritable
double         DoubleWritable
String         Text
map            MapWritable
array          ArrayWritable
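As a quick illustration of the mapping above (a minimal sketch; the variable names are arbitrary):

Text word = new Text("hello");          // String -> Text
IntWritable one = new IntWritable(1);   // int -> IntWritable
String s = word.toString();             // Text -> String
int n = one.get();                      // IntWritable -> int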

Today's examples:

MapReduce in Practice

1.1 WordCount Examples

1.1.1 Example 1: Count the number of occurrences of each word in a set of files

0) Requirement: given a set of text files, count and output the total number of times each word occurs.

1) Input data:

2) Analysis: following the MapReduce programming conventions, write a Mapper, a Reducer, and a Driver.

3) Write the program

(1) Write the Mapper class:

package com.itstar.mapreduce.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // reused output key/value objects, to avoid allocating new ones for every record
    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1 convert the line of text handed to us by the MapTask into a String
        String line = value.toString();
        // 2 split the line into words on spaces
        String[] words = line.split(" ");
        // 3 emit each word as <word, 1>
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}

(2) Write the Reducer class:

package com.itstar.mapreduce.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> value,
            Context context) throws IOException, InterruptedException {
        // 1 sum up the counts for this key (the framework reuses the IntWritable
        // instances in the iterable, so read each value immediately)
        int sum = 0;
        for (IntWritable count : value) {
            sum += count.get();
        }
        // 2 write out the total count for this word
        context.write(key, new IntWritable(sum));
    }
}

(3) Write the Driver class:

package com.itstar.mapreduce.wordcount;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordcountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1 get the configuration and a Job instance
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // 2 set the jar load path (located via the driver class)
        job.setJarByClass(WordcountDriver.class);
        // 3 set the Mapper and Reducer classes
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);
        // 4 set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 5 set the final (reduce) output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 6 set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 7 submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

4) Testing on the cluster

(1) Package the program as a jar and copy it to the Hadoop cluster.

(2) Start the Hadoop cluster.

(3) Run the wordcount program, as in the command sketched below.
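For example (the jar name and the HDFS input/output paths here are placeholders chosen for illustration, not paths from the original setup):

hadoop jar wc.jar com.itstar.mapreduce.wordcount.WordcountDriver /user/itstar/input /user/itstar/output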

5) Local testing

(1) On Windows, configure the HADOOP_HOME environment variable.

(2) Run the program in IDEA.

(3) Note: if IDEA prints no log output to the console, create a file named "log4j.properties" under the project's src directory containing:

log4j.rootLogger=INFO, stdout
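This first line alone refers to an appender named stdout that must itself be defined; a minimal complete file (a common log4j 1.x console setup, assumed here rather than taken from the original) looks like:

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n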

1.1.2 Example 2: Partition words by the parity of their first character's ASCII code (Partitioner)

0) Requirement: send the wordcount output to two different partitions (and hence two output files) according to whether the ASCII code of each word's first character is even or odd.

1) Define a custom Partitioner:

package com.itstar.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordCountPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // 1 get the first character of the key
        String firWord = key.toString().substring(0, 1);
        char[] charArray = firWord.toCharArray();
        int result = charArray[0];
        // int result = key.toString().charAt(0); // equivalent shortcut
        // 2 route even ASCII codes to partition 0, odd ones to partition 1
        if (result % 2 == 0) {
            return 0;
        } else {
            return 1;
        }
    }
}

2) In the driver, register the custom Partitioner and set the number of ReduceTasks to match the number of partitions (two ReduceTasks produce the two output files part-r-00000 and part-r-00001):

job.setPartitionerClass(WordCountPartitioner.class);

job.setNumReduceTasks(2);

1.1.3 Example 3: Locally aggregate the output of each MapTask (Combiner)

0) Requirement: during the counting process, locally aggregate each MapTask's output to reduce the amount of data transferred over the network, i.e. use the Combiner feature.

1) Data:

Option 1: (1) add a WordcountCombiner class that extends Reducer:

package com.itstar.mapreduce.wordcount;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        // 1 local aggregation: sum the counts for this key
        int count = 0;
        for (IntWritable v : values) {
            count += v.get();
        }
        // 2 write out
        context.write(key, new IntWritable(count));
    }
}

(2) Specify the combiner in the WordcountDriver class:

// 9 enable the combiner and specify which class implements its logic

job.setCombinerClass(WordcountCombiner.class);

Option 2: use WordcountReducer itself as the combiner, specified in the WordcountDriver class (this works here because summing word counts is associative and commutative, and the Reducer's input and output types are identical):

// enable the combiner and specify which class implements its logic

job.setCombinerClass(WordcountReducer.class);

Run the program.

1.1.4 Example 4: Split optimization for large numbers of small files (CombineTextInputFormat)

0) Requirement: merge a large number of small input files into a single split so they are processed together.

1) Input data: prepare 5 small files.

2) Implementation

(1) Without any changes, run the wordcount program from Example 1 and observe that the number of splits is 5.

(2) Add the following code to WordcountDriver, run the program again, and observe that the number of splits is now 1:

// if no InputFormat is set, the default is TextInputFormat.class

job.setInputFormatClass(CombineTextInputFormat.class);

CombineTextInputFormat.setMaxInputSplitSize(job, 4194304);// 4m

CombineTextInputFormat.setMinInputSplitSize(job, 2097152);// 2m
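Note that this snippet also needs the corresponding import in the driver; the class lives alongside the other input formats used above:

import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;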
