Question
I've been searching online for a proper tutorial about how to use map and reduce, but almost every WordCount example out there doesn't really explain how to use each function. I've seen everything about the theory, the keys, the map, etc., but there is no CODE doing anything different than WordCount.
I am using Ubuntu 20.10 on VirtualBox and Hadoop 3.2.1 (comment if you need any more info).
My task is to process a file that contains data for athletes who took part in the Olympics.
You will see that it contains a variety of info, like name, sex, age, weight, height, etc.
I will show an example here (hope you understand it):
ID Name Sex Age Height Weight Team NOC Games Year Season City Sport Event Medal
1 A Dijiang M 24 180 80 China CHN 1992 Summer 1992 Summer Barcelona Basketball Basketball Men's Basketball NA
Until now I only had to deal with fields that are the same in every occurrence of a record, like the name or the ID.
My problem is that the same participant can appear more than once (at different periods of time), so reduce can't recognise those records as belonging to the same athlete.
If I could change the key that the reduce function groups on to, for example, the participant's name, then I should get my correct result.
In this code I search for players that won at least one medal.
My main is:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class NewWordCount {
    public static void main(String[] args) throws Exception {
        // expect exactly two arguments: the input and the output path
        if (args.length != 2) {
            System.err.println("Give the correct arguments.");
            System.exit(2);
        }
        // Job 1.
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "count");
        job.setJarByClass(NewWordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(NewWordMapper.class);
        job.setCombinerClass(NewWordReducer.class);
        job.setReducerClass(NewWordReducer.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
My Mapper is:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class NewWordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    private final String gold = "Gold";
    private final String silver = "Silver";
    private final String bronze = "Bronze";

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // skip the header line at byte offset 0
        if (key.get() == 0) {
            return;
        }
        String line = value.toString();
        String[] arrOfStr = line.split(",");

        // column index 14 holds the type of medal each player has won (if any)
        if (arrOfStr.length > 14) {
            String medal = arrOfStr[14];
            // checking if the athlete won any medal at all
            if (medal.equals(gold) || medal.equals(silver) || medal.equals(bronze)) {
                // build a composite key out of the athlete's attributes
                String name = arrOfStr[1];
                String sex = arrOfStr[2];
                String age = arrOfStr[3];
                String team = arrOfStr[6];
                String games = arrOfStr[8];
                String sport = arrOfStr[12];
                String sum = name + "," + sex + "," + age + "," + team + "," + sport + "," + games;
                word.set(sum);
                context.write(word, one);
            }
        }
    }
}
My Reducer is:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class NewWordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable val : values) {
            // val is only the IntWritable count emitted by the mapper;
            // there is no record line to split here, we just sum the values
            count += val.get();
        }
        context.write(key, new IntWritable(count));
    }
}
Answer 1:
The core idea of MapReduce jobs is that the Map function is used to extract valuable information from the input and "transform" it into key-value pairs, on which the Reduce function is then executed separately for each key. Your code seems to show a misunderstanding about how the latter is executed, but that's no biggie, because this makes for a proper example of going beyond WordCount.
Let's say we have a file with stats of Olympic athletes and their medal performance, like the one you showed, under a directory named /olympic_stats in the HDFS, as shown below (notice that I included several records for the same athlete, since that is what this example needs to work on):
1,A Dijiang,M,24,180,80,China,CHN,1992,Summer 1992,Summer,Barcelona,Basketball,Men's Basketball,NA
2,T Kekempourdas,M,33,189,85,Greece,GRE,2004,Summer 2004,Summer,Athens,Judo,Men's Judo,Gold
3,T Kekempourdas,M,33,189,85,Greece,GRE,2000,Summer 2000,Summer,Sydney,Judo,Men's Judo,Bronze
4,K Stefanidi,F,29,183,76,Greece,GRE,2016,Summer 2016,Summer,Rio,Pole Vault, Women's Pole Vault,Silver
5,A Jones,F,26,160,56,Canada,CAN,2012,Summer 2012,Summer,London,Acrobatics,Women's Acrobatics,Gold
5,A Jones,F,26,160,56,Canada,CAN,2016,Summer 2012,Summer,Rio,Acrobatics,Women's Acrobatics,Gold
6,C Glover,M,33,175,80,USA,USA,2008,Summer 2008,Summer,Beijing,Archery,Men's Archery,Gold
7,C Glover,M,33,175,80,USA,USA,2012,Summer 2012,Summer,London,Archery,Men's Archery,Gold
8,C Glover,M,33,175,80,USA,USA,2016,Summer 2016,Summer,Rio,Archery,Men's Archery,Gold
For the Map function, we need to find the one column of the data that is good to use as a key in order to count how many gold medals each athlete has. As we can easily see above, every athlete can have one or more records, and they all have his/her name in the second column, so we can safely use the name as the key of the key-value pairs. As for the value: we want to count how many gold medals an athlete has, so we have to check the 15th column (index 14), which indicates whether, and which, medal this athlete won. If this column is equal to the String Gold, then we can be sure that this athlete has at least 1 gold medal in his/her career so far. So here, as the value, we can just put 1.
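To make this concrete, these are the key-value pairs the Map function emits for the sample input above (only the Gold records produce output):
(T Kekempourdas, 1)
(A Jones, 1)
(A Jones, 1)
(C Glover, 1)
(C Glover, 1)
(C Glover, 1)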
Now for the Reduce function: since it is executed separately for each different key, we know that all the input values it gets from the mappers belong to the exact same athlete. And since the key-value pairs generated by the mappers have 1 as the value for each gold medal of the given athlete, we can just add all these 1's up and get the total number of gold medals for each one of them, as the worked example below shows.
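Between the two phases, the framework groups the mappers' output by key and sorts it, so each reduce call receives one athlete's name together with all of the 1's emitted for him/her:
(A Jones, [1, 1])
(C Glover, [1, 1, 1])
(T Kekempourdas, [1])
Summing each list gives the number of gold medals per athlete.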
So the code for this is like the one below (I'm putting the mapper, reducer, and driver in the same file for the sake of simplicity):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
public class GoldMedals {
    /* input:  <byte_offset, line_of_dataset>
     * output: <Athlete's Name, 1>
     */
    public static class Map extends Mapper<Object, Text, Text, IntWritable> {
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String record = value.toString();
            String[] columns = record.split(",");

            // extract the athlete's name and his/her medal indication
            String athlete_name = columns[1];
            String medal = columns[14];

            // only hold the gold medal athletes, with their name as the key
            // and 1 as the least number of gold medals they have so far
            if (medal.equals("Gold"))
                context.write(new Text(athlete_name), new IntWritable(1));
        }
    }

    /* input:  <Athlete's Name, 1>
     * output: <Athlete's Name, Athlete's Total Gold Medals>
     */
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int sum = 0;

            // for a single athlete, add up all of the gold medals they have so far...
            for (IntWritable value : values)
                sum += value.get();

            // ...and write the result as the value of the output key-value pairs
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // set the paths of the input and output directories in the HDFS
        Path input_dir = new Path("olympic_stats");
        Path output_dir = new Path("gold_medals");

        // in case the output directory already exists, delete it
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        if (fs.exists(output_dir))
            fs.delete(output_dir, true);

        // configure the MapReduce job
        Job goldmedals_job = Job.getInstance(conf, "Gold Medals Counter");
        goldmedals_job.setJarByClass(GoldMedals.class);
        goldmedals_job.setMapperClass(Map.class);
        goldmedals_job.setCombinerClass(Reduce.class);
        goldmedals_job.setReducerClass(Reduce.class);
        goldmedals_job.setMapOutputKeyClass(Text.class);
        goldmedals_job.setMapOutputValueClass(IntWritable.class);
        goldmedals_job.setOutputKeyClass(Text.class);
        goldmedals_job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(goldmedals_job, input_dir);
        FileOutputFormat.setOutputPath(goldmedals_job, output_dir);
        goldmedals_job.waitForCompletion(true);
    }
}
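One design note: the job reuses the Reduce class as a combiner (setCombinerClass(Reduce.class)). That is safe here because summing integers is associative and commutative, so computing partial sums on the map side gives the same final result while cutting down the amount of data shuffled over the network between mappers and reducers.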
The output of the program above is stored inside the gold_medals directory in the HDFS and confirms that the MapReduce job was designed correctly.
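Based on the sample input above, and assuming the default TextOutputFormat (tab-separated key and value), the gold_medals/part-r-00000 file would contain:
A Jones	2
C Glover	3
T Kekempourdas	1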
Source: https://stackoverflow.com/questions/65084063/mapreduce-hadoop-on-linux-change-reduce-key