Question
I am using Ubuntu 20.10 on VirtualBox and Hadoop version 3.2.1 (if you need any more info, ask in the comments).
My output at the moment looks like this:
Aaron Wells Peirsol ,M,17,United States,Swimming,2000 Summer,0,1,0
Aaron Wells Peirsol ,M,21,United States,Swimming,2004 Summer,1,0,0
Aaron Wells Peirsol ,M,25,United States,Swimming,2008 Summer,0,1,0
Aaron Wells Peirsol ,M,25,United States,Swimming,2008 Summer,1,0,0
For the above output I would like to be able to sum all of his medals
(the three numbers at the end of each line represent the gold, silver, and bronze
medals the participant has won over the years at the Olympic Games).
The project had no specification on which age to keep (17, 21, 25, 25)
or which games (2000, 2004, 2008, 2008 Summer), but I have to add up the medals
in order to be able to sort the participants by who has won the most gold medals, etc.
Any ideas? I can provide my code if you need it, but I guess I need another MapReduce job that will take the input shown above and give us something like:
Aaron Wells Peirsol,M,25,United States,Swimming,2008 Summer,2,2,0
If there is a way to remove the "\t" between key and value from the reduce output, that would be very beneficial too!
Thank you all for your time, Gyftonikolos Nikolaos.
Answer 1:
Although it might seem a bit tricky at first, this is yet another case of the WordCount example, only this time a composite key and value are needed in order to feed the data from the mapper to the reducer as key-value pairs.
For the mapper, we need to extract all the info from each line of the input file and divide the columns into two "categories":
- the main info that is always the same for each athlete, which goes into the key,
- the stat info that changes from line to line and needs to be aggregated.
For each athlete's lines, the columns that never change are the athlete's name, sex, country, and sport. All of these are going to form the key, using the , character as a delimiter between the fields. The rest of the column data go into the value side of the key-value pairs, but we need delimiters on them too, in order to differentiate the medal counters from each age and Olympic-games year. We are going to use:
- the @ character as a delimiter between the age and the year,
- the # character as a delimiter between the medal counters,
- and the _ character as a delimiter between those two groups (see the example pair right after this list).
At the Reduce function, all we have to do is count the medals to find each athlete's totals, and keep track of the latest age and games recorded for that athlete. For instance, the four Aaron Wells Peirsol lines above reach the reducer as a single key with four values, which fold into the totals 2,2,0 along with the latest age 25 and the latest games 2008 Summer.
In order not to have a tab character between the key and the value at the output of the MapReduce job, we can simply set NullWritable as the key of the key-value pairs generated by the reducer and put all the computed data in the value of each pair, using the , character as a delimiter.
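(Alternatively, if you wanted to keep a real key, the separator used by TextOutputFormat is configurable; assuming I recall the property name correctly, a line like the following in the driver should replace the tab with a comma:)
conf.set("mapreduce.output.textoutputformat.separator", ",");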
The code for this job looks like this:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.fs.FileSystem;
import java.io.IOException;
public class Medals
{
/* input: <byte_offset, line_of_dataset>
* output: <(name,sex,country,sport), (age@year_gold#silver#bronze)>
*/
public static class Map extends Mapper<Object, Text, Text, Text>
{
public void map(Object key, Text value, Context context) throws IOException, InterruptedException
{
String record = value.toString();
String[] columns = record.split(",");
// extract athlete's main info
String name = columns[0];
String sex = columns[1];
String country = columns[3];
String sport = columns[4];
// extract athlete's stat info
String age = columns[2];
String year = columns[5];
String gold = columns[6];
String silver = columns[7];
String bronze = columns[8];
// set the main info as key and the stat info as value
context.write(new Text(name + "," + sex + "," + country + "," + sport), new Text(age + "@" + year + "_" + gold + "#" + silver + "#" + bronze));
}
}
/* input: <(name,sex,country,sport), (age@year_gold#silver#bronze)>
* output: <NULL, (name,sex,age,country,sport,year,golds,silvers,bronzes)>
*/
public static class Reduce extends Reducer<Text, Text, NullWritable, Text>
{
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException
{
// extract athlete's main info
String[] athlete_info = key.toString().split(",");
String name = athlete_info[0];
String sex = athlete_info[1];
String country = athlete_info[2];
String sport = athlete_info[3];
int latest_age = 0;
String latest_games = "";
int gold_cnt = 0;
int silver_cnt = 0;
int bronze_cnt = 0;
// for a single athlete, compute their stats...
for(Text value : values)
{
String[] split_value = value.toString().split("_");
String[] age_and_year = split_value[0].split("@");
String[] medals = split_value[1].split("#");
// find the last age and games the athlete has stats in the input file
if(Integer.parseInt(age_and_year[0]) > latest_age)
{
latest_age = Integer.parseInt(age_and_year[0]);
latest_games = age_and_year[1];
}
// each medal column is a 0/1 flag per games, so just add it to the running totals
gold_cnt += Integer.parseInt(medals[0]);
silver_cnt += Integer.parseInt(medals[1]);
bronze_cnt += Integer.parseInt(medals[2]);
}
context.write(NullWritable.get(), new Text(name + "," + sex + "," + String.valueOf(latest_age) + "," + country + "," + sport + "," + latest_games + "," + String.valueOf(gold_cnt) + "," + String.valueOf(silver_cnt) + "," + String.valueOf(bronze_cnt)));
}
}
public static void main(String[] args) throws Exception
{
// set the paths of the input and output directories in the HDFS
Path input_dir = new Path("olympic_stats");
Path output_dir = new Path("medals");
// in case the output directory already exists, delete it
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
if(fs.exists(output_dir))
fs.delete(output_dir, true);
// configure the MapReduce job
Job medals_job = Job.getInstance(conf, "Medals Counter");
medals_job.setJarByClass(Medals.class);
medals_job.setMapperClass(Map.class);
medals_job.setReducerClass(Reduce.class);
medals_job.setMapOutputKeyClass(Text.class);
medals_job.setMapOutputValueClass(Text.class);
medals_job.setOutputKeyClass(NullWritable.class);
medals_job.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(medals_job, input_dir);
FileOutputFormat.setOutputPath(medals_job, output_dir);
medals_job.waitForCompletion(true);
}
}
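For completeness, one typical way to compile and run the above (assuming HADOOP_CLASSPATH points to your JDK's tools.jar and that the olympic_stats input directory already exists in HDFS; both depend on your setup) would be something like:
hadoop com.sun.tools.javac.Main Medals.java
jar cf medals.jar Medals*.class
hadoop jar medals.jar Medals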
And of course the result is just how you wanted it; for the sample input above it comes out as:
Aaron Wells Peirsol ,M,25,United States,Swimming,2008 Summer,2,2,0
Source: https://stackoverflow.com/questions/65159570/mapreduce-hadoop-on-linux-multiple-data-on-input