How to specify the KeyValueTextInputFormat separator in the Hadoop 0.20 API?

In the new API (org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat), how do I specify a separator (delimiter) other than tab (the default) to split each line into key and value?

7 Answers
  • 2020-12-08 06:04

    Example (complete driver, mapper, and reducer, with imports added so it compiles as-is):

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.log4j.PropertyConfigurator;

    public class KeyValueTextInput extends Configured implements Tool {
        public static void main(String[] args) throws Exception {
            String log4jConfPath = "log4j.properties";
            PropertyConfigurator.configure(log4jConfPath);
            int res = ToolRunner.run(new KeyValueTextInput(), args);
            System.exit(res);
        }

        public int run(String[] args) throws Exception {
            Configuration conf = this.getConf();

            // Old (mapred) API property name:
            //conf.set("key.value.separator.in.input.line", ",");

            // New (mapreduce) API property name; set it before creating the Job:
            conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

            Job job = Job.getInstance(conf, "WordCountSampleTemplate");
            job.setJarByClass(KeyValueTextInput.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);

            //job.setMapOutputKeyClass(Text.class);
            //job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            job.setInputFormatClass(KeyValueTextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            Path outputPath = new Path(args[1]);
            // Remove any previous output so the job can be rerun.
            FileSystem fs = FileSystem.get(new URI(outputPath.toString()), conf);
            fs.delete(outputPath, true);
            FileOutputFormat.setOutputPath(job, outputPath);

            return job.waitForCompletion(true) ? 0 : 1;
        }
    }

    // KeyValueTextInputFormat hands the mapper a (Text, Text) pair already
    // split on the configured separator, so the mapper just passes it through.
    class Map extends Mapper<Text, Text, Text, Text> {
        public void map(Text k1, Text v1, Context context) throws IOException, InterruptedException {
            context.write(k1, v1);
        }
    }

    class Reduce extends Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // Concatenate all values seen for a key, separated by " || ".
            StringBuilder sum = new StringBuilder(" || ");
            for (Text value : values)
                sum.append(value.toString()).append(" || ");
            context.write(key, new Text(sum.toString()));
        }
    }
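    For instance, given a hypothetical comma-separated input file such as

        apple,fruit
        carrot,vegetable
        apple,red

    the job above emits one line per key, e.g. apple followed (after a tab) by  || fruit || red || .
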
  • 2020-12-08 06:07

    It's a matter of ordering.

    The conf.set("key.value.separator.in.input.line", ",") call must come before you create the Job instance, because Job copies the configuration when it is constructed, so later changes are not picked up. So:

    conf.set("key.value.separator.in.input.line", ",");
    Job job = new Job(conf);
    
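    The same ordering applies with the new-API property name; a minimal sketch, using the Job.getInstance factory:

    Configuration conf = new Configuration();
    // Set the separator BEFORE constructing the Job: the Job takes a copy
    // of the Configuration, so later conf.set(...) calls are not seen by it.
    conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
    Job job = Job.getInstance(conf);
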
  • 2020-12-08 06:07

    First, the new API was not finished in 0.20.*, so if you want to use the new API in 0.20.* you have to implement the feature yourself. For example, you can fall back to the plain TextInputFormat: ignore the LongWritable key and split the Text value on the comma yourself, as in the sketch below.
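
    A minimal sketch of that workaround (the class name is illustrative, not from the original answer):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // With the default TextInputFormat the mapper receives
    // (LongWritable byte offset, Text whole line); split the line manually.
    class CommaSplitMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        public void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(",", 2); // split on the first comma only
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new Text(parts[1]));
            }
        }
    }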

  • 2020-12-08 06:12

    Set the following in the driver code:

    conf.set("key.value.separator.in.input.line", ",");
    
  • 2020-12-08 06:17

    By default, the KeyValueTextInputFormat class uses a tab as the separator between the key and the value in each line of the input text file.

    If you want to read the input with a custom separator, you have to set the corresponding configuration property.

    For the new Hadoop API, it is:

    conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ";");
    
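    For comparison, the old (org.apache.hadoop.mapred) API reads its separator from a different property, the one used in some of the other answers here:

    conf.set("key.value.separator.in.input.line", ";");
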
  • 2020-12-08 06:18

    In the newer API you should use the mapreduce.input.keyvaluelinerecordreader.key.value.separator configuration property.

    Here's an example:

    Configuration conf = new Configuration();
    conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
    
    Job job = new Job(conf);
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    // ... rest of the job set-up
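
    With this setting, an input line such as one,two three reaches the mapper as key one and value two three: the line is split at the first occurrence of the separator, and any later occurrences stay in the value.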
    