How to convert .txt file to Hadoop's sequence file format

独厮守ぢ 2020-11-29 01:19

To use MapReduce jobs in Hadoop effectively, I need the data to be stored in Hadoop's sequence file format. However, the data is currently only in flat .txt format. Can anyo

7 answers
  • 2020-11-29 01:57

    The simplest answer is an "identity" job that writes SequenceFile output.

    It looks like this in Java:

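        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

        // The class name is illustrative; the original snippet showed only main().
        public class ConvertText {

            public static void main(String[] args) throws IOException,
                    InterruptedException, ClassNotFoundException {

                Configuration conf = new Configuration();
                // Hadoop 2 and later; older releases used the now-deprecated new Job(conf)
                Job job = Job.getInstance(conf);
                job.setJobName("Convert Text");
                job.setJarByClass(Mapper.class);

                // the base Mapper and Reducer pass records through unchanged
                job.setMapperClass(Mapper.class);
                job.setReducerClass(Reducer.class);

                // increase if you need sorting or a special number of files
                job.setNumReduceTasks(0);

                // TextInputFormat emits (byte offset, line) pairs
                job.setOutputKeyClass(LongWritable.class);
                job.setOutputValueClass(Text.class);

                job.setOutputFormatClass(SequenceFileOutputFormat.class);
                job.setInputFormatClass(TextInputFormat.class);

                TextInputFormat.addInputPath(job, new Path("/lol"));
                SequenceFileOutputFormat.setOutputPath(job, new Path("/lolz"));

                // submit and wait for completion
                job.waitForCompletion(true);
            }
        }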
    
  • 2020-11-29 02:07
    import java.io.IOException;
    import java.net.URI;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    
    //White, Tom (2012-05-10). Hadoop: The Definitive Guide (Kindle Locations 5375-5384). OReilly Media - A. Kindle Edition. 
    
    public class SequenceFileWriteDemo {

        private static final String[] DATA = {
            "One, two, buckle my shoe",
            "Three, four, shut the door",
            "Five, six, pick up sticks",
            "Seven, eight, lay them straight",
            "Nine, ten, a big fat hen"
        };

        public static void main(String[] args) throws IOException {
            String uri = args[0]; // destination path of the sequence file
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            Path path = new Path(uri);
            IntWritable key = new IntWritable();
            Text value = new Text();
            SequenceFile.Writer writer = null;
            try {
                writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
                // write 100 records with descending keys, cycling through DATA
                for (int i = 0; i < 100; i++) {
                    key.set(100 - i);
                    value.set(DATA[i % DATA.length]);
                    // print the writer's current position before appending the record
                    System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
                    writer.append(key, value);
                }
            } finally {
                IOUtils.closeStream(writer);
            }
        }
    }
    
  • 2020-11-29 02:11

    Be careful with the format specifier.

    For example, System.out.printf("[% s]\t% s\t% s\n", writer.getLength(), key, value); (note the space between % and s) throws java.util.FormatFlagsConversionMismatchException: Conversion = s, Flags =

    Instead, we should use:

    System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value); 
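
    To make the difference concrete, here is a small sketch that uses only the JDK. The class and method names are made up for illustration, and the sample values stand in for writer.getLength(), key and value. The space in "% s" is parsed as a flag, and that flag is not valid for the s conversion.

```java
import java.util.FormatFlagsConversionMismatchException;

// Hypothetical demo class; the name and the sample values are illustrative only.
class FormatSpecifierDemo {

    // Formats one record the same way as the corrected printf call.
    static String formatRecord(long position, int key, String value) {
        return String.format("[%s]\t%s\t%s\n", position, key, value);
    }

    // Returns true if the format string is rejected because a flag
    // (such as the space flag) does not apply to the conversion.
    static boolean hasFlagMismatch(String format, Object... args) {
        try {
            String.format(format, args);
            return false;
        } catch (FormatFlagsConversionMismatchException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.print(formatRecord(128L, 100, "One, two, buckle my shoe"));
        System.out.println(hasFlagMismatch("[% s]", 128L)); // space flag with %s
        System.out.println(hasFlagMismatch("[%s]", 128L));  // plain %s
    }
}
```

    Running main prints one formatted record, then true for the "% s" pattern and false for "%s".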
    
  • 2020-11-29 02:15

    It depends on what the format of the TXT file is. Is it one line per record? If so, you can simply use TextInputFormat which creates one record for each line. In your mapper you can parse that line and use it whichever way you choose.
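
    For the one-line-per-record case, the parsing step might look like the sketch below. It assumes tab-separated key<TAB>value lines, which the question does not state, and the LineRecordParser helper is hypothetical. It uses only the JDK; inside a real mapper you would call it on value.toString() and wrap the two parts in Text before writing them to the context.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical helper: the class name and the tab-separated layout are assumptions.
class LineRecordParser {

    // Splits "key<TAB>value" at the first tab. A line without a tab becomes
    // a record whose key is the whole line and whose value is empty.
    static Map.Entry<String, String> parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new SimpleImmutableEntry<>(line, "");
        }
        return new SimpleImmutableEntry<>(line.substring(0, tab), line.substring(tab + 1));
    }

    public static void main(String[] args) {
        Map.Entry<String, String> record = parse("user42\tclicked checkout");
        System.out.println(record.getKey() + " -> " + record.getValue());
    }
}
```

    Splitting only at the first tab keeps any further tabs inside the value, which is usually what you want when the value is free text.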

    If it isn't one line per record, you might need to write your own InputFormat implementation. Take a look at this tutorial for more info.

  • 2020-11-29 02:18

    If you have Mahout installed, it has a command called seqdirectory that can do this.

  • 2020-11-29 02:21

    In Hive, you can also create an intermediate table, LOAD DATA the CSV contents straight into it, then create a second table stored as sequencefile (partitioned, clustered, etc.) and fill it with an INSERT ... SELECT from the intermediate table. You can also set options for compression, e.g.,

    set hive.exec.compress.output = true;
    set io.seqfile.compression.type = BLOCK;
    set mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
    
    create table... stored as sequencefile;
    
    insert overwrite table ... select * from ...;
    

    The MapReduce framework then takes care of the heavy lifting for you, saving you the trouble of writing Java code.
