How to convert .txt file to Hadoop's sequence file format

Backend · Open · 7 answers · 1571 views
Asked by 独厮守ぢ on 2020-11-29 01:19

To effectively utilise map-reduce jobs in Hadoop, I need the data to be stored in Hadoop's sequence file format. However, the data is currently only in flat .txt format. Can anyone suggest a way to convert it?

7 Answers
  • 2020-11-29 01:57

    The simplest answer is an "identity" job that has a SequenceFile output.

    In Java it looks like this:

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

        public class ConvertTextToSequenceFile {

            public static void main(String[] args) throws IOException,
                    InterruptedException, ClassNotFoundException {

                Configuration conf = new Configuration();
                Job job = new Job(conf);
                job.setJobName("Convert Text");
                job.setJarByClass(ConvertTextToSequenceFile.class);

                // the identity Mapper passes each (offset, line) record through unchanged
                job.setMapperClass(Mapper.class);

                // zero reduce tasks: map output goes straight to the sequence files;
                // increase if you need sorting or a specific number of output files
                job.setNumReduceTasks(0);

                job.setOutputKeyClass(LongWritable.class);
                job.setOutputValueClass(Text.class);

                job.setOutputFormatClass(SequenceFileOutputFormat.class);
                job.setInputFormatClass(TextInputFormat.class);

                TextInputFormat.addInputPath(job, new Path("/lol"));
                SequenceFileOutputFormat.setOutputPath(job, new Path("/lolz"));

                // submit and wait for completion
                job.waitForCompletion(true);
            }
        }
    
  • 2020-11-29 02:07
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    // White, Tom (2012-05-10). Hadoop: The Definitive Guide (Kindle Locations 5375-5384). O'Reilly Media. Kindle Edition.

    public class SequenceFileWriteDemo {

        private static final String[] DATA = {
            "One, two, buckle my shoe",
            "Three, four, shut the door",
            "Five, six, pick up sticks",
            "Seven, eight, lay them straight",
            "Nine, ten, a big fat hen"
        };

        public static void main(String[] args) throws IOException {
            String uri = args[0];
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            Path path = new Path(uri);
            IntWritable key = new IntWritable();
            Text value = new Text();
            SequenceFile.Writer writer = null;
            try {
                writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
                for (int i = 0; i < 100; i++) {
                    key.set(100 - i);
                    value.set(DATA[i % DATA.length]);
                    System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
                    writer.append(key, value);
                }
            } finally {
                IOUtils.closeStream(writer);
            }
        }
    }
    
  • 2020-11-29 02:11

    Be careful with the format specifiers.

    For example (note the space between % and s), System.out.printf("[% s]\t% s\t% s\n", writer.getLength(), key, value); will throw java.util.FormatFlagsConversionMismatchException: Conversion = s, Flags = (the offending flag is the stray space character)

    Instead, we should use:

    System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value); 
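    The corrected specifier can be checked in plain Java, without any Hadoop dependency; the sample length, key, and value below are made-up stand-ins for writer.getLength(), key, and value:

```java
public class PrintfDemo {
    public static void main(String[] args) {
        // "%s" with no flags formats any argument via toString();
        // "% s" would throw FormatFlagsConversionMismatchException,
        // because the space flag is not valid for the 's' conversion
        String line = String.format("[%s]\t%s\t%s", 128L, 95, "Three, four, shut the door");
        System.out.println(line);
    }
}
```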
    
  • 2020-11-29 02:15

    It depends on what the format of the .txt file is. Is it one line per record? If so, you can simply use TextInputFormat, which creates one record for each line. In your mapper you can parse that line and use it however you choose.

    If it isn't one line per record, you might need to write your own InputFormat implementation. Take a look at this tutorial for more info.
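    For the one-line-per-record case, the per-record parsing your map() would do can be sketched in plain Java; the tab-separated "id<TAB>text" layout and the parseRecord helper here are assumptions for illustration, not part of any Hadoop API:

```java
public class LineParseDemo {

    // Split one TextInputFormat record (a single line of text) into a
    // (key, value) pair at the first tab character.
    static String[] parseRecord(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // no delimiter: empty key, whole line as value
            return new String[] { "", line };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = parseRecord("42\thello world");
        System.out.println(kv[0] + " -> " + kv[1]); // prints "42 -> hello world"
    }
}
```

    Inside a real Mapper you would do the same split on value.toString() and emit the pieces as whatever Writable types your job needs.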

  • 2020-11-29 02:18

    If you have Mahout installed, it has a command called seqdirectory which can do it.

  • 2020-11-29 02:21

    You can also just create an intermediate table, LOAD DATA the csv contents straight into it, then create a second table stored as sequencefile (partitioned, clustered, etc.) and insert into it with a select from the intermediate table. You can also set options for compression, e.g.,

    set hive.exec.compress.output = true;
    set io.seqfile.compression.type = BLOCK;
    set mapred.output.compression.codec = org.apache.hadoop.io.compress.SnappyCodec;
    
    create table... stored as sequencefile;
    
    insert overwrite table ... select * from ...;
    

    The MR framework will then take care of the heavy lifting for you, saving you the trouble of having to write Java code.
