Extract first line of CSV file in Pig

backend · open · 2 answers · 1399 views
情话喂你
情话喂你 asked 2021-01-23 09:31

I have several CSV files, and the header is always the first line in the file. What's the best way to get that line out of the CSV file as a string in Pig? Preprocessing with se…

2 Answers
  • 2021-01-23 09:37

    Disclaimer: I'm not great with Java.

    You are going to need a UDF. I'm not sure exactly what you are asking for, but this UDF will take a series of CSV files and turn them into maps, where the keys are the values at the top of the file. This should hopefully be enough of a skeleton so that you can change it into what you want.

    The couple of tests I've done remotely and locally indicate that this will work.

    package myudfs;
    import java.io.IOException;
    import org.apache.pig.LoadFunc;
    
    import java.util.Map;
    import java.util.HashMap;
    import java.util.ArrayList;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.InputFormat;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    
    import org.apache.pig.PigException;
    import org.apache.pig.backend.executionengine.ExecException;
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
    
    public class ExampleCSVLoader extends LoadFunc {
        protected RecordReader in = null;
        private String fieldDel = "\t";
        private Map<String, String> outputMap = null;
        private TupleFactory mTupleFactory = TupleFactory.getInstance();
    
        // This stores the fields that are defined in the first line of the file
        private ArrayList<Object> topfields = null;
    
        public ExampleCSVLoader() {}
    
        public ExampleCSVLoader(String delimiter) {
            this();
            this.fieldDel = delimiter;
        }
    
        @Override
        public Tuple getNext() throws IOException {
            try {
                boolean notDone = in.nextKeyValue();
                if (!notDone) {
                    outputMap = null;
                    topfields = null;
                    return null;
                }
    
                String value = in.getCurrentValue().toString();
                // Note: split() treats fieldDel as a regular expression, so
                // metacharacters such as "|" must be passed pre-escaped ("\\|")
                String[] values = value.split(fieldDel);
                Tuple t =  mTupleFactory.newTuple(1);
    
                ArrayList<Object> tf = new ArrayList<Object>();
    
                int pos = 0;
                for (int i = 0; i < values.length; i++) {
                    if (topfields == null) {
                        tf.add(values[i]);
                    } else {
                        readField(values[i], pos);
                        pos = pos + 1;
                    }
                }
                if (topfields == null) {
                    topfields = tf;
                    t = mTupleFactory.newTuple();
                } else {
                    t.set(0, outputMap);
                }
    
                outputMap = null;
                return t;
            } catch (InterruptedException e) {
                int errCode = 6018;
                String errMsg = "Error while reading input";
                throw new ExecException(errMsg, errCode,
                        PigException.REMOTE_ENVIRONMENT, e);
            }
    
        }
    
        // Applies foo to the appropriate value in topfields
        private void readField(String foo, int pos) {
            if (outputMap == null) {
                outputMap = new HashMap<String, String>();
            }
            outputMap.put((String) topfields.get(pos), foo);
        }
    
        @Override
        public InputFormat getInputFormat() {
            return new TextInputFormat();
        }
    
        @Override
        public void prepareToRead(RecordReader reader, PigSplit split) {
            in = reader;
        }
    
        @Override
        public void setLocation(String location, Job job)
                throws IOException {
            FileInputFormat.setInputPaths(job, location);
        }
    }
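
    A Pig script exercising this loader might look like the following. The jar name and input path are assumptions; note that the pipe is passed regex-escaped, because the loader hands the delimiter to `String.split()`, which expects a regular expression:

    ```pig
    -- jar name and input path are assumptions
    REGISTER myudfs.jar;

    -- Pass the delimiter regex-escaped ('\\|'), since the loader
    -- forwards it to String.split().
    A = LOAD 'csvdir' USING myudfs.ExampleCSVLoader('\\|');
    DUMP A;
    ```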
    

    Sample output loading a directory with:

    csv1.in             csv2.in
    -------            ---------
    A|B|C               D|E|F
    Hello|This|is       PLEASE|WORK|FOO
    FOO|BAR|BING        OR|EVERYTHING|WILL
    BANG|BOSH           BE|FOR|NAUGHT
    

    Produces this output:

    A: {M: map[]}
    ()
    ([D#PLEASE,E#WORK,F#FOO])
    ([D#OR,E#EVERYTHING,F#WILL])
    ([D#BE,E#FOR,F#NAUGHT])
    ()
    ([A#Hello,B#This,C#is])
    ([A#FOO,B#BAR,C#BING])
    ([A#BANG,B#BOSH])
    

    The empty ()s correspond to the header lines of the files. getNext() has to return something for every record it consumes, otherwise Pig stops processing the file, so the header lines come back as empty tuples.
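
    One detail worth being aware of: since the UDF relies on `String.split()`, which interprets its argument as a regular expression, a delimiter like the `|` in the sample files behaves very differently depending on whether it is escaped. A minimal standalone Java sketch of the pitfall:

    ```java
    import java.util.Arrays;
    import java.util.regex.Pattern;

    public class SplitDemo {
        public static void main(String[] args) {
            String line = "A|B|C";

            // Unescaped "|" is regex alternation, matching the empty string
            // at every position, so the line is NOT split on the pipes.
            String[] naive = line.split("|");
            System.out.println(naive.length); // not 3

            // Escaping the pipe (or using Pattern.quote) makes it literal.
            String[] escaped = line.split("\\|");
            String[] quoted = line.split(Pattern.quote("|"));
            System.out.println(Arrays.toString(escaped)); // [A, B, C]
            System.out.println(Arrays.toString(quoted));  // [A, B, C]
        }
    }
    ```
    
    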

  • 2021-01-23 09:58

    If your CSV files comply with the CSV conventions of Excel 2007, you can use the CSVExcelStorage loader already available in Piggybank: http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java?view=markup

    It has an option, SKIP_INPUT_HEADER, to skip the CSV header.
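
    Usage looks roughly like this (the comma delimiter, file path, and the other constructor arguments are assumptions; check the Piggybank source linked above for the exact signature available in your version):

    ```pig
    REGISTER piggybank.jar;

    -- Constructor args: delimiter, multiline handling,
    -- line-ending style, and the header option
    A = LOAD 'data.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(
            ',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
    ```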
