Export from Pig to CSV

鱼传尺愫 2020-12-15 10:50

I'm having a lot of trouble getting data out of Pig and into a CSV that I can use in Excel or SQL (or R, SPSS, etc.) without a lot of manipulation ...

I've tri

2 Answers
  • 2020-12-15 11:23

    I'm afraid there isn't a one-liner that does the job, but you can come up with the following (Pig v0.10.0):

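    -- Load the comma-separated input and declare a schema.
    A = load '/user/hadoop/csvinput/somedata.txt' using PigStorage(',')
          as (firstname:chararray, lastname:chararray, age:int, location:chararray);
    -- '-schema' writes .pig_schema and .pig_header next to the part files.
    -- The output delimiter here is a tab; pass ',' instead for comma-separated output.
    store A into '/user/hadoop/csvoutput' using PigStorage('\t','-schema');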
    

    When PigStorage is given '-schema', it creates a '.pig_schema' and a '.pig_header' file in the output directory. You then have to merge '.pig_header' with 'part-x-xxxxx':

    1. If the result needs to be copied to the local disk:

    hadoop fs -rm /user/hadoop/csvoutput/.pig_schema
    hadoop fs -getmerge /user/hadoop/csvoutput ./output.csv
    

    (Since -getmerge merges every file in the input directory, you need to remove .pig_schema first; otherwise it ends up in the merged file.)

    2. Storing the result on HDFS:

    hadoop fs -cat /user/hadoop/csvoutput/.pig_header \
      /user/hadoop/csvoutput/part-x-xxxxx |
        hadoop fs -put - /user/hadoop/csvoutput/result/output.csv
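
To see what this merge produces without touching a cluster, the same concatenation can be imitated on the local filesystem. This is only a sketch: the temporary directory, the `part-m-0000N` file names, and the sample rows are invented stand-ins for the real HDFS output, and plain `cat` replaces `hadoop fs -cat`.

```shell
#!/bin/sh
# Local stand-in for the HDFS output directory (hypothetical sample data).
outdir=$(mktemp -d)

# Tab-delimited header, matching the '\t' delimiter used in the store statement.
printf 'firstname\tlastname\tage\tlocation\n' > "$outdir/.pig_header"

# Two invented part files, as a job with several tasks might write.
printf 'John\tSmith\t30\tLondon\n' > "$outdir/part-m-00000"
printf 'Jane\tDoe\t25\tParis\n'    > "$outdir/part-m-00001"

# Header first, then every part file in name order (the shell sorts the glob).
# .pig_schema is never named here, so it cannot leak into the result.
cat "$outdir/.pig_header" "$outdir"/part-* > "$outdir/output.csv"

cat "$outdir/output.csv"
```

Listing the header explicitly and globbing only the part files avoids having to delete `.pig_schema` beforehand.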
    

    For further reference you might also have a look at these posts:
    STORE output to a single CSV?
    How can I concatenate two files in hadoop into one using Hadoop FS shell?

  • 2020-12-15 11:26

    If you store your data with PigStorage on HDFS, you can then merge it using -getmerge -nl:

    STORE pig_object INTO '/user/hadoop/csvoutput/pig_object'
        using PigStorage('\t','-schema');
    fs -getmerge -nl /user/hadoop/csvoutput/pig_object  /Users/Name/Folder/pig_object.csv;
    

    Docs:

    Optionally -nl can be set to enable adding a newline character (LF) at the end of each file.

    You will then have a single TSV/CSV file with the following structure:

    1 - header
    2 - empty line
    3 - pig schema
    4 - empty line
    5 - 1st line of DATA
    6 - 2nd line of DATA
    ...
    

    So we can simply remove lines 2, 3 and 4 using AWK:

    awk 'NR==1 || NR>4 {print}' /Users/Name/Folder/pig_object.csv > /Users/Name/Folder/pig_object_clean.csv
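
The filter can be tried on a small file first. The sketch below builds a throwaway file with the six-line layout listed above and runs the same AWK condition on it; the column names, the schema placeholder, and the data rows are all invented for illustration.

```shell
#!/bin/sh
# Throwaway file laid out like the merged output (hypothetical sample values).
workdir=$(mktemp -d)
merged="$workdir/merged.csv"

printf 'firstname\tlastname\tage\n' >  "$merged"   # line 1: header
printf '\n'                         >> "$merged"   # line 2: empty line
printf '{"fields":[]}\n'            >> "$merged"   # line 3: schema placeholder
printf '\n'                         >> "$merged"   # line 4: empty line
printf 'John\tSmith\t30\n'          >> "$merged"   # line 5: first data row
printf 'Jane\tDoe\t25\n'            >> "$merged"   # line 6: second data row

# Keep line 1 (the header) and everything after line 4 (the data).
awk 'NR==1 || NR>4' "$merged" > "$workdir/clean.csv"

cat "$workdir/clean.csv"
```

A pattern with no action prints the matching line, so `{print}` can be dropped without changing the result.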
    