I'm having a lot of trouble getting data out of pig and into a CSV that I can use in Excel or SQL (or R or SPSS etc etc) without a lot of manipulation ...
I've tried ...
I'm afraid there isn't a one-liner which does the job, but you can do the following (Pig v0.10.0):
A = load '/user/hadoop/csvinput/somedata.txt' using PigStorage(',')
as (firstname:chararray, lastname:chararray, age:int, location:chararray);
store A into '/user/hadoop/csvoutput' using PigStorage('\t','-schema');
When PigStorage takes '-schema' it will create a '.pig_schema' file and a '.pig_header' file in the output directory. Then you have to merge '.pig_header' with 'part-x-xxxxx':
1. If the result needs to be copied to the local disk:
hadoop fs -rm /user/hadoop/csvoutput/.pig_schema
hadoop fs -getmerge /user/hadoop/csvoutput ./output.csv
(Since -getmerge takes an input directory, you need to get rid of .pig_schema first)
2. Storing the result on HDFS:
hadoop fs -cat /user/hadoop/csvoutput/.pig_header
/user/hadoop/csvoutput/part-x-xxxxx |
hadoop fs -put - /user/hadoop/csvoutput/result/output.csv
For further reference you might also have a look at these posts:
STORE output to a single CSV?
How can I concatenate two files in hadoop into one using Hadoop FS shell?
If you store your data with PigStorage on HDFS and then merge it using -getmerge -nl:
STORE pig_object INTO '/user/hadoop/csvoutput/pig_object'
using PigStorage('\t','-schema');
fs -getmerge -nl /user/hadoop/csvoutput/pig_object /Users/Name/Folder/pig_object.csv;
Docs:
Optionally -nl can be set to enable adding a newline character (LF) at the end of each file.
You will have a single TSV/CSV file with the following structure (the empty lines come from -nl appending a newline after each merged file):
1 - header
2 - empty line
3 - pig schema
4 - empty line
5 - 1st line of DATA
6 - 2nd line of DATA
...
So we can simply remove lines 2-4 using AWK:
awk 'NR==1 || NR>4 {print}' /Users/Name/Folder/pig_object.csv > /Users/Name/Folder/pig_object_clean.csv
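If AWK isn't handy, the same cleanup can be done with head and tail. This is a sketch on fabricated sample data mimicking the merged layout; the file paths and contents are illustrative, not taken from the answer above:

```shell
# Build a small file that mimics the merged layout:
# header, empty line, pig schema, empty line, then data rows.
printf 'firstname\tlastname\n\n{"schema":"..."}\n\ndan\tsmith\nsarah\tjones\n' > /tmp/pig_object.csv

# Keep line 1 (the header), then everything from line 5 onward (the data).
head -n 1 /tmp/pig_object.csv > /tmp/pig_object_clean.csv
tail -n +5 /tmp/pig_object.csv >> /tmp/pig_object_clean.csv
```

The resulting /tmp/pig_object_clean.csv contains the header row followed immediately by the data rows, with the schema and blank lines stripped.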