Pig - Remove embedded newlines and commas in gzip files

℡╲_俬逩灬. 提交于 2020-01-06 04:19:27

问题


I have a gzip file with data field separated by commas. I am currently using PigStorage to load the file as shown below:

A = load 'myfile.gz' USING PigStorage(',') AS (id,date,text);

The data in the gzip file has embedded characters - embedded newlines and commas. These characters exist in all the three fields - id, date and text. The embedded characters are always within the "" quotes.

I would like to replace or remove these characters using Pig before doing any further processing.

I think I need to first look for the occurrence of the "" quotes. Once I find these quotes, I need to look at the string within these quotes and search for the commas and new line characters in it. Once found, I need to replace them with a space or remove them.

How can I achieve this via Pig?


回答1:


Try this :

REGISTER piggybank.jar; 
A = LOAD 'myfile.gz' USING org.apache.pig.piggybank.storage.CSVExcelStorage() AS (id:chararray,date:chararray,text:chararray);
B = FOREACH A GENERATE  REPLACE(REPLACE(id,'\n',''),',','') AS id, REPLACE(REPLACE(date,'\n',''),',','') AS date, REPLACE(REPLACE(text,'\n',''),',','') AS text;

We can use either : org.apache.pig.piggybank.storage.CSVExcelStorage() or org.apache.pig.piggybank.storage.CSVLoader().

Refer the below API links for details

  1. http://pig.apache.org/docs/r0.12.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html
  2. http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/piggybank/storage/CSVLoader.html


来源:https://stackoverflow.com/questions/31394130/pig-remove-embedded-newlines-and-commas-in-gzip-files

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!