Question
I am running a Pig script in the HortonWorks virtual machine with the goal of extracting certain parts of my XML dataset, and loading those parts into columns in an HCatalog table. On my local machine, I run my Pig script on the XML file and get an output file with all the extracted parts. However, for some reason when I run this same script in the HortonWorks VM the script appears to run successfully but the HCatalog table is still empty.
Here is my local script:
REGISTER piggybank.jar
items = LOAD 'data1.xml' USING org.apache.pig.piggybank.storage.XMLLoader('row') AS (row:chararray);
data = FOREACH items GENERATE
REGEX_EXTRACT(row, 'Id="([^"]*)"', 1) AS id:int,
REGEX_EXTRACT(row, 'CreationDate="([^"]*)"', 1) AS creationdate:chararray,
REGEX_EXTRACT(row, 'Score="([^"]*)"', 1) AS score:int,
REGEX_EXTRACT(row, 'Title="([^"]*)"', 1) AS title:chararray;
STORE data INTO '/tmp/postsETLResults' USING PigStorage();
The one I'm using in HortonWorks:
REGISTER piggybank.jar
items = LOAD 'data1.xml' USING org.apache.pig.piggybank.storage.XMLLoader('row') AS (row:chararray);
data = FOREACH items GENERATE
REGEX_EXTRACT(row, 'Id="([^"]*)"', 1) AS id:int,
REGEX_EXTRACT(row, 'CreationDate="([^"]*)"', 1) AS creationdate:chararray,
REGEX_EXTRACT(row, 'Score="([^"]*)"', 1) AS score:int,
REGEX_EXTRACT(row, 'Title="([^"]*)"', 1) AS title:chararray;
STORE data into 'posts_table_1' USING org.apache.hcatalog.pig.HCatStorer();
validate = LOAD 'default.posts_table_1' USING org.apache.hcatalog.pig.HCatLoader();
Sample XML row (from the StackOverflow public dataset):
<row Id="149115" PostTypeId="2" ParentId="149078" CreationDate="2008-09-29T15:16:23.870" Score="1" Body="<p>I'm sure you can also have Oracle display a query plan so you can see exactly which index is used first.</p>
" OwnerDisplayName="user16324" LastActivityDate="2008-09-29T15:16:23.870" CommentCount="1" />
I created the HCatalog table manually, and all the correct fields exist and are of the correct type.
The strange thing is that if I run dump data in Pig, I get no output. If I run illustrate data, I see pieces of my data in the log, followed by large blank areas, followed by more data, and so on.
What am I missing here? I'd really like to take this messy XML file and get a neat table in HCatalog. Again, I get the results I'm looking for when running the local script on my machine, but when I run the second version, designed to store the output in the posts_table_1 HCatalog table, I get a success message but an empty table.
Alternatively, if I can just get the output on my local machine as a comma-delimited file, I can use that file and have HCatalog automatically load the data in the Hue interface. As of now, the output is space-delimited which is problematic in Hue because the titles of posts contain spaces.
Thanks in advance! This has me stumped.
Answer 1:
I found the issue. I created the HCatalog table manually and used all of the default options, including the field delimiter, which was set to ^A (\001). My output had columns separated by tabs (\t), so when the table received the data, it found no ^A delimiter and stored an empty dataset. I recreated the table with \t as the field delimiter and everything worked fine.
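To make the delimiter mismatch concrete, here is a minimal Python sketch. It is not how Hive or HCatalog parse rows internally; it only illustrates that a tab-separated line contains no ^A character, so a reader expecting ^A never finds a column boundary. The record values and the split_fields helper are hypothetical, shaped like the four columns the Pig script produces.

```python
# Simplified illustration only: real Hive/HCatalog parsing is more involved.
CTRL_A = "\x01"  # ^A, the default field delimiter the table was created with
TAB = "\t"       # the delimiter actually present in the Pig output


def split_fields(line, delimiter):
    """Split one text record into fields using the given delimiter."""
    return line.split(delimiter)


# Hypothetical record with the script's columns: id, creationdate, score, title.
record = TAB.join(["149115", "2008-09-29T15:16:23.870", "1", "Example title"])

# Reading with ^A finds no boundaries: the whole line stays in one piece.
fields_with_ctrl_a = split_fields(record, CTRL_A)

# Reading with a tab recovers the four intended columns.
fields_with_tab = split_fields(record, TAB)

print(len(fields_with_ctrl_a), len(fields_with_tab))
```

With the matching delimiter the line separates into the four expected columns; with ^A it stays as a single undivided string, which is consistent with the table showing none of the expected column values.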
Source: https://stackoverflow.com/questions/22627693/pig-not-loading-data-into-hcatalog-table-hortonworks-sandbox