Question
I am trying to load an XML file into a Hive table using the XML SerDe [here][1]. Simple flat XML files load fine. For XML with nested elements, I am using Hive complex data types (e.g., array<struct>) to store them. Below is the sample XML that I am trying to load. My goal is to load all elements, attributes, and content into the Hive table.
<classif action="del">
<code>123</code>
<class action="aou">
<party>p1</party>
<description action="up">
<name action="aorup" ln="te">
this is name1
</name>
<name action="aorup" ln="tm">
this is name2
</name>
<name action="aorup" ln="hi">
this is name3
</name>
</description>
</class>
<class action="a">
<party>p2</party>
<description action="up">
<name action="aorup" ln="te">
this is name4
</name>
<name action="aorup" ln="tm">
this is name5
</name>
<name action="aorup" ln="hi">
this is name6
</name>
</description>
</class>
</classif>
The Hive output that I am trying to get is:
{action:"del", classif:{code:"123", class:[{action:"aou", class:{party:"p1", description:{action:"up", description:[{action:"aorup", ln:"te", name:"this is name1"}, {action:"aorup", ln:"tm", name:"this is name2"}, {action:"aorup", ln:"hi", name:"this is name3"}]}}}, {action:"a", class:{party:"p2", description:{action:"up", description:[{action:"aorup", ln:"te", name:"this is name4"}, {action:"aorup", ln:"tm", name:"this is name5"}, {action:"aorup", ln:"hi", name:"this is name6"}]}}}]}}
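To make the target shape explicit, here is a minimal Python sketch (plain Python, not Hive, and not a description of how the SerDe works) that builds the nested structure above from a trimmed copy of the sample XML. It assumes the convention implied by the desired output: each element becomes a struct holding its attributes plus a field named after the element that holds its content. The helper names are hypothetical and exist only for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed copy of the sample XML (one <class>, two <name> entries).
SAMPLE = """<classif action="del">
<code>123</code>
<class action="aou">
<party>p1</party>
<description action="up">
<name action="aorup" ln="te">
this is name1
</name>
<name action="aorup" ln="tm">
this is name2
</name>
</description>
</class>
</classif>"""


def name_to_struct(name):
    # Attributes plus the element's own text under the key "name".
    return {
        "action": name.get("action"),
        "ln": name.get("ln"),
        "name": (name.text or "").strip(),
    }


def description_to_struct(desc):
    # Attribute plus the repeated <name> children as an array of structs.
    return {
        "action": desc.get("action"),
        "description": [name_to_struct(n) for n in desc.findall("name")],
    }


def class_to_struct(cls):
    # Attribute plus a nested struct for the child elements.
    return {
        "action": cls.get("action"),
        "class": {
            "party": cls.findtext("party"),
            "description": description_to_struct(cls.find("description")),
        },
    }


def classif_to_struct(root):
    # Top-level struct: root attribute plus the nested "classif" content.
    return {
        "action": root.get("action"),
        "classif": {
            "code": root.findtext("code"),
            "class": [class_to_struct(c) for c in root.findall("class")],
        },
    }


row = classif_to_struct(ET.fromstring(SAMPLE))
print(json.dumps(row, indent=2))
```

The nesting depth here (attribute fields at each level, an array for every repeated element) is what the Hive column type would need to mirror.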
I want to load this entire XML document into a single Hive column. I tried the following:
DROP TABLE classif;
CREATE TABLE classif(
  classif STRUCT<
    Action:STRING,
    classif:STRUCT<
      Code:STRING,
      class:ARRAY<STRUCT<
        Action:STRING,
        class:STRUCT<
          party:STRING,
          description:STRUCT<
            action:STRING,
            description:ARRAY<STRUCT<action:STRING, ln:STRING, name:STRING>>>>
      >>
  >>)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
  "xml.processor.class"="com.ximpleware.hive.serde2.xml.vtd.XmlProcessor",
  "column.xpath.classif"="/classif")
STORED AS INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES ("xmlinput.start"="<classif ", "xmlinput.end"="</classif>");
The output I am getting is:
{"action":"del","classif":{"code":"123","class":[{"action":null,"class":null},{"action":"up","class":null},{"action":null,"class":null},{"action":"up","class":null}]}}
Source: https://stackoverflow.com/questions/44494364/complex-xml-schema-to-hive-schema