问题
I am using the below code to parse xml data in Hive. In my xml data, a few tags are repeating so I am using the brickhouse jar and lateral view to parse the tags and place in Hive tables. But when I am executing my code, I am getting an error. Please help as I am not able to understand what I am doing wrong.
Code:
add jar /home/cloudera/brickhouse-0.5.5.jar;
CREATE TEMPORARY FUNCTION numeric_range AS 'brickhouse.udf.collect.NumericRange';
CREATE TEMPORARY FUNCTION array_index AS 'brickhouse.udf.collect.ArrayIndexUDF';
add jar /home/cloudera/hivexmlserde-1.0.5.3.jar;
set hive.exec.mode.local.auto=false;
DROP TABLE IF EXISTS medinfo2;
create table medinfo2 as
select array_index(statusCode,n) AS statusCode,
array_index(startTime,n) AS startTime,
array_index(endTime,n) AS endTime,
array_index(strengthValue,n) AS strengthValue,
array_index(strengthUnits,n) AS strengthUnits
from medications_info7 lateral view numeric_range(size( statusCode )) n1 as n;
Error:
Caused by: java.lang.IndexOutOfBoundsException: Index: 7, Size: 7 at java.util.ArrayList.rangeCheck(ArrayList.java:635) at java.util.ArrayList.get(ArrayList.java:411) at com.ibm.spss.hive.serde2.xml.objectinspector.XmlListObjectInspector.getListElement(XmlListObjectInspector.java:79) at brickhouse.udf.collect.ArrayIndexUDF.evaluate(ArrayIndexUDF.java:59) at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:186) at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77) at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluatorHead._evaluate(ExprNodeEvaluatorHead.java:44) at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77) at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65) at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:77) ... 25 more
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask MapReduce Jobs Launched: Stage-Stage-1: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL Total MapReduce CPU Time Spent: 0 msec
Sample:
<document>
<code>10160-0</code>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20110729</startTime>
<endTime>20110822</endTime>
<strengthValue>24</strengthValue>
<strengthUnits>h</strengthUnits>
</entryInfo>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20120130</startTime>
<endTime>20120326</endTime>
<strengthValue>12</strengthValue>
<strengthUnits>h</strengthUnits>
</entryInfo>
<entryInfo>
<statusCode>completed</statusCode>
<startTime>20100412</startTime>
<endTime>20110822</endTime>
<strengthValue>8</strengthValue>
<strengthUnits>d</strengthUnits>
</entryInfo>
</document>
My actual sample is huge in size and contains a lot of these tags which are repeated.
回答1:
I don't know what your data looks like in Hive because you didn't provide that information so here is how I loaded your XML into Hive.
Loader:
ADD JAR /path/to/jar/hivexmlserde-1.0.5.3.jar;
DROP TABLE IF EXISTS db.tbl;
CREATE TABLE IF NOT EXISTS db.tbl (
code STRING,
entryInfo ARRAY<MAP<STRING,STRING>>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerde'
WITH SERDEPROPERTIES (
"column.xpath.code"="/document/code/text()",
"column.xpath.entryInfo"="/document/entryInfo/*"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
TBLPROPERTIES (
"xmlinput.start"="<document>",
"xmlinput.end"="</document>"
);
LOAD DATA LOCAL INPATH 'someFile.xml' INTO TABLE db.tbl;
In the Hive-XML-SerDe documentation under section 3 - Arrays, you can see that they use an array structure to handle repeated tags and in 4 - Maps, you can see that they use maps to handle entries under a sub-tag. So, entryInfo
will be of type ARRAY<MAP<STRING,STRING>>
.
You can then explode this array, collect like key/vals, and re-combine.
Query:
ADD JAR /path/to/jar/hivexmlserde-1.0.5.3.jar;
ADD JAR /path/to/jars/brickhouse-0.7.1.jars;
CREATE TEMPORARY FUNCTION COLLECT AS 'brickhouse.udf.collect.CollectUDAF';
SELECT code
, m_map['statusCode'] AS status_code
, m_map['startTime'] AS start_time
, m_map['endTime'] AS end_time
, m_map['strengthValue'] AS strength_value
, m_map['strengthUnits'] AS strength_units
FROM (
SELECT code
, COLLECT(m_keys, m_vals) AS m_map
FROM (
SELECT code
, idx
, MAP_KEYS(entry_info_map)[0] AS m_keys
, MAP_VALUES(entry_info_map)[0] AS m_vals
FROM (
SELECT code
, entry_info_map
, CASE
WHEN FLOOR(tmp / 5) = 0 THEN 0
WHEN FLOOR(tmp / 5) = 1 THEN 1
WHEN FLOOR(tmp / 5) = 2 THEN 2
ELSE -1
END AS idx
FROM db.tbl
LATERAL VIEW POSEXPLODE(entryInfo) exptbl AS tmp, entry_info_map ) x ) y
GROUP BY code, idx ) z
Output:
code status_code start_time end_time strength_value strength_units
10160-0 completed 20110729 20110822 24 h
10160-0 completed 20120130 20120326 12 h
10160-0 completed 20100412 20110822 8 d
Also, you've basically asked this question 4 times (one, two, three, four). This is not a good idea. Just ask once, edit to add more information, and be patient.
来源:https://stackoverflow.com/questions/41481582/exception-while-using-lateral-view-in-hive