I am using the below code to parse xml data in Hive. In my xml data, a few tags are repeating so I am using the brickhouse jar and lateral view to parse the tags and place
You didn't say what your data looks like in Hive, so here is how I loaded your XML into Hive.
Loader:
ADD JAR /path/to/jar/hivexmlserde-1.0.5.3.jar;
DROP TABLE IF EXISTS db.tbl;
CREATE TABLE IF NOT EXISTS db.tbl (
code STRING,
entryInfo ARRAY<MAP<STRING,STRING>>
In the Hive-XML-SerDe documentation, section 3 (Arrays) shows that an array structure is used to handle repeated tags, and section 4 (Maps) shows that maps are used to handle entries under a sub-tag. So, entryInfo will be of type ARRAY<MAP<STRING,STRING>>.
You can then explode this array, collect the key/value pairs that belong to the same entry, and recombine them into one map per entry.
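Before grouping anything, it can help to look at what the explode step produces on its own. This is a minimal sketch, assuming the db.tbl table and entryInfo column from the loader above; it should return one row per array element, with tmp holding the element's position in the array and entry_info_map holding the single-entry map at that position.

```sql
-- The SerDe jar is needed to read the table (same path placeholder as above).
ADD JAR /path/to/jar/hivexmlserde-1.0.5.3.jar;

-- One output row per element of entryInfo.
-- tmp            : zero-based position of the element within the array
-- entry_info_map : the map stored at that position
SELECT code
     , tmp
     , entry_info_map
FROM db.tbl
LATERAL VIEW POSEXPLODE(entryInfo) exptbl AS tmp, entry_info_map;
```

The position column is what makes the regrouping possible: consecutive positions belong to the same repeated tag.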
Query:
ADD JAR /path/to/jar/hivexmlserde-1.0.5.3.jar;
ADD JAR /path/to/jars/brickhouse-0.7.1.jar;
CREATE TEMPORARY FUNCTION COLLECT AS 'brickhouse.udf.collect.CollectUDAF';
SELECT code
     , m_map['statusCode']    AS status_code
     , m_map['startTime']     AS start_time
     , m_map['endTime']       AS end_time
     , m_map['strengthValue'] AS strength_value
     , m_map['strengthUnits'] AS strength_units
FROM (
  SELECT code
       , COLLECT(m_keys, m_vals) AS m_map
  FROM (
    SELECT code
         , idx
         , MAP_KEYS(entry_info_map)[0]   AS m_keys
         , MAP_VALUES(entry_info_map)[0] AS m_vals
    FROM (
      SELECT code
           , entry_info_map
           , CASE
               WHEN FLOOR(tmp / 5) = 0 THEN 0
               WHEN FLOOR(tmp / 5) = 1 THEN 1
               WHEN FLOOR(tmp / 5) = 2 THEN 2
               ELSE -1
             END AS idx
      FROM db.tbl
      LATERAL VIEW POSEXPLODE(entryInfo) exptbl AS tmp, entry_info_map ) x ) y
  GROUP BY code, idx ) z
Output:
code     status_code  start_time  end_time  strength_value  strength_units
10160-0  completed    20110729    20110822  24              h
10160-0  completed    20120130    20120326  12              h
10160-0  completed    20100412    20110822  8               d
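The CASE expression in the query hard-codes three groups, which matches the three rows here. If every entry always carries exactly five key/value pairs (an assumption on my part, based on the five columns selected above), the group index could instead be computed directly from the position. A sketch of the innermost subquery under that assumption:

```sql
-- Assumes each repeated entry contributes exactly 5 consecutive array elements.
-- Positions 0-4 get idx 0, positions 5-9 get idx 1, and so on,
-- so any number of entries per code is handled.
SELECT code
     , entry_info_map
     , FLOOR(tmp / 5) AS idx
FROM db.tbl
LATERAL VIEW POSEXPLODE(entryInfo) exptbl AS tmp, entry_info_map;
```

Unlike the CASE version, this does not lump anything past the third entry into a -1 bucket, so extra entries come out as their own rows instead of being merged together.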
Also, you've basically asked this question 4 times (one, two, three, four). This is not a good idea. Just ask once, edit to add more information, and be patient.