Exception while using lateral view in Hive

长情又很酷 2020-12-22 09:16

I am using the below code to parse xml data in Hive. In my xml data, a few tags are repeating so I am using the brickhouse jar and lateral view to parse the tags and place

1 Answer
  •  醉梦人生
    2020-12-22 09:35

    You didn't show what your data looks like in Hive, so here is how I loaded your XML into a table.

    Loader:

    ADD JAR /path/to/jar/hivexmlserde-1.0.5.3.jar;
    
    DROP TABLE IF EXISTS db.tbl;
    CREATE TABLE IF NOT EXISTS db.tbl (
      code STRING,
      entryInfo ARRAY<MAP<STRING,STRING>>
    )
    ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerde'
    WITH SERDEPROPERTIES (
      "column.xpath.code"="/document/code/text()",
      "column.xpath.entryInfo"="/document/entryInfo/*"
    )
    STORED AS
    INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    TBLPROPERTIES (
      "xmlinput.start"="<document>",
      "xmlinput.end"="</document>"
    );
    
    LOAD DATA LOCAL INPATH 'someFile.xml' INTO TABLE db.tbl;
    

    In the Hive-XML-SerDe documentation, section 3 (Arrays) shows an array structure used to handle repeated tags, and section 4 (Maps) shows maps used to handle entries under a sub-tag. So entryInfo will be of type ARRAY<MAP<STRING,STRING>>.
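To make that column shape concrete, here is a rough plain-Python model of what the two `column.xpath.*` properties select. It is only a sketch: the XML below is hypothetical (the question's real input is not shown in this thread), `load_row` is an illustrative helper, and the real SerDe may differ in details. The point is that `/document/entryInfo/*` flattens the children of every `entryInfo` element into one array of single-entry maps.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the real XML from the question is not shown.
SAMPLE_XML = """
<document>
  <code>10160-0</code>
  <entryInfo>
    <statusCode>completed</statusCode>
    <startTime>20110729</startTime>
    <endTime>20110822</endTime>
    <strengthValue>24</strengthValue>
    <strengthUnits>h</strengthUnits>
  </entryInfo>
  <entryInfo>
    <statusCode>completed</statusCode>
    <startTime>20120130</startTime>
    <endTime>20120326</endTime>
    <strengthValue>12</strengthValue>
    <strengthUnits>h</strengthUnits>
  </entryInfo>
</document>
"""


def load_row(xml_text):
    """Model the two column.xpath.* properties of the table definition."""
    root = ET.fromstring(xml_text.strip())
    # Models /document/code/text()
    code = root.findtext("code")
    # Models /document/entryInfo/* : every child of every <entryInfo>,
    # in document order, each as a single-entry map of tag name -> text.
    entry_info = [{el.tag: el.text} for el in root.findall("entryInfo/*")]
    return code, entry_info


code, entry_info = load_row(SAMPLE_XML)
print(code, len(entry_info))
print(entry_info[:2])
```

Note that the boundary between the first and second `entryInfo` is lost in the flattened array, which is why positions matter in the next step.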

    You can then explode this array, collect like key/vals, and re-combine.

    Query:

    ADD JAR /path/to/jar/hivexmlserde-1.0.5.3.jar;
    ADD JAR /path/to/jars/brickhouse-0.7.1.jar;
    
    CREATE TEMPORARY FUNCTION COLLECT AS 'brickhouse.udf.collect.CollectUDAF';
    
    SELECT code
      , m_map['statusCode']    AS status_code
      , m_map['startTime']     AS start_time
      , m_map['endTime']       AS end_time
      , m_map['strengthValue'] AS strength_value
      , m_map['strengthUnits'] AS strength_units
    FROM (
      SELECT code
        , COLLECT(m_keys, m_vals) AS m_map
      FROM (
        SELECT code
          , idx
          , MAP_KEYS(entry_info_map)[0]   AS m_keys
          , MAP_VALUES(entry_info_map)[0] AS m_vals
        FROM (
          SELECT code
            , entry_info_map
            , CASE
               WHEN FLOOR(tmp / 5) = 0 THEN 0
               WHEN FLOOR(tmp / 5) = 1 THEN 1
               WHEN FLOOR(tmp / 5) = 2 THEN 2
               ELSE -1
             END AS idx
          FROM db.tbl
          LATERAL VIEW POSEXPLODE(entryInfo) exptbl AS tmp, entry_info_map ) x ) y
      GROUP BY code, idx ) z
    

    Output:

    code    status_code     start_time      end_time    strength_value  strength_units
    10160-0 completed       20110729        20110822    24              h
    10160-0 completed       20120130        20120326    12              h
    10160-0 completed       20100412        20110822    8               d
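
The query depends on every entry having exactly five fields, so that array positions 0-4 belong to the first entry, 5-9 to the second, and so on. The plain-Python sketch below models that grouping step. It is not Hive code: the input is reconstructed from the Output table (an assumption about the table contents), and `regroup` is a hypothetical helper name.

```python
from collections import defaultdict

FIELDS = ["statusCode", "startTime", "endTime", "strengthValue", "strengthUnits"]

# Reconstructed from the Output table; the actual table contents are assumed.
ROWS = [
    ("completed", "20110729", "20110822", "24", "h"),
    ("completed", "20120130", "20120326", "12", "h"),
    ("completed", "20100412", "20110822", "8", "d"),
]

# Flattened array of single-entry maps, as the entryInfo column would hold it.
entry_info = [{k: v} for row in ROWS for k, v in zip(FIELDS, row)]


def regroup(code, entry_info, width=len(FIELDS)):
    """Model POSEXPLODE, then GROUP BY code, idx, then COLLECT(keys, vals)."""
    groups = defaultdict(dict)
    for pos, single in enumerate(entry_info):      # POSEXPLODE -> (pos, map)
        idx = pos // width                         # FLOOR(tmp / 5)
        key, value = next(iter(single.items()))    # MAP_KEYS(...)[0], MAP_VALUES(...)[0]
        groups[(code, idx)][key] = value           # one merged map per (code, idx)
    return [groups[k] for k in sorted(groups)]


result = regroup("10160-0", entry_info)
for merged in result:
    print("10160-0", *(merged[f] for f in FIELDS))
```

Unlike this sketch, the CASE expression in the query only recognizes the first three entries and puts anything beyond them into a single group with idx -1, so a record with more than three entries would need more branches (or the plain FLOOR(tmp / 5) value as idx). If an entry is missing a field, the fixed width of 5 no longer lines up in either version.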
    

    Also, you've basically asked this question 4 times (one, two, three, four). This is not a good idea. Just ask once, edit to add more information, and be patient.
