Hive XML SerDe: table is empty

我在风中等你 2021-01-21 12:38

I want to store XML data in a Hive table. The XML data:


   1266 
     

        
5 answers
  • 2021-01-21 12:54

    According to the Hive DDL documentation, the LOCATION clause expects an hdfs_path. Try specifying only the directory, not the full path to your XML file. Note that if you use LOAD DATA after CREATE TABLE instead, you give up the external-table approach, which can be useful in some cases.

    Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/TruncateTable

  • 2021-01-21 12:57

    The entire XML file should be a single line (i.e. no newlines in the XML). (A simple unix command to strip newlines is tr '\n\r' ' ' < source.xml > processed.xml.)

    https://github.com/dvasilen/Hive-XML-SerDe/wiki/XML-data-sources
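
A minimal sketch of the flattening step above, with a check that no newlines remain. The file names and the sample record are assumptions made for illustration; they are not the asker's actual data.

```shell
# Hypothetical sample input standing in for the original multi-line XML.
cat > source.xml <<'EOF'
<servicestatus id="1266">
  <status_text>OK</status_text>
</servicestatus>
EOF

# Replace every newline and carriage return with a space.
tr '\n\r' ' ' < source.xml > processed.xml

# Count the newlines left in the output; 0 means the file is a single line.
newline_count=$(wc -l < processed.xml | tr -d ' ')
echo "newlines left: $newline_count"
```

The flattened processed.xml is the file you would then place in the table's directory or pass to LOAD DATA.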

  • 2021-01-21 13:01

    Well, the code looks good. As per the example in this link, it should work for you.

    By the way, there is a typo in the code you provided. In the table definition, status_test STRING should be status_text STRING, or vice versa, so that the column name matches the name used in the column.xpath property.

  • 2021-01-21 13:02

    For LOCATION, give only the directory instead of the file:

    create table xml_AIR(id STRING, status_text STRING,display_name STRING ,host_name STRING)
    row format serde 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
    with serdeproperties(
    "column.xpath.id"="/servicestatus/@id",
    "column.xpath.status_text"="/servicestatus/status_text/text()",
    "column.xpath.display_name"="/servicestatus/display_name/text()",
    "column.xpath.host_name"="/servicestatus/host_name/text()"
    )
    stored as
    inputformat 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    outputformat 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION  '/user/cloudera/input'
    tblproperties(
    "xmlinput.start"="<servicestatus",
    "xmlinput.end"="</servicestatus>"
    );
    
  • 2021-01-21 13:14

    I ran into the same problem when working with the XML SerDe. After some struggle, I fixed it by running a separate LOAD DATA statement and leaving the LOCATION clause out of the CREATE statement. The following is my XML data.

    <record customer_id="0000-JTALA">
        <income>200000</income>
        <demographics>
            <gender>F</gender>
            <agecat>1</agecat>
            <edcat>1</edcat>
            <jobcat>2</jobcat>
            <empcat>2</empcat>
            <retire>0</retire>
            <jobsat>1</jobsat>
            <marital>1</marital>
            <spousedcat>1</spousedcat>
            <residecat>4</residecat>
            <homeown>0</homeown>
            <hometype>2</hometype>
            <addresscat>2</addresscat>
        </demographics>
        <financial>
            <income>18</income>
            <creddebt>1.003392</creddebt>
            <othdebt>2.740608</othdebt>
            <default>0</default>
        </financial>
    </record>
    

    CREATE TABLE Statement:

    CREATE TABLE xml_bank6(customer_id STRING, income BIGINT, demographics map<string,string>, financial map<string,string>)
    ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
    WITH SERDEPROPERTIES (
    "column.xpath.customer_id"="/record/@customer_id",
    "column.xpath.income"="/record/income/text()",
    "column.xpath.demographics"="/record/demographics/*",
    "column.xpath.financial"="/record/financial/*"
    )
    STORED AS
    INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    TBLPROPERTIES (
    "xmlinput.start"="<record customer",
    "xmlinput.end"="</record>"
    );
    

    CREATE Query Result:

    OK
    Time taken: 0.925 seconds
    hive>
    

    After the CREATE statement above, I used the following LOAD DATA statement to load the data from an XML file into the newly created table.

    hive> load data local inpath '/home/mahesh/hive_input_datasets/XMLdata/XMLdatafile.xml' overwrite into table xml_bank6;
    

    LOAD Query Result:

    Copying data from file:/home/mahesh/hive_input_datasets/XMLdata/XMLdatafile.xml
    Copying file: file:/home/mahesh/hive_input_datasets/XMLdata/XMLdatafile.xml
    Loading data to table default.xml_bank6
    Table default.xml_bank6 stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 500, raw_data_size: 0]
    OK
    Time taken: 0.879 seconds
    hive>
    

    And finally,

    SELECT Query and Result:

    hive> select * from xml_bank6;
    OK
    0000-JTALA  200000  {"empcat":"2","jobcat":"2","residecat":"4","retire":"0","hometype":"2","addresscat":"2","homeown":"0","spousedcat":"1","gender":"F","jobsat":"1","edcat":"1","marital":"1","agecat":"1"}    {"default":"0","income":"18","othdebt":"2.740608","creddebt":"1.003392"}
    Time taken: 0.149 seconds, Fetched: 1 row(s)
    hive>
    

    In your query, I would suggest setting "xmlinput.start" to "<servicestatus id" instead of "<servicestatus", because the XML start tag has the form <servicestatus id="some data">. I hope this helps.
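
One possible reason the longer start marker helps, sketched with grep. This assumes the marker is matched as a literal string, and it assumes (hypothetically, since the question's XML is not shown here) that the records sit inside a wrapper element whose name begins with the same prefix. The wrapper name servicestatuslist and the file name sample.xml are made up for the illustration.

```shell
# Hypothetical sample: a wrapper element whose name shares the record tag's prefix.
printf '%s' '<servicestatuslist><servicestatus id="1266"><status_text>OK</status_text></servicestatus></servicestatuslist>' > sample.xml

# The short marker also matches the start of the wrapper's opening tag.
short_hits=$(grep -o '<servicestatus' sample.xml | wc -l | tr -d ' ')

# The longer marker matches only the record's opening tag.
long_hits=$(grep -o '<servicestatus id' sample.xml | wc -l | tr -d ' ')

echo "short marker hits: $short_hits, long marker hits: $long_hits"
```

If the counts differ on your own file, the shorter marker is ambiguous and the record boundaries may be detected in the wrong place.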
