Hive: parsing JSON

前端 未结 4 2038
遥遥无期
遥遥无期 2021-02-04 04:11

I am trying to get some values out of nested JSON for millions of rows (5 TB+ table). What is the most efficient way to do this?

Here is an example:

{\"c         


        
相关标签:
4条回答
  • 2021-02-04 04:39

    Implementing a SerDe to parse your data in JSON is a better way for your case.

    A tutorial on how to implement SerDe for parsing JSON can be found here

    http://blog.cloudera.com/blog/2012/12/how-to-use-a-serde-in-apache-hive/

    You can use the following sample SerDe implementation as well

    https://github.com/rcongiu/Hive-JSON-Serde

    0 讨论(0)
  • 2021-02-04 04:40

    Using hive native json-serde('org.apache.hive.hcatalog.data.JsonSerDe') you can do this.. here are the steps

    ADD JAR /path/to/hive-hcatalog-core.jar;

    create a table as below 
     CREATE TABLE json_serde_nestedjson (
      country string,
      page int,
      data struct < ad: struct < impressions: struct < s:int, o:int  > > >
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
    

    then load data(stored in file)

    LOAD DATA LOCAL INPATH '/tmp/nested.json' INTO TABLE json_serde_nestedjson;
    

    then get required data using

    SELECT country, page, data.ad.impressions.s, data.ad.impressions.o 
    FROM json_serde_nestedjson;  
    
    0 讨论(0)
  • 2021-02-04 04:41

    Here is what you can quickly try , I would suggest to use Json-Ser-De.

    nano /tmp/hive-parsing-json.json

    {"country":"US","page":227,"data":{"ad":{"impressions":{"s":10,"o":10}}}}
    

    Create base table :

    hive > CREATE TABLE hive_parsing_json_table ( json string );
    

    Load json file to Table :

    hive > LOAD DATA LOCAL INPATH  '/tmp/hive-parsing-json.json' INTO TABLE hive_parsing_json_table;
    

    Query the table :

    hive >  select v1.Country, v1.Page, v4.impressions_s, v4.impressions_o 
    from hive_parsing_json_table hpjp
         LATERAL VIEW json_tuple(hpjp.json, 'country', 'page', 'data') v1
         as Country, Page, data
         LATERAL VIEW json_tuple(v1.data, 'ad') v2
         as Ad
         LATERAL VIEW json_tuple(v2.Ad, 'impressions') v3
         as Impressions
         LATERAL VIEW json_tuple(v3.Impressions, 's' , 'o') v4
         as impressions_s,impressions_o;  
    

    Output :

    v1.country  v1.page     v4.impressions_s    v4.impressions_o
    US      227     10          10
    
    0 讨论(0)
  • 2021-02-04 04:45

    You can use get_json_object:

     select get_json_object(fieldname, '$.country'), 
            get_json_object(fieldname, '$.data.ad.s') from ... 
    

    You will get better performance with json_tuple but I found a "how to" to get the values in json inside json; To formating your table you can use something like this:

    from table t lateral view explode( split(regexp_replace(get_json_object(ln, ''$.data.ad.s'), '\\[|\\]', ''), ',' ) ) tb1 as s this code above will transform you "Array" in a column.

    form more: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

    I hope this help ...

    0 讨论(0)
提交回复
热议问题