I am trying to get some values out of nested JSON for millions of rows (5 TB+ table). What is the most efficient way to do this?
Here is an example:
{\"c
Here is something you can try quickly. I would suggest using a JSON SerDe, but the example below uses the built-in json_tuple function because it needs no extra setup.
Create a sample file with one JSON document per line:
nano /tmp/hive-parsing-json.json
{"country":"US","page":227,"data":{"ad":{"impressions":{"s":10,"o":10}}}}
Create the base table:
hive > CREATE TABLE hive_parsing_json_table ( json string );
Load the JSON file into the table:
hive > LOAD DATA LOCAL INPATH '/tmp/hive-parsing-json.json' INTO TABLE hive_parsing_json_table;
Query the table:
hive > SELECT v1.Country, v1.Page, v4.impressions_s, v4.impressions_o
FROM hive_parsing_json_table hpjp
LATERAL VIEW json_tuple(hpjp.json, 'country', 'page', 'data') v1 AS Country, Page, data
LATERAL VIEW json_tuple(v1.data, 'ad') v2 AS Ad
LATERAL VIEW json_tuple(v2.Ad, 'impressions') v3 AS Impressions
LATERAL VIEW json_tuple(v3.Impressions, 's', 'o') v4 AS impressions_s, impressions_o;
Output:
v1.country v1.page v4.impressions_s v4.impressions_o
US 227 10 10
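For the JSON SerDe route mentioned at the top, here is a minimal sketch rather than a tested setup. It assumes the hive-hcatalog-core jar is available to Hive (it provides org.apache.hive.hcatalog.data.JsonSerDe; you may need an ADD JAR first) and that every row has the same shape as the sample document. The table name hive_parsing_json_serde is made up for illustration.

```sql
-- Assumption: hive-hcatalog-core is on the classpath; if not, ADD JAR it first.
-- The nested JSON objects are declared as nested STRUCT columns.
CREATE TABLE hive_parsing_json_serde (
  country STRING,
  page    INT,
  `data`  STRUCT<ad: STRUCT<impressions: STRUCT<s: INT, o: INT>>>
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

-- Same sample file as above: one JSON document per line.
LOAD DATA LOCAL INPATH '/tmp/hive-parsing-json.json'
INTO TABLE hive_parsing_json_serde;

-- Nested fields are reached with dot notation, no LATERAL VIEW needed.
SELECT country,
       page,
       `data`.ad.impressions.s AS impressions_s,
       `data`.ad.impressions.o AS impressions_o
FROM hive_parsing_json_serde;
```

With this layout the schema lives in the table definition, so queries stay short. The trade-off is that the STRUCT definition has to match your real documents, so check it against a sample of the actual data before pointing it at the full table.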