I am trying to get some values out of nested JSON for millions of rows (5 TB+ table). What is the most efficient way to do this?
Here is an example:
You can do this with Hive's native JSON SerDe ('org.apache.hive.hcatalog.data.JsonSerDe'). Here are the steps.
First, add the SerDe JAR:
ADD JAR /path/to/hive-hcatalog-core.jar;
Create a table whose struct columns mirror the nesting of the JSON:
CREATE TABLE json_serde_nestedjson (
  country string,
  page int,
  data struct < ad: struct < impressions: struct < s:int, o:int > > >
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
Then load the data (stored in a file, one JSON object per line):
LOAD DATA LOCAL INPATH '/tmp/nested.json' INTO TABLE json_serde_nestedjson;
Then select the required fields using dot notation:
SELECT country, page, data.ad.impressions.s, data.ad.impressions.o
FROM json_serde_nestedjson;
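Since the question is about efficiency on a very large table, one possible follow-up is to parse the JSON once and store the extracted fields in a columnar table, so later queries do not re-parse the text. This is only a sketch, assuming your Hive version supports ORC and that repeated queries justify the one-time rewrite; the table name nested_flat and the column aliases are hypothetical.

```sql
-- Hypothetical target table: nested_flat (name chosen for illustration only).
-- Parses each JSON row once through the SerDe and writes the flattened
-- fields in ORC format.
CREATE TABLE nested_flat
STORED AS ORC
AS
SELECT
  country,
  page,
  data.ad.impressions.s AS impressions_s,  -- alias is an assumption
  data.ad.impressions.o AS impressions_o   -- alias is an assumption
FROM json_serde_nestedjson;

-- Later queries read the flat columns directly.
SELECT country, SUM(impressions_s) AS total_s
FROM nested_flat
GROUP BY country;
```

Whether this pays off depends on how often the data is queried; for a single pass over the table, the plain SELECT above may be enough.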