Question
I have a table containing over 50 columns (both numeric and char). Is there a way to get the overall statistics without specifying each column?
As an example:
a b c d
1 2 3 4
5 6 7 8
9 10 11 12
Ideally I would have something like:
column_name min avg max sum
a 1 5 9 15
b 2 6 10 18
c 3 7 11 21
d 4 8 12 24
Nevertheless, getting one aggregate at a time would still be more than helpful.
Any help/idea would be highly appreciated.
Thank you,
O
Answer 1:
You can parse the DESCRIBE output with AWK and generate a comma-separated string of SUM(col) as sum_col for the numeric columns and a column list for all other columns. This example generates a SELECT statement with GROUP BY. Run in the shell:
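TABLE_NAME=your_schema.your_table
# Numeric columns become sum(COL) as sum_col; the separator c is set only after a column is printed, so the list never starts with a comma
NUMERIC_COLUMNS=$(hive -S -e "set hive.cli.print.header=false; describe ${TABLE_NAME};" | awk -F " " 'f&&!NF{exit}{f=1}f{if($2=="int"||$2=="double"){printf "%ssum(%s) as sum_%s",c,toupper($1),$1;c=","}}')
# All other columns go into the GROUP BY list
GROUP_BY_COLUMNS=$(hive -S -e "set hive.cli.print.header=false; describe ${TABLE_NAME};" | awk -F " " 'f&&!NF{exit}{f=1}f{if($2!="int"&&$2!="double"){printf "%s%s",c,toupper($1);c=","}}')
# Assumes the table has at least one numeric and one non-numeric column
SELECT_STATEMENT="select $NUMERIC_COLUMNS, $GROUP_BY_COLUMNS from $TABLE_NAME group by $GROUP_BY_COLUMNS"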
I'm checking only int and double columns; you can add more types. You can also optimize it by executing DESCRIBE only once and then parsing the result with the same AWK scripts. Hope you get the idea.
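As a rough sketch of the "DESCRIBE only once" idea (not tested against a real cluster): keep the DESCRIBE output in a variable and parse it with two small functions. The names build_numeric_columns, build_group_by_columns and NUMERIC_TYPES are hypothetical, and the list of numeric types is an assumption, so adjust it to your table.

```shell
# Assumption: these are the numeric types to aggregate; extend or trim as needed.
# The trailing ([(]|$) lets "decimal(10,2)" match while rejecting longer type names.
NUMERIC_TYPES='^(tinyint|smallint|int|bigint|float|double|decimal)([(]|$)'

# Hypothetical helper: $1 is the DESCRIBE output, result is "sum(COL) as sum_col,..."
build_numeric_columns() {
  printf '%s\n' "$1" | awk -v re="$NUMERIC_TYPES" '
    f && !NF { exit }          # stop at the blank line before partition info
    { f = 1 }
    NF >= 2 && $2 ~ re {       # numeric column: emit an aggregate
      printf "%ssum(%s) as sum_%s", c, toupper($1), $1
      c = ","                  # separator only after something was printed
    }'
}

# Hypothetical helper: $1 is the DESCRIBE output, result is "COL1,COL2,..."
build_group_by_columns() {
  printf '%s\n' "$1" | awk -v re="$NUMERIC_TYPES" '
    f && !NF { exit }
    { f = 1 }
    NF >= 2 && $2 !~ re {      # non-numeric column: goes into GROUP BY
      printf "%s%s", c, toupper($1)
      c = ","
    }'
}

# Usage (needs the hive CLI, so it is left commented out here):
#   DESCRIBE_OUTPUT=$(hive -S -e "set hive.cli.print.header=false; describe ${TABLE_NAME};")
#   NUMERIC_COLUMNS=$(build_numeric_columns "$DESCRIBE_OUTPUT")
#   GROUP_BY_COLUMNS=$(build_group_by_columns "$DESCRIBE_OUTPUT")
```

Because both functions read from the same variable, Hive is called once instead of twice, and a new type only has to be added to NUMERIC_TYPES.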
Source: https://stackoverflow.com/questions/58008031/hive-is-there-a-way-to-get-the-aggregates-of-all-the-numeric-columns-existing-i