hiveql

How to sort an array and return the index in hive?

Deadly submitted on 2021-02-08 05:28:51
Question: In Hive, I want to sort an array from largest to smallest and get the array of indexes. For example, the table looks like this:

id | value_array
1 | {30, 40, 10, 20}
2 | {10, 30, 40, 20}

I want to get this:

id | value_array
1 | {1, 0, 3, 2}
2 | {2, 1, 3, 0}

The arrays in the result hold the indexes of the original elements. How can I achieve this?

Answer 1: Explode the array using posexplode to get the index and value, sort by value, and collect the array of indexes: select id, collect_list(pos) as result_array from ( select s.id
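The logic the answer describes (pair each element with its position, sort by value in descending order, collect the positions) can be modelled outside Hive. The following is a minimal Python sketch of that logic, not the Hive query itself; the function name `sorted_index_desc` and the `rows` dictionary are illustrative assumptions built from the sample table in the question.

```python
def sorted_index_desc(values):
    """Return the original positions of `values`, ordered from largest to smallest value."""
    # enumerate() plays the role of posexplode: it yields (pos, value) pairs.
    pairs = list(enumerate(values))
    # Sort the pairs by value, largest first (the "sort by value" step).
    pairs.sort(key=lambda pv: pv[1], reverse=True)
    # Keep only the positions (the "collect_list(pos)" step).
    return [pos for pos, _ in pairs]


# Sample rows from the question: id -> value_array (hypothetical in-memory stand-in for the table).
rows = {1: [30, 40, 10, 20], 2: [10, 30, 40, 20]}
result = {row_id: sorted_index_desc(arr) for row_id, arr in rows.items()}
print(result)  # {1: [1, 0, 3, 2], 2: [2, 1, 3, 0]}
```

The output matches the expected result given in the question, which is a quick way to check the approach before writing the Hive query.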

Select if table exists in Apache Hive

坚强是说给别人听的谎言 submitted on 2021-02-07 14:49:24
Question: I have a Hive query of the format: select . . . from table1 left join (select . . . from table2) on (some_condition) table2 might not be present, depending on the environment. So I would like to join only if table2 is present, and otherwise just ignore the subquery. The query below returns the table name if it exists: show tables in {DB_NAME} like '{table_name}' But I don't know how to integrate this into my query so that it selects only if the table exists. Is there a way in hive query to check if

Impala: Show tables like query

南楼画角 submitted on 2021-02-07 14:45:47
Question: I am working with Impala and fetching the list of tables from a database using a pattern, as below. Assume I have a database bank, and the tables under this database are: cust_profile cust_quarter1_transaction cust_quarter2_transaction product_cust_xyz .... .... etc Now I am filtering with show tables in bank like '*cust*' It returns the expected results, namely the tables that have the word cust in their names. Now my requirement is i want all the tables which will have cust

Primary keys and indexes in Hive query language is possible or not?

限于喜欢 submitted on 2021-02-05 08:50:08
Question: We are trying to migrate Oracle tables to Hive and process them. Currently the tables in Oracle have primary key, foreign key and unique key constraints. Can we replicate the same in HiveQL? We are doing some analysis on how to implement it. Answer 1: Hive indexing was introduced in Hive 0.7.0 (HIVE-417) and removed in Hive 3.0 (HIVE-18448). Please read the comments in that Jira. The feature was completely useless in Hive. These indexes were too expensive for big data, RIP. As of Hive 2.1.0 (HIVE-13290)

Extracting strings between distinct characters using hive SQL

纵饮孤独 submitted on 2021-02-05 08:28:06
Question: I have a field called geo_data_display which contains country, region and DMA. The 3 values are contained between = and & characters: country between the first "=" and the first "&", region between the second "=" and the second "&", and DMA between the third "=" and the third "&". Here's a reproducible version of the table. country is always character, but region and DMA can be either numeric or character, and DMA doesn't exist for all countries. A few sample values are: country=us&region=tx
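One way to read the layout described in the question is as key=value pairs separated by "&", so each value can be captured with a regular expression of the form `key=([^&]*)`. The Python sketch below illustrates that pattern only; the helper name `extract_field` and both sample strings are hypothetical (the question's own sample values are cut off), and whether the same pattern is passed to Hive's `regexp_extract` is left as an assumption to verify against real data.

```python
import re


def extract_field(geo, key):
    """Return the text between 'key=' and the next '&', or None if the key is absent."""
    # Anchor the key at the start of the string or right after '&' so that
    # a key such as 'region' cannot match inside a longer name.
    pattern = r"(?:^|&)" + re.escape(key) + r"=([^&]*)"
    match = re.search(pattern, geo)
    return match.group(1) if match else None


# Hypothetical sample values, shaped like the description in the question.
with_dma = "country=us&region=tx&dma=625&"
without_dma = "country=gb&region=eng&"

print(extract_field(with_dma, "country"))   # us
print(extract_field(with_dma, "region"))    # tx
print(extract_field(with_dma, "dma"))       # 625
print(extract_field(without_dma, "dma"))    # None
```

Matching on the key name rather than on the position of each "=" keeps the extraction working when DMA is missing for a country.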

Hive number of reducers in group by and count(distinct)

梦想与她 submitted on 2021-02-04 21:09:57
Question: I was told that count(distinct ) may result in data skew because only one reducer is used. I ran a test on a table with 5 billion rows using 2 queries. Query A: select count(distinct columnA) from tableA Query B: select count(columnA) from (select columnA from tableA group by columnA) a Query A takes about 1000-1500 seconds, while query B takes 500-900 seconds. The result seems expected. However, I noticed that both queries use 370 mappers and 1 reducer and they have almost the

Dynamic partitioning in Hive through the exact inserted timestamp

回眸只為那壹抹淺笑 submitted on 2021-02-04 21:06:34
Question: I need to insert data into a given external table, which should be partitioned by the insertion date. My question is: how does Hive handle the timestamp generation? When I select a timestamp for all inserted records like this: WITH delta_insert AS ( SELECT trg.*, from_unixtime(unix_timestamp()) AS generic_timestamp FROM target_table trg ) SELECT * FROM delta_insert; will the timestamp always be identical for all records, even if the query takes a long time to run? Or should I alternatively only