Insert overwrite on partitioned table is not deleting the existing data

被刻印的时光 ゝ 提交于 2020-06-08 20:01:28

问题


I am trying to run insert overwrite over a partitioned table. The select query of insert overwrite omits one partition completely. Is it the expected behavior?

Table definition

CREATE TABLE `cities_red`(                                                              
  `cityid` int,                                                                         
  `city` string)                                                                        
PARTITIONED BY (                                                                        
  `state` string)                                                                       
ROW FORMAT SERDE                                                                        
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'                                           
STORED AS INPUTFORMAT                                                                   
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'                                     
OUTPUTFORMAT                                                                            
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'                                    
TBLPROPERTIES (                                                                         
  'auto.purge'='true',                                                                  
  'last_modified_time'='1555591782',                                                    
  'transient_lastDdlTime'='1555591782');  

Table Data

+--------------------+------------------+-------------------+--+
| cities_red.cityid  | cities_red.city  | cities_red.state  |
+--------------------+------------------+-------------------+--+
| 13                 | KARNAL           | HARYANA           |
| 13                 | KARNAL           | HARYANA           |
| 1                  | Nagpur           | MH                |
| 22                 | Mumbai           | MH                |
| 22                 | Mumbai           | MH                |
| 755                | BPL              | MP                |
| 755                | BPL              | MP                |
| 10                 | BANGLORE         | TN                |
| 10                 | BANGLORE         | TN                |
| 10                 | BANGLORE         | TN                |
| 10                 | BANGLORE         | TN                |
| 12                 | NOIDA            | UP                |
| 12                 | NOIDA            | UP                |
+--------------------+------------------+-------------------+--+

Queries

insert overwrite table cities_red partition (state) select * from cities_red where city !='NOIDA';

It does not delete any data from the table

insert overwrite table cities_red partition (state) select * from cities_red where city !='Mumbai';

It removes the expected 2 rows from the table.

Is this an expected behavior from Hive in case of partitioned tables?


回答1:


Yes, this is expected behavior.

Insert overwrite table partition select ,,, overwrites only partitions existing in the dataset returned by select.

In your example partition state=UP has records with city='NOIDA' only. Filter where city !='NOIDA' removes entire state=UP partition from the returned dataset and this is why it is not being rewritten.

Filter city !='Mumbai' does not filter entire partition, it is partially returned, this is why it is being overwritten with filtered data.

It works as designed. Consider scenario when you need to overwrite only desired partitions, this is quite normal for the incremental partition load. You do not need to touch other partitions in this case. You need to be able normally to overwrite only desired partitions. And without overwriting unchanged partitions, which can be very expensive to recover.

And if you still want to drop partitions and modify data in existing partitions, then you can drop/create table (you may need to create one more intermediate table for this) and then load partitions into it. Or alternatively calculate partitions which you need to drop separately and execute ALTER TABLE DROP PARTITION.



来源:https://stackoverflow.com/questions/55758708/insert-overwrite-on-partitioned-table-is-not-deleting-the-existing-data

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!