Sqoop: How to deal with duplicate values while importing data from RDBMS to Hive tables

Posted by 笑着哭i on 2020-01-17 06:18:56

Question


Sqoop: how do I deal with duplicate values while importing data from an RDBMS into Hive tables?

Alternatively, is there an option to handle redundancy when the values already exist in the Hive tables?


Answer 1:


If your data has a unique identifier and you are running incremental imports, you can specify it with the --merge-key option of the import. Sqoop then merges the newly imported rows with those already in the table: when two rows share the same key, the newer one overrides the older one.

If you are not running incremental imports, you can use sqoop merge to unify the data. From the Sqoop docs:
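
A minimal sketch of such an incremental import, assuming a hypothetical MySQL table `orders` with a unique `order_id` and a timestamp column `updated_at`. The connection string, user, paths and last value are placeholders, not taken from the question. It uses `--incremental lastmodified`, the mode in which `--merge-key` applies, and writes to an HDFS directory because some Sqoop versions reject `lastmodified` combined with `--hive-import`. By default the function only prints the command; set `RUN=` (empty) to execute it.

```shell
# Hypothetical connection details and names; replace with your own.
JDBC_URL="jdbc:mysql://dbhost/sales"
DB_USER="etl_user"

incremental_import() {
  # ${RUN-echo} expands to "echo" unless RUN is set, so this is a dry run by default.
  ${RUN-echo} sqoop import \
    --connect "$JDBC_URL" \
    --username "$DB_USER" -P \
    --table orders \
    --target-dir /data/orders \
    --incremental lastmodified \
    --check-column updated_at \
    --last-value "2020-01-01 00:00:00" \
    --merge-key order_id
}

incremental_import
```

Rows changed since `--last-value` are imported and then merged into the existing data on `order_id`, so a re-imported row replaces its older copy instead of being appended as a duplicate.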

When merging the datasets, it is assumed that there is a unique primary key value in each record. The column for the primary key is specified with --merge-key. Multiple rows in the same dataset should not have the same primary key, or else data loss may occur.
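
As a rough sketch, a standalone merge could look like the following. The directories, jar file and class name are hypothetical; the jar and class would be the ones generated by an earlier import of the same table (or by `sqoop codegen`). The function prints the command unless `RUN=` (empty) is set.

```shell
merge_datasets() {
  # Dry run by default: ${RUN-echo} expands to "echo" unless RUN is set.
  ${RUN-echo} sqoop merge \
    --new-data /data/orders_new \
    --onto /data/orders_old \
    --target-dir /data/orders_merged \
    --jar-file orders.jar \
    --class-name orders \
    --merge-key order_id
}

merge_datasets
```

For each `order_id`, the record from `--new-data` is kept in preference to the one from `--onto`, and the result is written to `--target-dir`.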

The important thing is that each record has a single unique primary key. If it does not, you can generate one while importing the data: run the import with --query and build a new column in the SELECT by concatenating existing columns until the combination is unique.

--query "SELECT CONVERT(VARCHAR(128), [column1]) + '_' + CONVERT(VARCHAR(128), [column2]) AS CompoundKey, * FROM [dbo].[tableName] WHERE \$CONDITIONS" \
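
For context, the fragment above might sit in a complete command along these lines. This is only a sketch: the SQL Server connection string, user and target directory are assumptions. A free-form `--query` import needs an explicit `--target-dir`, and either `--split-by` or a single mapper (`-m 1`, used here to keep the example short). The function prints the command unless `RUN=` (empty) is set.

```shell
# Hypothetical connection details; replace with your own.
JDBC_URL="jdbc:sqlserver://dbhost:1433;databaseName=mydb"
DB_USER="etl_user"

query_import() {
  # Dry run by default: ${RUN-echo} expands to "echo" unless RUN is set.
  # \$CONDITIONS is escaped so the shell passes the literal token on to Sqoop.
  ${RUN-echo} sqoop import \
    --connect "$JDBC_URL" \
    --username "$DB_USER" -P \
    --query "SELECT CONVERT(VARCHAR(128), [column1]) + '_' + CONVERT(VARCHAR(128), [column2]) AS CompoundKey, * FROM [dbo].[tableName] WHERE \$CONDITIONS" \
    -m 1 \
    --target-dir /data/tableName
}

query_import
```

The generated `CompoundKey` column can then serve as the `--merge-key` in a later merge.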



Answer 2:


Sqoop has no direct option that provides what you are looking for. You will have to set up an EDW-style (enterprise data warehouse) process to achieve your goal:

  1. Import the data into a staging table in Hive (create a staging database for this purpose). It should be a copy of the target table, though data types may vary according to your transformation requirements.
  2. Load the data from the staging table (Hive) into the target table (Hive), applying your transformations. In your case:

    Insert into table trgt.table 
    select * from stg.table stg_tbl 
    where stg_tbl.col1 not in (select col1 from trgt.table);
    

    Here, trgt is the target database and stg is the staging database; both are in Hive.
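
The two steps above can be sketched as a small script. The table `orders`, key column `order_id` and connection details are hypothetical. The second step uses a LEFT JOIN with an IS NULL filter, an alternative to the NOT IN form above: in SQL, NOT IN matches no rows once the subquery returns a NULL, which the join form avoids. Both functions print their commands unless `RUN=` (empty) is set.

```shell
# Hypothetical connection details and names; replace with your own.
JDBC_URL="jdbc:mysql://dbhost/sales"
DB_USER="etl_user"

load_staging() {
  # Step 1: overwrite the Hive staging table with a fresh copy of the source table.
  ${RUN-echo} sqoop import \
    --connect "$JDBC_URL" \
    --username "$DB_USER" -P \
    --table orders \
    --hive-import --hive-overwrite \
    --hive-database stg --hive-table orders
}

load_target() {
  # Step 2: insert only the staging rows whose key is not yet in the target.
  ${RUN-echo} hive -e "
    INSERT INTO TABLE trgt.orders
    SELECT s.*
    FROM stg.orders s
    LEFT JOIN trgt.orders t ON s.order_id = t.order_id
    WHERE t.order_id IS NULL"
}

load_staging && load_target
```

Because the staging table is overwritten on every run, rerunning the script does not insert the same source rows into the target twice.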



Source: https://stackoverflow.com/questions/38418523/sqoop-how-to-deal-with-duplicate-values-while-importing-data-from-rdbms-to-hive
