Question
Sqoop: How to deal with duplicate values while importing data from an RDBMS into Hive tables?
Or, how to handle redundancy when the values are already present in the Hive tables?
Answer 1:
If your data has a unique identifier and you are running incremental imports, you can specify it with the --merge-key option of the import. This merges the rows that were already in the table with the newly imported ones; the newer record overrides the older.
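A minimal sketch of such an incremental import (the connection string, table, and column names are placeholders, not taken from the question):

```shell
# Hypothetical incremental import: rows whose last_modified value changed
# since the last run are re-imported and merged into the existing dataset
# on the primary key "id" (newer record wins).
sqoop import \
  --connect jdbc:mysql://dbhost/mydb \
  --username dbuser -P \
  --table orders \
  --incremental lastmodified \
  --check-column last_modified \
  --merge-key id \
  --target-dir /user/hive/warehouse/orders
```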
If you are not running incremental imports, you can use sqoop merge to unify the data. From the Sqoop docs:
When merging the datasets, it is assumed that there is a unique primary key value in each record. The column for the primary key is specified with --merge-key. Multiple rows in the same dataset should not have the same primary key, or else data loss may occur.
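For the non-incremental case, the standalone merge tool takes the two datasets plus the record class generated by the original import; a sketch, with all paths and names below being placeholders:

```shell
# Hypothetical merge of a newer import onto an older one; rows with the
# same id resolve in favor of the record in --new-data.
sqoop merge \
  --new-data /data/orders_new \
  --onto /data/orders_old \
  --target-dir /data/orders_merged \
  --jar-file orders.jar \
  --class-name orders \
  --merge-key id
```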
The important thing is that you have a single unique primary key for each record. Otherwise, you can generate one while importing the data: run the import with --query and build the new key column in the SELECT by concatenating existing columns until you get a unique combination.
--query "SELECT CONVERT(VARCHAR(128), [colum1]) + '_' + CONVERT(VARCHAR(128), [column2]) AS CompoundKey ,* FROM [dbo].[tableName] WHERE \$CONDITIONS" \
Answer 2:
There is no direct option in sqoop that provides the solution you are looking for. You will have to set up an EDW-style staging process to achieve your goal:
- Import the data into a staging table (create a staging database in Hive for this purpose). This should be a copy of the target table, though data types may vary depending on your transformation requirements.
- Load the data from the staging table (Hive) into the target table (Hive), applying transformations. In your case:
INSERT INTO TABLE trgt.table SELECT * FROM stg.table stg_tbl WHERE stg_tbl.col1 NOT IN (SELECT col1 FROM trgt.table);
Here trgt is the target database and stg is the staging database; both are in Hive.
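The NOT IN filter above is an anti-join: only staging keys absent from the target are inserted. A minimal sketch of that selection logic, using sorted key files in place of the Hive tables (file names and sample keys are made up for illustration):

```shell
# Hypothetical key columns dumped from the two tables, sorted.
printf '1\n2\n' > /tmp/target_keys.txt
printf '2\n3\n' > /tmp/staging_keys.txt
# comm -13 suppresses lines unique to the target (-1) and common lines (-3),
# leaving only staging keys not yet in the target: the rows INSERT would add.
comm -13 /tmp/target_keys.txt /tmp/staging_keys.txt
# → 3
```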
Source: https://stackoverflow.com/questions/38418523/sqoop-how-to-deal-with-duplicate-values-while-importing-data-from-rdbms-to-hive