Sqoop import : composite primary key and textual primary key

后端 未结 1 677
别那么骄傲
别那么骄傲 2020-11-27 07:18

Stack : Installed HDP-2.3.2.0-2950 using Ambari 2.1

The source DB schema is on sql server and it contains several tables which either have primary key as :

相关标签:
1条回答
  • 2020-11-27 07:58

    Specify split column manually. Split column is not necessarily equal to PK. You can have complex PK and some int Split column. You can specify any integer column or even simple function (some simple function like substring or cast, not aggregation or analytic). Split column preferably should be evenly distributed integer.

    For example if your split column contains few rows with value -1 and 10M rows with values 10000 - 10000000 and num-mappers=8 then sqoop will split the dataset between mappers not evenly:

    • 1st mapper will get few rows with -1,
    • 2nd-7th mappers will get 0 rows,
    • 8th mapper will get almost 10M rows,

    that will result in data skewing and 8th mapper will run for ever or even fail. And I have also got duplicates when used non-integer split-column with MS-SQL. So, use integer split-column. In your case with table with only two varchar columns you can either

    (1) add surrogate int PK and use it also as a split or

    (2) split your data manually using custom query with WHERE clause and run sqoop few times with num-mappers=1, or

    (3) apply some deterministic Integer non-aggregation function to you varchar column, for example cast(substr(...) as int) or second(timestamp_col) or datepart(second, date), etc. as split-column. For Teradata you can use AMP number: HASHAMP (HASHBUCKET (HASHROW (string_column_list))) to get integer AMP number from list of not integer keys.

    0 讨论(0)
提交回复
热议问题