Stack: Installed HDP-2.3.2.0-2950 using Ambari 2.1
The source DB schema is on SQL Server, and it contains several tables whose primary key is either:
Specify the split column manually. The split column does not have to be the primary key: you can have a complex PK and still use some integer column as the split column. You can specify any integer column, or even a simple function of one (a simple function such as substring or cast, not an aggregate or analytic function). Ideally the split column should be an evenly distributed integer.
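As a rough sketch of that (the connection string, table, and column names below are placeholders, not taken from the question), a --split-by import against SQL Server could look like:

    # order_id is assumed to be an evenly distributed integer column
    sqoop import \
      --connect "jdbc:sqlserver://dbhost:1433;databaseName=sourcedb" \
      --username sqoop_user -P \
      --table orders \
      --split-by order_id \
      --num-mappers 8 \
      --target-dir /data/orders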
For example, if your split column contains a few rows with the value -1 and 10M rows with values between 10,000 and 10,000,000, and num-mappers=8, Sqoop can split the dataset between the mappers very unevenly: it computes the split boundaries as equal-width ranges between the minimum and maximum split-column value, not by row count, so when the values are not uniformly distributed across that range, some mappers receive almost all of the rows while others receive almost none. That results in data skew, and the mapper that lands on the dense part of the range will run forever or even fail. I have also gotten duplicates when using a non-integer split column with MS SQL, so use an integer split column. In your case, with a table that has only two varchar columns, you can either
(1) add a surrogate integer PK and use it as the split column, or
(2) split the data manually with a custom query and a WHERE clause, running sqoop several times with num-mappers=1 (sketched below), or
(3) apply some deterministic, non-aggregating integer function to your varchar column as the split column, for example cast(substr(...) as int), second(timestamp_col), or datepart(second, date).
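A rough sketch of option (2), assuming a hypothetical table customer with varchar columns cust_name and cust_city; the name ranges in the WHERE clause are invented and would have to be chosen to match how your data is actually distributed:

    # one of several runs, each covering a disjoint slice of the table
    sqoop import \
      --connect "jdbc:sqlserver://dbhost:1433;databaseName=sourcedb" \
      --username sqoop_user -P \
      --query "SELECT cust_name, cust_city FROM customer WHERE cust_name >= 'A' AND cust_name < 'H' AND \$CONDITIONS" \
      --num-mappers 1 \
      --target-dir /data/customer/slice1

Repeat the command for each remaining slice, changing only the WHERE range and the target directory; note that Sqoop still requires the $CONDITIONS token in a free-form query even with a single mapper.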
For Teradata you can use the AMP number, HASHAMP(HASHBUCKET(HASHROW(string_column_list))), to derive an integer split value from a list of non-integer key columns.
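One way to apply that (a sketch only; the table and key columns are hypothetical, and the Teradata JDBC driver/connector settings on your cluster may differ) is to expose the AMP number as an extra column in a free-form query and split on its alias:

    # amp_no is an integer derived from the non-integer key columns key1, key2
    sqoop import \
      --connect "jdbc:teradata://tdhost/DATABASE=sourcedb" \
      --username sqoop_user -P \
      --query "SELECT t.*, HASHAMP(HASHBUCKET(HASHROW(t.key1, t.key2))) AS amp_no FROM customer t WHERE \$CONDITIONS" \
      --split-by amp_no \
      --num-mappers 8 \
      --target-dir /data/customer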