Sqoop import produces duplicate or partial records when we use the following settings:
- --query: custom query
- --split-by: non-integer column (char)
- --num-mappers: more than 2
Verified the source record count: say, 1000 records.
Verified the imported record count: say, 1923 records.
When the --split-by field is non-integer, Sqoop uses TextSplitter, which logs the following warnings:
WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records
WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
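
One way to see why the warning matters: the split boundaries are computed from character values, while the database applies its own collation when it evaluates each mapper's range predicate. The sketch below is a deliberately simplified simulation in Python, not Sqoop's actual TextSplitter code. The single-character keys, the helper names (split_points, run_import), and the use of str.casefold to mimic a case-insensitive collation are all assumptions made for illustration.

```python
from typing import Callable, List


def split_points(lo: str, hi: str, num_splits: int) -> List[str]:
    """Pick evenly spaced single-character boundaries by code point.

    Simplified stand-in for a text splitter: it only knows about
    character codes, not about the database collation.
    """
    lo_cp, hi_cp = ord(lo), ord(hi)
    return [chr(lo_cp + (hi_cp - lo_cp) * i // num_splits)
            for i in range(num_splits + 1)]


def run_import(rows: List[str], bounds: List[str],
               key: Callable[[str], str]) -> List[str]:
    """Simulate one mapper per range; `key` models how the database compares."""
    imported = []
    last = len(bounds) - 2
    for i in range(len(bounds) - 1):
        lo, hi = bounds[i], bounds[i + 1]
        for r in rows:
            # Each mapper selects lo <= col < hi; the last range is inclusive.
            upper_ok = key(r) <= key(hi) if i == last else key(r) < key(hi)
            if key(r) >= key(lo) and upper_ok:
                imported.append(r)
    return imported


# Hypothetical sample data: eight rows keyed by a single character.
rows = ["A", "B", "M", "Z", "a", "b", "m", "z"]
bounds = split_points("A", "z", 4)

# Database compares by code point (case-sensitive): ranges do not overlap.
case_sensitive = run_import(rows, bounds, key=lambda s: s)

# Database compares case-insensitively: ranges overlap, rows are re-read.
case_insensitive = run_import(rows, bounds, key=str.casefold)

print(bounds)
print(len(rows), len(case_sensitive), len(case_insensitive))
```

With these eight sample rows, the case-sensitive run returns each row exactly once, while the case-insensitive run returns 14 rows because several ranges select the same keys. A different set of boundaries could just as well leave some keys outside every range, which would show up as a partial import.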
- Solution 1: use a single mapper, or at most 2.
- Solution 2: use a rank function in the query and set --split-by to the rank field.
- Solution 3: sort the --split-by field in ascending order in the query.
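
As a rough illustration of solution 2, the Python sketch below assembles a Sqoop command line that splits on a generated integer column. Everything specific here is an assumption: the connection string, user, table (orders), ordering column (order_code), target directory, and the helper name build_sqoop_import are hypothetical, and ROW_NUMBER() is swapped in for the "rank function" because it yields unique integers. Whether the window function is available depends on the source database.

```python
import shlex
from typing import List


def build_sqoop_import(connect: str, username: str, table: str,
                       order_col: str, target_dir: str,
                       num_mappers: int = 4) -> List[str]:
    """Build argv for a free-form query import split on an integer column."""
    # The window function is computed in an inner query so that the
    # $CONDITIONS placeholder in the outer WHERE clause can filter on rn.
    query = (
        "SELECT * FROM ("
        f"SELECT t.*, ROW_NUMBER() OVER (ORDER BY {order_col}) AS rn "
        f"FROM {table} t"
        ") ranked WHERE $CONDITIONS"
    )
    return [
        "sqoop", "import",
        "--connect", connect,
        "--username", username,
        "--query", query,
        "--split-by", "rn",            # integer column, so no text splitting
        "--num-mappers", str(num_mappers),
        "--target-dir", target_dir,    # needed for free-form query imports
    ]


# Hypothetical values, for illustration only.
argv = build_sqoop_import(
    connect="jdbc:mysql://dbhost/sales",
    username="etl_user",
    table="orders",
    order_col="order_code",
    target_dir="/data/orders",
)

# shlex.join single-quotes the query, so the shell leaves $CONDITIONS alone.
print(shlex.join(argv))
```

Each mapper runs the query independently, so the numbering is only stable if order_col is unique and the source data does not change during the import. The generated rn column also ends up in the imported data unless the outer SELECT lists the wanted columns explicitly.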
Source: https://stackoverflow.com/questions/32197895/partial-and-duplicate-records-while-sqoop-import