问题
Suppose I have an employee table with column ( emp_id, emp_name, emp_age , emp_update_ts ), updat_ts field is auto updated to current timestamp every time if there is an update on the table. now my question is :
When I update/insert the row in the table and run incremental sqoop import with lastmodified it will make my database and the hdfs in sync but "what if the data is deleted from the table and then I run the sqoop incremental with lastmodified option ? will it also take care of deleting the data from the HDFS to keep them in sync Or there is any other way which can handle the delete situation ?
回答1:
SQOOP Incremental does not support "DELETE" per say unless the records are marked in the source system as "deleted". As per a very interesting document produced by Hortonworks explains.
So I use incremental for any tables that don't ever get deleted records. You would be surprised how many of these exists in most transactional databases.
When I do have tables that might have "DELETE" (and I have no indicator) I import the entire dataset in the staging area and then run a "diff" query in HIVE to identify the deleted records.
See my post on HQL query to do this "diff".
回答2:
SQOOP is just an import/export tool. It does not check for any deleted data. Once your import/export process is completed then you have to compare both db manually by writing query.
OR if your dataset is small, then you can import in csv through impala and check.
回答3:
You can try parsing MySQL binlog to monitor deleted rows if you are using MySQL. Then use MySQL binlog + sqoop to extract data incrementally.
来源:https://stackoverflow.com/questions/41380740/sqoop-incremental-import-how-it-handles-the-data-when-a-row-is-deleted-from-the