I am using the Dedupe Python package to check my incoming records for duplicates. I have trained it on approx. 500,000 records from a CSV file. Using the Dedupe package, I have cluster