Question
How can I use fuzzy matching in pandas to detect duplicate rows (efficiently)?
How can I find duplicates of one column vs. all the other ones without a gigantic for loop that converts row_i toString() and then compares it to all the others?
Answer 1:
Not pandas specific, but within the Python ecosystem the dedupe library would seem to do what you want. In particular, it allows you to compare each column of a row separately and then combine the information into a single probability score of a match.
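For illustration, a minimal sketch of that workflow on a toy dataset (the column names and the 2.x-style dict field definitions are assumptions; newer dedupe releases declare fields with variable objects instead, so check the current docs):

```python
import dedupe

# Records keyed by an id; each record is a dict of column name -> value
data = {
    0: {"name": "John Smith", "address": "123 Main St"},
    1: {"name": "Jon Smith",  "address": "123 Main Street"},
    2: {"name": "Jane Doe",   "address": "456 Oak Ave"},
}

# Declare which columns to compare and how (dedupe 2.x-style field dicts)
fields = [
    {"field": "name", "type": "String"},
    {"field": "address", "type": "String"},
]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)

# Active learning: interactively label a handful of pairs as duplicate / distinct
dedupe.console_label(deduper)
deduper.train()

# Cluster the records; each cluster carries per-record confidence scores
clusters = deduper.partition(data, threshold=0.5)
for record_ids, confidences in clusters:
    print(record_ids, confidences)
```

The per-column comparisons are what the learned model combines into the single match probability mentioned above, which is what makes this more robust than string-concatenating whole rows and comparing them pairwise.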
Answer 2:
There is now a package to make it easier to use the dedupe library with pandas: pandas-dedupe
(I am a developer of the original dedupe library, but not the pandas-dedupe package)
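A rough sketch of how pandas-dedupe is typically used (the column names here are made up; `dedupe_dataframe` starts an interactive labelling session on the first run and then returns the clustered result):

```python
import pandas as pd
import pandas_dedupe

df = pd.DataFrame(
    {
        "name": ["John Smith", "Jon Smith", "Jane Doe"],
        "address": ["123 Main St", "123 Main Street", "456 Oak Ave"],
    }
)

# Prompts for a short console labelling session, trains a model,
# and returns the frame with cluster / confidence columns appended
deduped = pandas_dedupe.dedupe_dataframe(df, ["name", "address"])
print(deduped.head())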
Source: https://stackoverflow.com/questions/39490190/pandas-fuzzy-detect-duplicates