I am working in SAS with a dataset that has 2 columns, and I want to remove not only the exact duplicates but also the "almost" duplicates. The data looks like this:
| Brand | Product |
|---|---|
| Coca Cola | Coca Cola Light |
| Coca Cola | Coca Cola Lgt |
| Coca Cola | Cocacolalight |
| Coca Cola | Coca Cola Vanila |
| Pepsi | Pepsi Zero |
| Pepsi | Pepsi Zro |
I do not know if this is actually possible, but after removing the "duplicates" I would like the file to look like this:
| Brand | Product |
|---|---|
| Coca Cola | Coca Cola Light |
| Coca Cola | Coca Cola Vanila |
| Pepsi | Pepsi Zero |
I don't have a preference if the final table will have e.g. "Pepsi Zero" or "Pepsi Zro" as long as there are no "duplicate" values.
I was wondering whether there is a way to compare, for example, the first 4-5 letters and treat the values as duplicates if those letters match. But of course I am open to suggestions. If this can be done in Excel instead, I would be interested to hear that too.
I'm going to start by straight up quoting Jeff's answer:
SAS has at least a couple functions for calculating edit distance between two strings:
Compged, for general edit distance: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206133.htm
Complev, for Levenshtein distance: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206137.htm
There's also the spedis() function for comparing edit distances.
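To get a feel for what these three functions return, here is a minimal sketch that compares a single pair of product names. The pair and the variable names (`a`, `b`, `ged`, `lev`, `spd`) are just illustrative choices, and the values are written to the log rather than to a table.

```sas
/* Sketch: compare one pair of product names with each distance function. */
data _null_;
    length a b $20;
    a = 'Pepsi Zero';
    b = 'Pepsi Zro';
    ged = compged(trim(a), trim(b));  /* generalized edit distance */
    lev = complev(trim(a), trim(b));  /* Levenshtein distance */
    spd = spedis(trim(a), trim(b));   /* spelling distance, scaled by the length of the first argument */
    put a= b= ged= lev= spd=;         /* write the three scores to the log */
run;
```

Lower scores mean more similar strings. You would still have to pick a cutoff below which two products count as "the same", and that cutoff depends on your data.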
Now those are all great, but my personal favorite is the soundex() function, which lets you test whether two words 'sound' the same. It's not going to be 100% correct, but in this case the results work alright.
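As a quick illustration of the idea, the sketch below prints the soundex() code for a few of the product names from the question so you can compare them by eye. The variable names `product` and `code` are made up for this example.

```sas
/* Sketch: print the soundex code of a few product names to the log. */
data _null_;
    length product $20;
    do product = 'Coca Cola Light', 'Coca Cola Lgt',
                 'Cocacolalight', 'Coca Cola Vanila';
        code = soundex(product);  /* phonetic code; vowels, spaces and letter case are ignored */
        put product= code=;
    end;
run;
```

Names that produce the same code are candidates for being treated as duplicates; names with different codes are kept apart.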
First some data:
    data have;
        attrib name     length=$20
               alt_name length=$20;
        infile datalines dsd dlm=',' truncover;
        input name $ alt_name $;
        datalines;
    Coca Cola ,Coca Cola Light
    Coca Cola ,Coca Cola Lgt
    Coca Cola ,Cocacolalight
    Coca Cola ,Coca Cola Vanila
    Pepsi ,Pepsi Zero
    Pepsi ,Pepsi Zro
    ;
    run;
Get every combination of words that we want to compare, and calculate the soundex() values for eyeballing:
    proc sql noprint;
        create table cartesian as
        select a.name,
               a.alt_name as alt_name1,
               b.alt_name as alt_name2,
               soundex(a.alt_name) as soundex_a,
               soundex(b.alt_name) as soundex_b
        from have a, have b
        where a.name = b.name
          and soundex(a.alt_name) eq soundex(b.alt_name)
        ;
    quit;
Now I'll leave it up to you as an exercise to dedupe the resulting list. But basically this will tell you which words match up. Note that every row also matches itself, since the table is joined to itself. If you get false positives among the matches, just add them to an exception list and transform those particular values manually.
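If you just want one surviving row per group, one possible way to do the dedupe is sketched below. It skips the self-join and instead sorts on the soundex code with NODUPKEY. This assumes the HAVE table created above; the names `keyed`, `sdx` and `want` are made up for the example.

```sas
/* Sketch: keep one row per brand and soundex code. Assumes HAVE exists as built above. */
data keyed;
    set have;
    length sdx $20;
    sdx = soundex(alt_name);  /* phonetic key used to group near-duplicates */
run;

/* NODUPKEY keeps the first row of each name/sdx group and drops the rest. */
proc sort data=keyed out=want(drop=sdx) nodupkey;
    by name sdx;
run;
```

Which spelling survives (e.g. "Pepsi Zero" or "Pepsi Zro") depends on the row order in HAVE, which matches the requirement that either one is acceptable. False positives would still need the exception list mentioned above.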