Question:

I am working in SAS and I have a data-set with 2 columns and I want not only to remove the duplicates, but also the "almost" duplicates. The data looks like this:

    Brand        Product
    Coca Cola    Coca Cola Light
    Coca Cola    Coca Cola Lgt
    Coca Cola    Cocacolalight
    Coca Cola    Coca Cola Vanila
    Pepsi        Pepsi Zero
    Pepsi        Pepsi Zro

I do not know if it is actually possible, but this is what I would like the file to look like after removing the "duplicates":

    Brand        Product
    Coca Cola    Coca Cola Light
    Coca Cola    Coca Cola Vanila
    Pepsi        Pepsi Zero

I don't have a preference if the final table will have e.g. "Pepsi Zero" or "Pepsi Zro" as long as there are no "duplicate" values.

I was wondering whether there is a way to compare, e.g., the first 4-5 letters and treat the values as duplicates if they are the same. But of course I am open to suggestions. If it can be done even in Excel, I would be interested to hear it.

Answer 1:

I'm going to start by straight up quoting Jeff's answer:

SAS has at least a couple functions for calculating edit distance between two strings:

Compged, for general edit distance: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206133.htm

Complev, for Levenshtein distance: http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a002206137.htm

There's also the spedis() function for comparing edit distances.
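As a rough sketch of how one of these could be applied (not part of Jeff's answer), the step below scores every pair of products within a brand using complev(). It assumes the HAVE data set built further down, with columns `name` and `alt_name`. The output table `lev_pairs`, the column `lev_dist`, and the cutoff of 3 are hypothetical choices for illustration and would need tuning on real data.

```sas
/* Sketch only: pair up products within the same brand and keep  */
/* pairs whose Levenshtein distance is small.                    */
/* lev_pairs, lev_dist and the threshold 3 are assumptions.      */
proc sql;
  create table lev_pairs as
  select a.name,
         a.alt_name as alt_name1,
         b.alt_name as alt_name2,
         /* upcase() so that differences in case do not count as edits */
         complev(upcase(a.alt_name), upcase(b.alt_name)) as lev_dist
  from have a, have b
  where a.name = b.name
    and a.alt_name < b.alt_name      /* each pair once, no self-pairs */
    and calculated lev_dist <= 3
  ;
quit;
```

A fixed cutoff is a blunt instrument: it tends to be too loose for short strings and too strict for long ones, so scaling it by string length may work better.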

Now those are all great, but my personal favorite is the soundex() function which will allow you to test if two words 'sound' the same. It's not going to be 100% correct but in this case the results work alright.

First some data:

    Data HAVE;
      attrib name length=$20 alt_name length=$20;
      infile datalines dsd dlm=',' truncover;
      input name $ alt_name $;
      datalines;
    Coca Cola    ,Coca Cola Light
    Coca Cola    ,Coca Cola Lgt
    Coca Cola    ,Cocacolalight
    Coca Cola    ,Coca Cola Vanila
    Pepsi        ,Pepsi Zero
    Pepsi        ,Pepsi Zro
    ;
    Run;

Get every combination of words that we want to compare, and calculate the soundex()s for eyeballing:

    proc sql noprint;
      create table cartesian as
      select a.name,
             a.alt_name as alt_name1,
             b.alt_name as alt_name2,
             soundex(a.alt_name) as soundex_a,
             soundex(b.alt_name) as soundex_b
      from have a, have b
      where a.name = b.name
        and soundex(a.alt_name) eq soundex(b.alt_name)
      ;
    quit;

Now I'll leave it up to you as an exercise to dedupe the resulting list. But basically this will tell you which words match up. If you get false positives among the matches, just add them to an exception list and transform those particular values manually.
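One possible way to finish that exercise, as a minimal sketch and not necessarily what the answer had in mind: compute the soundex() code once per row, then keep a single row per brand and code with PROC SORT NODUPKEY. The names `keyed`, `want`, and `sdx` are hypothetical.

```sas
/* Sketch only: treat rows with the same brand and the same      */
/* soundex code as duplicates. keyed, want and sdx are           */
/* illustrative names, not from the original answer.             */
data keyed;
  set have;
  sdx = soundex(alt_name);   /* phonetic key for the product name */
run;

/* NODUPKEY keeps one observation per BY group (name, sdx) */
proc sort data=keyed out=want(drop=sdx) nodupkey;
  by name sdx;
run;
```

Which spelling survives in each group depends on the order of the input rows, which fits the question's statement that either variant is acceptable. Any false positives would still need the exception list mentioned above.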
