I have an array of 800 sentences. I want to remove all duplicates (sentences that have exactly the same words, but in a different order) from the array. So for example "this is
Use an Object as a lookup table to get a quick hashtable-backed check. Object keys are strings, so you first have to normalise the case, punctuation and word order of each sentence to get one unique key for each combination of words.
// Get key for sentence, removing punctuation and normalising case and word order
// eg 'Hello, a horse!' -> 'x_a hello horse'
// the 'x_' prefix is to avoid clashes with any object properties with undesirable
// special behaviour (like prototype properties in IE) and get a plain lookup
//
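function getSentenceKey(sentence) {
    // strip punctuation before trimming, so 'hello !' does not leave a trailing space
    var cleaned= sentence.toLowerCase().replace(/[^\w\s]+/g, '');
    var trimmed= cleaned.replace(/^\s+/, '').replace(/\s+$/, '');
    // the g flag matters: without it only the first run of whitespace is collapsed
    var words= trimmed.replace(/\s+/g, ' ').split(' ');
    words.sort();
    return 'x_'+words.join(' ');
}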
var lookup= {};
// iterate backwards so that splice() does not shift the items still to be visited
for (var i= sentences.length; i-->0;) {
    var key= getSentenceKey(sentences[i]);
    if (key in lookup)
        sentences.splice(i, 1);
    else
        lookup[key]= true;
}
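If you would rather not modify the array in place, the same idea can be sketched with a Set and filter(). This is only a rough variant, assuming an environment that has ES2015 Set and String.prototype.trim; sentenceKey and dedupeSentences are names made up for this example. Note that it keeps the first occurrence of each duplicate, whereas the backwards loop above keeps the last one.

```javascript
// Hypothetical helper: same normalisation idea as getSentenceKey above,
// minus the 'x_' prefix, which a Set does not need.
function sentenceKey(sentence) {
    var words = sentence.toLowerCase()
        .replace(/[^\w\s]+/g, '')   // drop punctuation
        .trim()
        .split(/\s+/);              // split on any run of whitespace
    return words.sort().join(' ');
}

// Returns a new array; the input array is left untouched.
function dedupeSentences(sentences) {
    var seen = new Set();
    return sentences.filter(function (sentence) {
        var key = sentenceKey(sentence);
        if (seen.has(key)) return false;   // same words already seen
        seen.add(key);
        return true;
    });
}

var sample = ['This is a test.', 'A test, this is!', 'Something else'];
console.log(dedupeSentences(sample));
// logs [ 'This is a test.', 'Something else' ]
```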
This would need some work if you need to support non-ASCII characters (\w doesn't play well with Unicode in JS, and the question of what constitutes a word in some languages is a difficult one). Also, is "foo bar foo" the same sentence as "bar bar foo"?
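For what it's worth, here is one possible way to approach both caveats. It is a sketch, not a complete answer: it assumes an engine with ES2018 Unicode property escapes (\p{...} with the u flag), it still splits on whitespace, so it does nothing for languages that don't separate words with spaces, and unicodeSentenceKey and its ignoreRepeats flag are names invented for this example.

```javascript
// Hypothetical Unicode-aware key. Keeps letters, combining marks and digits
// from any script. Unlike \w it also drops underscores.
function unicodeSentenceKey(sentence, ignoreRepeats) {
    var words = sentence.normalize('NFC')            // unify composed/decomposed forms
        .toLowerCase()
        .replace(/[^\p{L}\p{M}\p{N}\s]+/gu, '')      // drop punctuation and symbols
        .trim()
        .split(/\s+/);
    if (ignoreRepeats)
        words = Array.from(new Set(words));          // treat a repeated word as one
    return words.sort().join(' ');
}

// Accented words survive, so these two get the same key:
console.log(unicodeSentenceKey('Été, café!') === unicodeSentenceKey('café été'));   // true

// Repeated words: different by default, equal when repeats are ignored.
console.log(unicodeSentenceKey('foo bar foo') === unicodeSentenceKey('bar bar foo'));             // false
console.log(unicodeSentenceKey('foo bar foo', true) === unicodeSentenceKey('bar bar foo', true)); // true
```

Whether repeats should count is really a question about what "duplicate" means for your data, so the flag just makes that choice explicit.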