Reduce string length by removing contiguous duplicates

I have an R data frame with 2 fields:

ID             WORD
1           AAAAABBBBB
2           ABCAAABBBDDD
3           ...

I'd like to simplify words containing repeated letters by collapsing each run of a repeated letter to a single occurrence:

e.g.: AAAAABBBBB should give me AB and ABCAAABBBDDD should give me ABCABD

Does anyone have an idea how to do this?

Here's a solution with regex:

x <- c('AAAAABBBBB', 'ABCAAABBBDDD')
gsub("([A-Za-z])\\1+","\\1",x)
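
Since the question starts from a data frame, here is a minimal sketch of applying the same pattern to a column. The data frame `df` and the new column name `SHORT` are assumptions made up for illustration; only the `ID` and `WORD` names come from the question.

```r
# Hypothetical data frame mirroring the question's ID / WORD layout
df <- data.frame(
  ID   = 1:2,
  WORD = c("AAAAABBBBB", "ABCAAABBBDDD"),
  stringsAsFactors = FALSE
)

# ([A-Za-z]) captures one letter; \\1+ matches one or more repeats of it.
# Each such run is replaced by the single captured letter.
df$SHORT <- gsub("([A-Za-z])\\1+", "\\1", df$WORD)

print(df$SHORT)  # "AB" "ABCABD"
```

gsub is vectorised over its input, so no explicit loop over rows is needed.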

EDIT: By request, some benchmarking. I added the pattern from Matthew Lundberg's comment, which matches any character. On this test, gsub is more than an order of magnitude faster than the sapply/rle approach (median of about 1.6 to 2.1 ms versus about 69 ms), and matching any character is somewhat faster than matching only letters.

library(microbenchmark)
set.seed(1)
##create sample dataset
x <- apply(
  replicate(100,sample(c(LETTERS[1:3],""),10,replace=TRUE))
,2,paste0,collapse="")
##benchmark
xm <- microbenchmark(
    SAPPLY = sapply(strsplit(x, ''), function(x) paste0(rle(x)$values, collapse=''))
    ,GSUB.LETTER = gsub("([A-Za-z])\\1+","\\1",x)
    ,GSUB.ANY = gsub("(.)\\1+","\\1",x)
)
##print results
print(xm)
# Unit: milliseconds
#          expr       min        lq    median        uq       max
# 1    GSUB.ANY  1.433873  1.509215  1.562193  1.664664  3.324195
# 2 GSUB.LETTER  1.940916  2.059521  2.108831  2.227435  3.118152
# 3      SAPPLY 64.786782 67.519976 68.929285 71.164052 77.261952

##boxplot of times
boxplot(xm)
##plot with ggplot2
library(ggplot2)
qplot(y=time, data=xm, colour=expr) + scale_y_log10()
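
Note that the two gsub patterns benchmarked above only agree when the input consists of letters. A small sketch of the difference, using a made-up input string `s` that also contains digits and spaces:

```r
# Hypothetical input containing repeated letters, digits, and spaces
s <- "AA11  BB"

# Collapses runs of letters only; digits and spaces are left alone
letters_only <- gsub("([A-Za-z])\\1+", "\\1", s)

# Collapses runs of any repeated character
any_char <- gsub("(.)\\1+", "\\1", s)

print(letters_only)  # "A11  B"
print(any_char)      # "A1 B"
```

So the faster any-character pattern is only a drop-in replacement if collapsing non-letter runs is acceptable for the data at hand.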

An alternative without regex, using strsplit and rle:

x <- c('AAAAABBBBB', 'ABCAAABBBDDD')
sapply(strsplit(x, ''), function(x) paste0(rle(x)$values, collapse=''))
## [1] "AB"     "ABCABD"
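
To see why this works, it may help to run the steps on a single string. This is just an unrolled sketch of the one-liner above; the variable names `chars` and `runs` are introduced here for illustration.

```r
# Split one word into individual characters
chars <- strsplit("ABCAAABBBDDD", "")[[1]]

# rle() run-length encodes the vector: one entry per run of equal values
runs <- rle(chars)

print(runs$values)   # "A" "B" "C" "A" "B" "D"
print(runs$lengths)  # 1 1 1 3 3 3

# Keeping only the run values and gluing them together drops the repeats
result <- paste0(runs$values, collapse = "")
print(result)        # "ABCABD"
```

Because rle only merges adjacent equal values, the second A, B run is kept, which matches the contiguous-duplicates requirement in the question.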

Source: https://stackoverflow.com/questions/14159364/reduce-string-length-by-removing-contiguous-duplicates

Tags

string

dataframe

dimensionality-reduction