Reduce string length by removing contiguous duplicates

拟墨画扇 提交于 2019-12-07 08:35:25

问题


I have an R dataframe whith 2 fields:

ID             WORD
1           AAAAABBBBB
2           ABCAAABBBDDD
3           ...

I'd like to simplify the words with repeating letters by keeping only the letter and not the duplicates in a repetition:

e.g.: AAAAABBBBB should give me AB and ABCAAABBBDDD should give me ABCABD

Anyone has an idea on how to do this?


回答1:


Here's a solution with regex:

x <- c('AAAAABBBBB', 'ABCAAABBBDDD')
gsub("([A-Za-z])\\1+","\\1",x)

EDIT: By request, some benchmarking. I added Matthew Lundberg's pattern in the comment, matching any character. It appears that gsub is faster by an order of magnitude, and matching any character is faster than matching letters.

library(microbenchmark)
set.seed(1)
##create sample dataset
x <- apply(
  replicate(100,sample(c(LETTERS[1:3],""),10,replace=TRUE))
,2,paste0,collapse="")
##benchmark
xm <- microbenchmark(
    SAPPLY = sapply(strsplit(x, ''), function(x) paste0(rle(x)$values, collapse=''))
    ,GSUB.LETTER = gsub("([A-Za-z])\\1+","\\1",x)
    ,GSUB.ANY = gsub("(.)\\1+","\\1",x)
)
##print results
print(xm)
# Unit: milliseconds
         # expr       min        lq    median        uq       max
# 1    GSUB.ANY  1.433873  1.509215  1.562193  1.664664  3.324195
# 2 GSUB.LETTER  1.940916  2.059521  2.108831  2.227435  3.118152
# 3      SAPPLY 64.786782 67.519976 68.929285 71.164052 77.261952

##boxplot of times
boxplot(xm)
##plot with ggplot2
library(ggplot2)
qplot(y=time, data=xm, colour=expr) + scale_y_log10()



回答2:


x <- c('AAAAABBBBB', 'ABCAAABBBDDD')
sapply(strsplit(x, ''), function(x) paste0(rle(x)$values, collapse=''))
## [1] "AB"     "ABCABD"


来源:https://stackoverflow.com/questions/14159364/reduce-string-length-by-removing-contiguous-duplicates

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!