I have an R data frame with 2 fields:
ID WORD
1 AAAAABBBBB
2 ABCAAABBBDDD
3 ...
I'd like to simplify words that contain runs of repeated letters by keeping a single letter from each run:
e.g. AAAAABBBBB
should give me AB
and ABCAAABBBDDD
should give me ABCABD
Does anyone have an idea how to do this?
Here's a solution with regex:
x <- c('AAAAABBBBB', 'ABCAAABBBDDD')
gsub("([A-Za-z])\\1+","\\1",x)
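Since the question starts from a data frame, the same call can be applied to a column directly. This is only a minimal sketch: the column names ID and WORD are taken from the question, while the data frame name `df` and the new column `WORD_SIMPLE` are hypothetical names chosen for illustration.

```r
# Hypothetical data frame mirroring the layout described in the question
df <- data.frame(
  ID = 1:2,
  WORD = c("AAAAABBBBB", "ABCAAABBBDDD"),
  stringsAsFactors = FALSE
)

# ([A-Za-z]) captures one letter; \\1+ matches one or more repeats of it.
# Each such run is replaced by the single captured letter.
df$WORD_SIMPLE <- gsub("([A-Za-z])\\1+", "\\1", df$WORD)

print(df)
```

gsub is vectorised over its input, so no explicit loop over the rows is needed.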
EDIT: By request, some benchmarking. I added the pattern from Matthew Lundberg's comment, which matches any character. It appears that gsub is faster than the sapply/rle approach by more than an order of magnitude, and that matching any character is somewhat faster than matching only letters.
library(microbenchmark)
set.seed(1)
##create sample dataset
x <- apply(
  replicate(100, sample(c(LETTERS[1:3], ""), 10, replace = TRUE)),
  2, paste0, collapse = "")
##benchmark
xm <- microbenchmark(
  SAPPLY      = sapply(strsplit(x, ''), function(x) paste0(rle(x)$values, collapse = '')),
  GSUB.LETTER = gsub("([A-Za-z])\\1+", "\\1", x),
  GSUB.ANY    = gsub("(.)\\1+", "\\1", x)
)
##print results
print(xm)
# Unit: milliseconds
#          expr       min        lq    median        uq       max
# 1    GSUB.ANY  1.433873  1.509215  1.562193  1.664664  3.324195
# 2 GSUB.LETTER  1.940916  2.059521  2.108831  2.227435  3.118152
# 3      SAPPLY 64.786782 67.519976 68.929285 71.164052 77.261952
##boxplot of times
boxplot(xm)
##plot with ggplot2
library(ggplot2)
qplot(y=time, data=xm, colour=expr) + scale_y_log10()
Another option splits each string into characters and collapses the runs with rle:
x <- c('AAAAABBBBB', 'ABCAAABBBDDD')
sapply(strsplit(x, ''), function(x) paste0(rle(x)$values, collapse=''))
## [1] "AB" "ABCABD"
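As a quick sanity check, the rle approach and the regex approach can be compared on the same input. This is a sketch only: the names `by_rle` and `by_regex` are hypothetical, and vapply is swapped in for sapply so that the result type is fixed as character.

```r
x <- c("AAAAABBBBB", "ABCAAABBBDDD")

# rle-based: split into single characters, keep one value per run, paste back
by_rle <- vapply(
  strsplit(x, ""),
  function(chars) paste0(rle(chars)$values, collapse = ""),
  character(1)
)

# regex-based: collapse any repeated character to a single occurrence
by_regex <- gsub("(.)\\1+", "\\1", x)

identical(by_rle, by_regex)
# TRUE
```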
Source: https://stackoverflow.com/questions/14159364/reduce-string-length-by-removing-contiguous-duplicates