determine frequency of string using grep [duplicate]

Posted by 拥有回忆 on 2020-01-14 09:07:40

Question


If I have a vector

x <- c("ajjss","acdjfkj","auyjyjjksjj")

and do:

y <- x[grep("jj",x)]
table(y)

I get:

y
      ajjss auyjyjjksjj 
          1           1 

However, the second string "auyjyjjksjj" contains the substring "jj" twice, so it should be counted twice. How can I change this from a true/false test into an actual count of how often "jj" occurs in each string?

It would also be great to calculate, for each string, the frequency of the substring divided by the string's length.

Thanks in advance.


Answer 1:


I solved this using gregexpr()

x <- c("ajjss","acdjfkj","auyjyjjksjj")
freq <- sapply(gregexpr("jj", x), function(m) if (m[[1]] != -1) length(m) else 0)
df <- data.frame(x, freq)

df
#            x freq
#1       ajjss    1
#2     acdjfkj    0
#3 auyjyjjksjj    2

For the last part of the question, frequency divided by string length:

df$rate <- df$freq / nchar(as.character(df$x))

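The as.character() call is needed in R versions before 4.0.0, where data.frame(x, freq) converts strings to factors unless you specify stringsAsFactors = FALSE. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so df$x is already a character vector and the conversion is harmless but unnecessary.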

df
#            x freq      rate
#1       ajjss    1 0.2000000
#2     acdjfkj    0 0.0000000
#3 auyjyjjksjj    2 0.1818182
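
As a variation on the approach above, here is a minimal sketch that keeps x as character by passing stringsAsFactors = FALSE explicitly, so it should behave the same on older and newer R versions. Using vapply and fixed = TRUE is an assumption of this sketch, not part of the original answer.

```r
x <- c("ajjss", "acdjfkj", "auyjyjjksjj")

# gregexpr() returns one vector of match positions per string (-1 means no match)
m <- gregexpr("jj", x, fixed = TRUE)

# Count the positions, treating the -1 "no match" marker as zero
freq <- vapply(m, function(pos) if (pos[1] == -1L) 0L else length(pos), integer(1))

# Keep x as character regardless of the R version's default
df <- data.frame(x, freq, stringsAsFactors = FALSE)

# Matches per character of each string
df$rate <- df$freq / nchar(df$x)
df
```

vapply is used here only because it fails loudly if the function ever returns something other than a single integer; sapply would work as well.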



Answer 2:


You're using the wrong tool. Try gregexpr, which will give you the positions where the search string was found (or -1 if not found):

> gregexpr("jj", x, fixed = TRUE)
[[1]]
[1] 2
attr(,"match.length")
[1] 2
attr(,"useBytes")
[1] TRUE

[[2]]
[1] -1
attr(,"match.length")
[1] -1
attr(,"useBytes")
[1] TRUE

[[3]]
[1]  6 10
attr(,"match.length")
[1] 2 2
attr(,"useBytes")
[1] TRUE
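
To turn those positions into per-string counts, one option is to extract the matches with regmatches() and count them with lengths(). This is a sketch rather than part of the original answer; lengths() assumes R 3.2.0 or later.

```r
x <- c("ajjss", "acdjfkj", "auyjyjjksjj")

# regmatches() returns the matched substrings for each element of x,
# or an empty character vector where nothing matched
matches <- regmatches(x, gregexpr("jj", x, fixed = TRUE))

# The number of matches per string is the length of each list element
counts <- lengths(matches)
counts
```

Because an unmatched string yields an empty vector, no special handling of the -1 marker is needed.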



Answer 3:


You can use the qdap package (it is not part of the base R installation):

x <- c("ajjss","acdjfkj","auyjyjjksjj")
library(qdap)
termco(x, seq_along(x), "jj")

## > termco(x, seq_along(x), "jj")
##   x word.count         jj
## 1 1          1 1(100.00%)
## 2 2          1          0
## 3 3          1 2(200.00%)

Note that the output shows both the frequency and the frequency relative to the word count (the result is actually a list, but it prints as a formatted table). To access the raw frequencies:

termco(x, seq_along(x), "jj")$raw

## > termco(x, seq_along(x), "jj")$raw
##   x word.count jj
## 1 1          1  1
## 2 2          1  0
## 3 3          1  2



Answer 4:


This simple one-liner in base R uses strsplit and then grepl. It is fairly robust, but it breaks when it has to count consecutive matches, for example jjjjjj as 3 lots of jj. Note that it returns the total across the whole vector rather than a per-string count. The pattern that makes this possible is from @JoshOBrien's excellent Q&A:

sum( grepl( "jj" , unlist(strsplit( x , "(?<=.)(?=jj)" , perl = TRUE) ) ) )



# Examples
f <- function(x) {
    sum(grepl("jj", unlist(strsplit(x, "(?<=.)(?=jj)", perl = TRUE))))
}

# 3 matches here
xOP <- c("ajjss", "acdjfkj", "auyjyjjksjj")
f(xOP)
# [1] 3

# 4 here
x1 <- c("ajjss", "acdjfkj", "jj", "auyjyjjksjj")
f(x1)
# [1] 4

# 8 here
x2 <- c("jjbjj", "ajjss", "acdjfkj", "jj", "auyjyjjksjj", "jjbjj")
f(x2)
# [1] 8

# Doesn't work yet with multiple jjjj matches. We want this to also be 8
x3 <- c("jjjj", "ajjss", "acdjfkj", "jj", "auyjyjjksjj", "jjbjj")
f(x3)
# [1] 7
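
For the consecutive case, a possible alternative is to total the non-overlapping matches that gregexpr finds, since it resumes searching after each match and so sees "jjjj" as two lots of "jj". The sketch below swaps in gregexpr for the strsplit approach; count_total is a hypothetical helper name used only for illustration.

```r
# Hypothetical helper: total number of non-overlapping fixed-string matches
# across all elements of x
count_total <- function(x, pattern = "jj") {
  m <- gregexpr(pattern, x, fixed = TRUE)
  # Positions are >= 1 for real matches and -1 when there is no match
  sum(vapply(m, function(pos) sum(pos > 0L), integer(1)))
}

xOP <- c("ajjss", "acdjfkj", "auyjyjjksjj")
x3 <- c("jjjj", "ajjss", "acdjfkj", "jj", "auyjyjjksjj", "jjbjj")

count_total(xOP)
count_total(x3)
```

With these inputs the totals should be 3 and 8, the latter being the value the x3 example above was aiming for.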


Source: https://stackoverflow.com/questions/15600760/determine-frequency-of-string-using-grep
