How can I split a word into bi-grams, including repeated ones?

Submitted by 爱⌒轻易说出口 on 2019-12-23 17:39:32

Question


I am trying to split a word into bi-grams. I am using the qlcMatrix package, but it returns only the distinct bi-grams. For example, for the word "detected", "te" appears only once in the result even though it occurs twice in the word. This is the command I use:

test_domain <- c("detected")
library("qlcMatrix", lib.loc="~/R/win-library/3.2")
bigram1 <- splitStrings(test_domain, sep = "", bigrams = TRUE, left.boundary = "", right.boundary = "")$bigrams

and this is the result I get:

bigram1
# [1] "ec" "ed" "de" "te" "ct" "et"

Answer 1:


Another way to do it with base R is to use mapply and substr:

nc <- nchar("detected")
mapply(function(x, y){substr("detected", x, y)}, x=1:(nc-1), y=2:nc)
# [1] "de" "et" "te" "ec" "ct" "te" "ed"
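
The same idea can probably be written more compactly, because base R's substring() is vectorized over its start and stop arguments. This is a minimal sketch rather than part of the original answer; the variable names word and bigrams are illustrative only.

```r
word <- "detected"
nc <- nchar(word)

# substring() accepts vectors of start and stop positions,
# so a single call returns every adjacent pair of characters,
# including repeated bi-grams such as "te".
bigrams <- substring(word, 1:(nc - 1), 2:nc)
bigrams
# [1] "de" "et" "te" "ec" "ct" "te" "ed"
```

Like the mapply version, this assumes the word has at least two characters.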



Answer 2:


You can do that without packages:

test_domain <- c("detected")
temp <- strsplit(test_domain ,'')[[1]]
sapply(1:(length(temp)-1), function(x){paste(temp[x:(x+1)], collapse='')})
# [1] "de" "et" "te" "ec" "ct" "te" "ed"
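
Note that 1:(length(temp)-1) counts down (1, 0) when the word has a single character, so the approach above misbehaves on very short input. One way to guard against that, and to count how often each bi-gram occurs, is sketched below. The helper name word_bigrams is hypothetical and not from the original answer.

```r
# Hypothetical helper: returns all bi-grams of a single word, repeats included.
word_bigrams <- function(word) {
  chars <- strsplit(word, "")[[1]]
  n <- length(chars)
  # Words shorter than two characters have no bi-grams.
  if (n < 2) return(character(0))
  # Pair each character with its right-hand neighbour.
  paste0(chars[-n], chars[-1])
}

bg <- word_bigrams("detected")
bg
# [1] "de" "et" "te" "ec" "ct" "te" "ed"

# Tabulate the bi-grams to see how often each one occurs.
counts <- table(bg)
counts[["te"]]
# [1] 2
```

Dropping the last and the first character and pasting the two vectors together avoids the index loop entirely, which is why the short-word case only needs a single length check.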


Source: https://stackoverflow.com/questions/34083585/how-can-i-split-a-word-into-bi-grams-including-repeated-ones
