I have written a function that finds the indices of subsequences in a long DNA sequence. It works when the longer DNA sequence is shorter than about 4,000 characters. However, when I
Rather than write your own function, why not use the function `words.pos` from the seqinr package? It seems to work even for strings of up to a million base pairs.
For example,
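    library(seqinr)
    data(ec999)  # example dataset shipped with seqinr
    # Collapse the first sequence into a single string
    myseq <- paste(ec999[[1]], collapse = "")
    # Repeat it 100 times to get a much longer test sequence
    myseq <- paste(rep(myseq, 100), collapse = "")
    # Positions at which "atat" occurs
    words.pos("atat", myseq)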
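If installing seqinr is not an option, a similar search can be sketched in base R. This is only an illustration, not how `words.pos` is implemented: `find_positions` and `long_seq` are names made up for the example, and the pattern is assumed to contain only plain letters (no regex metacharacters).

```r
# Hypothetical helper (not part of seqinr): start positions of every
# occurrence of `pattern` in `text`, including overlapping ones.
find_positions <- function(pattern, text) {
  # The lookahead (?=...) matches without consuming characters, so
  # overlapping occurrences are each reported.
  hits <- gregexpr(paste0("(?=", pattern, ")"), text, perl = TRUE)[[1]]
  if (hits[1] == -1L) integer(0) else as.integer(hits)
}

# Build a long test sequence in code (1,000,000 characters here).
long_seq <- paste(rep("acgtatatgc", 1e5), collapse = "")
pos <- find_positions("atat", long_seq)
head(pos)
```

Building the test string with `rep` and `paste`, rather than pasting a literal into the console, keeps the example independent of how long a line the console accepts.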
I can replicate nrussell's example, but this assigns correctly: `x <- paste0(rep("abcdef", 1000), collapse = "")`.
A potential workaround is to write the character string to a .txt file and read that file into R directly.
Here, test.txt contains a single 6,000-character string:
    test <- read.table('test.txt', stringsAsFactors = FALSE)
    length(class(test[1,1]))
    # [1] 1
    class(test[1,1])
    # [1] "character"
    nchar(test[1,1])
    # [1] 6000
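The same round trip can be sketched end to end with `writeLines` and `readLines`, which treat the file as plain lines of text rather than as a table. This is a minimal sketch under assumptions: the temporary file stands in for test.txt, and the 6,000-character string is generated in code rather than taken from the original question.

```r
# Hypothetical 6,000-character string, built in code rather than pasted.
x <- paste0(rep("abcdef", 1000), collapse = "")

# Write it to a temporary file (a stand-in for test.txt), then read it back.
path <- tempfile(fileext = ".txt")
writeLines(x, path)
test <- readLines(path)

class(test)         # type of the value read back
nchar(test)         # number of characters in the string
identical(test, x)  # whether the round trip preserved the string
unlink(path)        # remove the temporary file
```

`readLines` returns one character string per line of the file, so a sequence stored on a single line comes back as a length-one character vector.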