What algorithm is used for finding n-grams?
Supposing my input data is an array of words and the size of the n-grams I want to find, what algorithm should I use?
Usually n-grams are computed to find their frequency distribution, so yes, it does matter how many times each n-gram appears.
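If you just want the algorithm itself, a sliding window over the word array is the standard approach: take every run of n consecutive words and tally the runs. Here is a minimal base-R sketch (the function name and sample input are illustrative, not from any package):

word_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  # one n-gram starts at each position 1 .. length(words) - n + 1
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

words <- c("the", "quick", "brown", "fox", "jumps", "over", "the", "quick", "fox")
sort(table(word_ngrams(words, 2)), decreasing = TRUE)  # bigram frequencies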
You should also decide whether you want character-level or word-level n-grams. I have written code for finding character-level n-grams from a CSV file in R; I used the package 'tau' for that, which you can find here.
Here is the code I wrote:
library(tau)

# read the file as plain text; keep strings as characters, not factors
temp <- read.csv("/home/aravi/Documents/sample/csv/ex.csv", header = FALSE, stringsAsFactors = FALSE)
# count character n-grams of length up to 4, splitting on whitespace and punctuation, most frequent first
r <- textcnt(temp, method = "ngram", n = 4L, split = "[[:space:][:punct:]]+", decreasing = TRUE)
# pair each n-gram's count with its length, then group the n-grams by size
a <- data.frame(counts = unclass(r), size = nchar(names(r)))
b <- split(a, a$size)
b
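The same package also handles the word-level case: with method = "string", textcnt counts word n-grams instead of character n-grams. A short sketch, with an illustrative input string:

library(tau)
txt <- "the quick brown fox jumps over the quick fox"
# word bigrams, split on whitespace/punctuation, most frequent first
textcnt(txt, method = "string", n = 2L, split = "[[:space:][:punct:]]+", decreasing = TRUE)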
Cheers!