What algorithm is used for finding ngrams?
Supposing my input data is an array of words and the size of the n-grams I want to find, what algorithm should I use?
You can use the ngram package. One example of its usage is http://amunategui.github.io/speak-like-a-doctor/
For anyone still interested in this topic, there is already a package on CRAN.
ngram: An n-gram Babbler
This package offers utilities for creating, displaying, and "babbling" n-grams. The babbler is a simple Markov process.
http://cran.r-project.org/web/packages/ngram/index.html
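For a quick feel of the package, here is a minimal sketch (the toy string and n = 2 are just illustrative):
library(ngram)
text <- "a b a c a b b a"   # toy whitespace-separated input
ng <- ngram(text, n = 2)    # build the bigram model
get.phrasetable(ng)         # each bigram with its count and proportion
babble(ng, 10)              # generate 10 words from the Markov babbler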
Here's a simple Java answer:
int ngrams = 9; // up to 9-grams, since that's the length of "bonasuera"...
String string = "bonasuera";
// outer loop picks the n-gram size j; inner loop slides over every start index k
for (int j = 1; j <= ngrams; j++) {
    for (int k = 0; k < string.length() - j + 1; k++)
        System.out.print(string.substring(k, k + j) + " ");
    System.out.println();
}
Output:
b o n a s u e r a
bo on na as su ue er ra
bon ona nas asu sue uer era
bona onas nasu asue suer uera
bonas onasu nasue asuer suera
bonasu onasue nasuer asuera
bonasue onasuer nasuera
bonasuer onasuera
bonasuera
EDIT: Sorry, this is PHP. I wasn't quite sure what you wanted. I don't know it in Java, but perhaps the following could be converted easily enough.
Well, it depends on the size of the n-grams you want.
I've had quite a lot of success with single letters (especially accurate for language detection), which is easy to get with:
$letters = str_split(preg_replace('/[^a-z]/', '', strtolower($text))); // keep only a-z, split into single characters
$letters = array_count_values($letters); // frequency of each letter
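If you are working in R rather than PHP, a rough equivalent of that letter-count step might look like this (the sample text is just illustrative):
text <- "Some Sample Text!"
chars <- strsplit(gsub("[^a-z]", "", tolower(text)), "")[[1]]  # keep only a-z, split into single characters
table(chars)  # frequency of each letter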
Then there is the following function for calculating ngrams from a word:
function getNgrams($word, $n = 3) {
    $ngrams = array();
    $len = strlen($word);
    for ($i = 0; $i < $len; $i++) {
        // only start emitting once at least $n characters have been seen
        if ($i > ($n - 2)) {
            // build the n-gram ending at position $i
            $ng = '';
            for ($j = $n - 1; $j >= 0; $j--) {
                $ng .= $word[$i - $j];
            }
            $ngrams[] = $ng;
        }
    }
    return $ngrams;
}
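For example, getNgrams("bonasuera", 3) returns the trigrams bon, ona, nas, asu, sue, uer and era.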
The source of the above is here, which I recommend you read, and they have lots of functions to do exactly what you want.
If you want to use R to identify n-grams, you can use the tm package and the RWeka package. It will tell you how many times each n-gram occurs in your documents, like so:
library("RWeka")
library("tm")
data("crude")
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
inspect(tdm[340:345,1:10])
A term-document matrix (6 terms, 10 documents)
Non-/sparse entries: 4/56
Sparsity : 93%
Maximal term length: 13
Weighting : term frequency (tf)
Docs
Terms 127 144 191 194 211 236 237 242 246 248
and said 0 0 0 0 0 0 0 0 0 0
and security 0 0 0 0 0 0 0 0 1 0
and set 0 1 0 0 0 0 0 0 0 0
and six-month 0 0 0 0 0 0 0 1 0 0
and some 0 0 0 0 0 0 0 0 0 0
and stabilise 0 0 0 0 0 0 0 0 0 1
hat-tip: http://tm.r-forge.r-project.org/faq.html
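If you only want the bigrams above some overall frequency rather than the full matrix, tm's findFreqTerms() can be applied to the same tdm (the threshold of 10 here is arbitrary):
findFreqTerms(tdm, lowfreq = 10)  # bigrams occurring at least 10 times across the corpus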
Usually n-grams are calculated to find their frequency distribution, so yes, it does matter how many times the n-grams appear.
Also, decide whether you want character-level or word-level n-grams. I have written code for finding character-level n-grams from a CSV file in R, using the package 'tau' for that. You can find it here.
Also, here is the code I wrote:
library(tau)
temp <- read.csv("/home/aravi/Documents/sample/csv/ex.csv", header = FALSE, stringsAsFactors = FALSE)
# count all character n-grams up to length 4, splitting on whitespace and punctuation
r <- textcnt(temp, method = "ngram", n = 4L, split = "[[:space:][:punct:]]+", decreasing = TRUE)
a <- data.frame(counts = unclass(r), size = nchar(names(r)))  # counts plus each n-gram's length
b <- split(a, a$size)  # group the n-grams by length
b
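The final b splits the counts by string length, giving you one data frame per n-gram size.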
Cheers!