What algorithm do I need to find n-grams?

無奈伤痛 2020-12-04 16:56

What algorithm is used for finding n-grams?

Supposing my input data is an array of words and the size of the n-grams I want to find, what algorithm should I use?

7 Answers
  • 2020-12-04 17:33

    You can use the ngram package for R. One example of its usage is http://amunategui.github.io/speak-like-a-doctor/

  • 2020-12-04 17:38

    For anyone still interested in this topic, there is a package on CRAN already.

    ngram: An n-gram Babbler

    This package offers utilities for creating, displaying, and "babbling" n-grams. The babbler is a simple Markov process.

    http://cran.r-project.org/web/packages/ngram/index.html
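
    As a rough illustration of what "babbling" as a simple Markov process means (a toy sketch of the idea, not the package's actual implementation), a bigram-based babbler could look like this in Java:

        import java.util.*;

        // Toy bigram Markov babbler: record which words follow each word,
        // then walk the chain, picking a random observed successor each step.
        public class Babbler {
            public static String babble(String[] words, int length, Random rng) {
                Map<String, List<String>> next = new HashMap<>();
                for (int i = 0; i < words.length - 1; i++)
                    next.computeIfAbsent(words[i], k -> new ArrayList<>()).add(words[i + 1]);

                StringBuilder out = new StringBuilder();
                String current = words[rng.nextInt(words.length)];
                for (int i = 0; i < length && current != null; i++) {
                    out.append(current).append(' ');
                    List<String> choices = next.get(current);
                    current = (choices == null) ? null : choices.get(rng.nextInt(choices.size()));
                }
                return out.toString().trim();
            }
        }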

  • 2020-12-04 17:42

    Simple. Here's the Java answer:

    // Enumerate all character n-grams of the string, from 1-grams up to
    // 9-grams (9 is the length of "bonasuera").
    int ngrams = 9;
    String string = "bonasuera";
    for (int j = 1; j <= ngrams; j++) {
        // Slide a window of j characters across the string.
        for (int k = 0; k < string.length() - j + 1; k++)
            System.out.print(string.substring(k, k + j) + " ");
        System.out.println();
    }
    

    Output:

    b o n a s u e r a 
    bo on na as su ue er ra 
    bon ona nas asu sue uer era 
    bona onas nasu asue suer uera 
    bonas onasu nasue asuer suera 
    bonasu onasue nasuer asuera 
    bonasue onasuer nasuera 
    bonasuer onasuera 
    bonasuera 
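
    The question asks for n-grams over an array of words rather than characters; the same sliding-window idea applies directly. A minimal word-level variant (a sketch, not part of the original answer):

        import java.util.Arrays;

        // Slide a window of n words across the array and join each window
        // with spaces to form one n-gram per position.
        String[] words = {"the", "quick", "brown", "fox", "jumps"};
        int n = 2; // bigrams
        for (int k = 0; k <= words.length - n; k++)
            System.out.println(String.join(" ", Arrays.copyOfRange(words, k, k + n)));
        // prints: the quick / quick brown / brown fox / fox jumps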
    
  • 2020-12-04 17:44

    EDIT: Sorry, this is PHP. I wasn't quite sure what you wanted. I don't know Java, but perhaps the following could be converted easily enough.

    Well, it depends on the size of the n-grams you want.

    I've had quite a lot of success with single letters (especially accurate for language detection), which are easy to get with:

    // Strip non-letters, lowercase, and split into single characters...
    $letters = str_split(preg_replace('/[^a-z]/', '', strtolower($text)));
    // ...then count how many times each letter occurs.
    $letters = array_count_values($letters);
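
    For reference, a rough Java equivalent of that letter-counting step (an untested conversion, with a hypothetical input string) might be:

        import java.util.HashMap;
        import java.util.Map;

        // Lowercase the text, drop everything except a-z, then tally each letter.
        String text = "Some sample text"; // hypothetical input
        Map<Character, Integer> letters = new HashMap<>();
        for (char c : text.toLowerCase().replaceAll("[^a-z]", "").toCharArray())
            letters.merge(c, 1, Integer::sum);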
    

    Then there is the following function for calculating ngrams from a word:

    // Collect all n-grams (default: trigrams) of a single word.
    function getNgrams($word, $n = 3) {
        $ngrams = array();
        $len = strlen($word);
        for ($i = 0; $i < $len; $i++) {
            // Once at least $n characters have been seen, emit the n-gram
            // ending at position $i, built character by character.
            if ($i > ($n - 2)) {
                $ng = '';
                for ($j = $n - 1; $j >= 0; $j--) {
                    $ng .= $word[$i - $j];
                }
                $ngrams[] = $ng;
            }
        }
        return $ngrams;
    }
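
    As mentioned above, this could probably be converted to Java easily enough; one possible conversion (an untested sketch that uses substring instead of building each n-gram character by character):

        import java.util.ArrayList;
        import java.util.List;

        // Java port of getNgrams: every substring of length n is an n-gram.
        static List<String> getNgrams(String word, int n) {
            List<String> ngrams = new ArrayList<>();
            for (int i = 0; i + n <= word.length(); i++)
                ngrams.add(word.substring(i, i + n));
            return ngrams;
        }
        // getNgrams("abcde", 3) -> [abc, bcd, cde]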
    

    The source of the above is here; I recommend reading it, as they have lots of functions to do exactly what you want.

  • 2020-12-04 17:46

    If you want to use R to identify n-grams, you can use the tm package and the RWeka package. They will tell you how many times an n-gram occurs in your documents, like so:

      library("RWeka")
      library("tm")
    
      data("crude")
    
      BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
      tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
    
      inspect(tdm[340:345,1:10])
    
    A term-document matrix (6 terms, 10 documents)
    
    Non-/sparse entries: 4/56
    Sparsity           : 93%
    Maximal term length: 13 
    Weighting          : term frequency (tf)
    
                   Docs
    Terms           127 144 191 194 211 236 237 242 246 248
      and said        0   0   0   0   0   0   0   0   0   0
      and security    0   0   0   0   0   0   0   0   1   0
      and set         0   1   0   0   0   0   0   0   0   0
      and six-month   0   0   0   0   0   0   0   1   0   0
      and some        0   0   0   0   0   0   0   0   0   0
      and stabilise   0   0   0   0   0   0   0   0   0   1
    

    hat-tip: http://tm.r-forge.r-project.org/faq.html

  • 2020-12-04 17:50

    Usually n-grams are calculated to find their frequency distribution. So yes, it does matter how many times the n-grams appear.
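
    To make that concrete, counting word-level n-gram frequencies can be as simple as this Java sketch (an illustration, not taken from this answer):

        import java.util.*;

        // Count how often each word-level n-gram occurs in an array of words.
        static Map<String, Integer> ngramCounts(String[] words, int n) {
            Map<String, Integer> counts = new HashMap<>();
            for (int i = 0; i + n <= words.length; i++) {
                String gram = String.join(" ", Arrays.copyOfRange(words, i, i + n));
                counts.merge(gram, 1, Integer::sum); // increment this n-gram's count
            }
            return counts;
        }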

    Also, decide whether you want character-level or word-level n-grams. I have written code for finding character-level n-grams from a CSV file in R, using the 'tau' package. You can find it here.

    Here is the code I wrote:

    library(tau)

    # Read raw text from a CSV file, then count character n-grams up to
    # length 4, splitting on whitespace/punctuation, sorted by frequency.
    temp <- read.csv("/home/aravi/Documents/sample/csv/ex.csv", header = FALSE, stringsAsFactors = FALSE)
    r <- textcnt(temp, method = "ngram", n = 4L, split = "[[:space:][:punct:]]+", decreasing = TRUE)

    # Put the counts in a data frame and group the n-grams by their length.
    a <- data.frame(counts = unclass(r), size = nchar(names(r)))
    b <- split(a, a$size)
    b
    

    Cheers!
