n-gram

ElasticSearch n-gram tokenfilter not finding partial words

Submitted by 那年仲夏 on 2019-12-20 20:05:12
Question: I have been playing around with ElasticSearch for a new project of mine. I have set the default analyzers to use the ngram token filter. This is my elasticsearch.yml file:

    index:
      analysis:
        analyzer:
          default_index:
            tokenizer: standard
            filter: [standard, stop, mynGram]
          default_search:
            tokenizer: standard
            filter: [standard, stop]
        filter:
          mynGram:
            type: nGram
            min_gram: 1
            max_gram: 10

I created a new index and added the following document to it:

    $ curl -XPUT http://localhost:9200/test/newtype/3 -d
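Below is a minimal sketch (not from the question) of expressing the same analysis chain as per-index settings through the REST API with Python's requests library; the index name test and filter name mynGram come from the question, while everything else is illustrative and assumes a local node on port 9200:

    import requests

    # Same analyzer/filter definitions as the elasticsearch.yml above,
    # supplied as settings when the index is created.
    settings = {
        "settings": {
            "analysis": {
                "filter": {
                    "mynGram": {"type": "nGram", "min_gram": 1, "max_gram": 10}
                },
                "analyzer": {
                    "default_index": {
                        "tokenizer": "standard",
                        "filter": ["standard", "stop", "mynGram"],
                    },
                    "default_search": {
                        "tokenizer": "standard",
                        "filter": ["standard", "stop"],
                    },
                },
            }
        }
    }

    resp = requests.put("http://localhost:9200/test", json=settings)
    print(resp.json())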

How to get n-gram collocations and association in python nltk?

Submitted by 你离开我真会死。 on 2019-12-20 15:35:56
Question: In this documentation, there are examples using nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, nltk.collocations.TrigramAssocMeasures(), and TrigramCollocationFinder. There is an example method that finds the nbest collocations based on PMI for bigrams and trigrams. Example:

    >>> finder = BigramCollocationFinder.from_words(
    ...     nltk.corpus.genesis.words('english-web.txt'))
    >>> finder.nbest(bigram_measures.pmi, 10)

I know that BigramCollocationFinder and TrigramCollocationFinder inherit from
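A minimal, self-contained sketch of that API for both bigrams and trigrams; it assumes the genesis corpus has been fetched once with nltk.download('genesis'):

    import nltk
    from nltk.collocations import (BigramAssocMeasures, BigramCollocationFinder,
                                   TrigramAssocMeasures, TrigramCollocationFinder)

    words = nltk.corpus.genesis.words('english-web.txt')

    bigram_measures = BigramAssocMeasures()
    trigram_measures = TrigramAssocMeasures()

    bi_finder = BigramCollocationFinder.from_words(words)
    print(bi_finder.nbest(bigram_measures.pmi, 10))         # top 10 bigrams by PMI
    print(bi_finder.score_ngrams(bigram_measures.pmi)[:5])  # (ngram, score) association pairs

    tri_finder = TrigramCollocationFinder.from_words(words)
    print(tri_finder.nbest(trigram_measures.pmi, 10))       # top 10 trigrams by PMI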

Frequency of ngrams (strings) in tokenized text

Submitted by て烟熏妆下的殇ゞ on 2019-12-20 03:45:06
Question: I have a set of unique n-grams (a list called ngramlist) and n-gram tokenized text (a list called ngrams). I want to construct a new vector, freqlist, where each element of freqlist is the fraction of ngrams that is equal to that element of ngramlist. I wrote the following code, which gives the correct output, but I wonder if there is a way to optimize it:

    freqlist = [
        sum(int(ngram == ngram_candidate)
            for ngram_candidate in ngrams) / len(ngrams)
        for ngram in ngramlist
    ]

I imagine there is a function
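One common optimization is collections.Counter, which tallies every n-gram in a single pass instead of rescanning ngrams once per entry of ngramlist; a minimal sketch, with toy data standing in for the question's lists:

    from collections import Counter

    # Toy stand-ins for the question's lists.
    ngrams = ["a_b", "b_c", "a_b", "c_d"]
    ngramlist = ["a_b", "b_c", "x_y"]

    # One pass over ngrams; each later lookup is O(1), and n-grams
    # that never occur simply count as 0.
    counts = Counter(ngrams)
    total = len(ngrams)
    freqlist = [counts[ngram] / total for ngram in ngramlist]
    print(freqlist)  # [0.5, 0.25, 0.0]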

Quicker way to detect n-grams in a string?

Submitted by 半世苍凉 on 2019-12-19 04:39:32
Question: I found this solution on SO to detect n-grams in a string (here: N-gram generation from a sentence):

    import java.util.*;

    public class Test {
        public static List<String> ngrams(int n, String str) {
            List<String> ngrams = new ArrayList<String>();
            String[] words = str.split(" ");
            for (int i = 0; i < words.length - n + 1; i++)
                ngrams.add(concat(words, i, i+n));
            return ngrams;
        }

        public static String concat(String[] words, int start, int end) {
            StringBuilder sb = new StringBuilder();
            for (int i =
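For comparison (in Python rather than the question's Java), a sketch of the same sliding-window idea written as a generator, so a search for one target n-gram can stop at the first match instead of materializing the whole list:

    def ngrams(words, n):
        # Lazily yield each window of n consecutive words.
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

    def contains_ngram(sentence, target):
        words = sentence.split(" ")
        n = len(target.split(" "))
        # any() short-circuits on the first hit.
        return any(gram == target for gram in ngrams(words, n))

    print(contains_ngram("this is a simple sentence", "a simple"))  # True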

Getting most likely documents of the query using phonetic filter in solr

Submitted by 妖精的绣舞 on 2019-12-18 07:15:20
Question: I am using Solr for spell checking / query correction. I have added solr.PhoneticFilterFactory and solr.NGramFilterFactory to a fieldType to perform spell checking. It is working fine, but the problem is that I am getting a large number of documents for the query. I need only the most likely words/documents, or in other words, the words/documents nearest to the query. Snippet of schema.xml:

    <fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
      <analyzer type=
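Since Solr already orders results by relevance score, one way to keep only the likeliest matches is to request the score field and cap rows; a minimal sketch using Python's requests library, where the core name mycore, the field, and the query term are all illustrative assumptions:

    import requests

    params = {
        "q": "textSpell:helo",  # misspelled input (illustrative)
        "fl": "id,score",       # return the relevance score with each doc
        "rows": 3,              # keep only the top 3 matches by score
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/mycore/select", params=params)
    for doc in resp.json()["response"]["docs"]:
        print(doc["id"], doc["score"])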

Is there an alternate for the now removed module 'nltk.model.NGramModel'?

Submitted by  ̄綄美尐妖づ on 2019-12-18 03:34:27
Question: I've been trying to find an alternative for two straight days now, and couldn't find anything relevant. I'm basically trying to get a probabilistic score for a synthesized sentence (synthesized by replacing some words in an original sentence picked from the corpora). I tried Collocations, but the scores that I'm getting aren't very helpful. So I tried making use of the language model concept, only to find that the seemingly helpful module 'model' has been removed from NLTK because of
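Newer NLTK releases reintroduce language models under nltk.lm (available since NLTK 3.4), which can score words in context much like the removed module did. A minimal sketch of a bigram maximum-likelihood model; the toy corpus is illustrative:

    from nltk.lm import MLE
    from nltk.lm.preprocessing import padded_everygram_pipeline

    # Toy corpus: a list of tokenized sentences (illustrative).
    corpus = [["i", "love", "you"], ["you", "love", "me"]]

    n = 2
    train_data, vocab = padded_everygram_pipeline(n, corpus)
    lm = MLE(n)
    lm.fit(train_data, vocab)

    # Probability of "love" given the preceding word "i".
    print(lm.score("love", ["i"]))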

Fast/Optimize N-gram implementations in python

Submitted by 旧城冷巷雨未停 on 2019-12-17 06:50:10
Question: Which n-gram implementation is fastest in Python? I've tried to profile nltk's vs Scott's zip (http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/):

    from nltk.util import ngrams as nltkngram
    import this, time

    def zipngram(text, n=2):
        return zip(*[text.split()[i:] for i in range(n)])

    text = this.s

    start = time.time()
    nltkngram(text.split(), n=2)
    print time.time() - start

    start = time.time()
    zipngram(text, n=2)
    print time.time() - start

    [out]
    0.000213146209717
    6
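A sketch of a steadier comparison using timeit (Python 3 syntax), which repeats each call many times instead of trusting a single wall-clock reading; list(...) forces the generators so both versions do comparable work, and splitting once in zipngram avoids re-splitting the text n times:

    import this
    import timeit
    from nltk.util import ngrams as nltkngram

    def zipngram(text, n=2):
        words = text.split()  # split once, not once per shift
        return zip(*[words[i:] for i in range(n)])

    text = this.s  # the Zen of Python, as in the question

    print(timeit.timeit(lambda: list(nltkngram(text.split(), 2)), number=1000))
    print(timeit.timeit(lambda: list(zipngram(text, 2)), number=1000))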

Python: Reducing memory usage of dictionary

Submitted by 孤街醉人 on 2019-12-17 06:23:17
Question: I'm trying to load a couple of files into memory. The files have one of the following three formats:

    string TAB int
    string TAB float
    int TAB float

Indeed, they are n-gram statistics files, in case this helps with the solution. For instance:

    i_love TAB 10
    love_you TAB 12

Currently, the pseudocode of what I'm doing right now is:

    loadData(file):
        data = {}
        for line in file:
            first, second = line.split('\t')
            data[first] = int(second)  # or float(second)
        return data

Much to my surprise, while the total
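A stdlib-only sketch of one way to shrink the footprint: keep keys and values in two parallel lists sorted by key and binary-search with bisect, trading O(1) dict lookups for O(log n) ones with far less per-entry overhead (this assumes lookups only start after loading finishes):

    import bisect

    def load_data(path):
        keys, values = [], []
        with open(path) as f:
            for line in f:
                first, second = line.rstrip("\n").split("\t")
                keys.append(first)
                values.append(int(second))  # or float(second)
        # Sort both lists together by key so bisect can search them.
        order = sorted(range(len(keys)), key=keys.__getitem__)
        return [keys[i] for i in order], [values[i] for i in order]

    def lookup(keys, values, key):
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return values[i]
        raise KeyError(key)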

Selecting Random Item from List given probability of each item

Submitted by 与世无争的帅哥 on 2019-12-12 18:13:57
Question: Sorry about the badly phrased title... I have an object called NGram:

    class NGram
    {
        // other properties
        double Probability { get; set; }  // value between 0 and 1
    }

Now suppose I have a list of these objects such that...

    List<NGram> grams = GetNGrams();
    Debug.Assert(grams.Sum(x => x.Probability) == 1);

How can I select a random item from this list while factoring in the probability distribution? For instance, suppose grams[0].Probability == 0.5; then there should be a 50% chance of selecting grams[0]
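In Python rather than the question's C#, a sketch of the standard cumulative-sum technique, plus the stdlib shortcut random.choices (Python 3.6+); the items and weights are illustrative and the manual version assumes the probabilities sum to 1:

    import bisect
    import itertools
    import random

    items = ["a", "b", "c"]
    probs = [0.5, 0.3, 0.2]

    def weighted_pick(items, probs):
        # Cumulative sums [0.5, 0.8, 1.0]; a uniform draw in [0, 1)
        # lands in exactly one bucket, found by binary search.
        cumulative = list(itertools.accumulate(probs))
        r = random.random() * cumulative[-1]
        return items[bisect.bisect_right(cumulative, r)]

    print(weighted_pick(items, probs))
    # Stdlib equivalent:
    print(random.choices(items, weights=probs, k=1)[0])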