n-gram

ElasticSearch n-gram tokenfilter not finding partial words

Submitted by 那年仲夏 on 2019-12-20 20:05:12
Question: I have been playing around with ElasticSearch for a new project of mine. I have set the default analyzers to use the ngram token filter. This is my elasticsearch.yml file:

    index:
      analysis:
        analyzer:
          default_index:
            tokenizer: standard
            filter: [standard, stop, mynGram]
          default_search:
            tokenizer: standard
            filter: [standard, stop]
        filter:
          mynGram:
            type: nGram
            min_gram: 1
            max_gram: 10

I created a new index and added the following document to it:

    $ curl -XPUT http://localhost:9200/test/newtype/3 -d
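Below is a minimal sketch (not from the question) of expressing the same analysis chain as per-index settings through the REST API with Python's requests library; the index name test and filter name mynGram come from the question, while everything else is illustrative and assumes a local node on port 9200:

    import requests

    # Same analyzer/filter definitions as the elasticsearch.yml above,
    # supplied as settings when the index is created.
    settings = {
        "settings": {
            "analysis": {
                "filter": {
                    "mynGram": {"type": "nGram", "min_gram": 1, "max_gram": 10}
                },
                "analyzer": {
                    "default_index": {
                        "tokenizer": "standard",
                        "filter": ["standard", "stop", "mynGram"],
                    },
                    "default_search": {
                        "tokenizer": "standard",
                        "filter": ["standard", "stop"],
                    },
                },
            }
        }
    }

    resp = requests.put("http://localhost:9200/test", json=settings)
    print(resp.json())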

How to get n-gram collocations and association in python nltk?

Submitted by 你离开我真会死。 on 2019-12-20 15:35:56
Question: In this documentation, there are examples using nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, nltk.collocations.TrigramAssocMeasures(), and TrigramCollocationFinder. There is an example method that finds the nbest collocations based on PMI for bigrams and trigrams. Example:

    >>> finder = BigramCollocationFinder.from_words(
    ...     nltk.corpus.genesis.words('english-web.txt'))
    >>> finder.nbest(bigram_measures.pmi, 10)

I know that BigramCollocationFinder and TrigramCollocationFinder inherit from
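A minimal, self-contained sketch of that API for both bigrams and trigrams; it assumes the genesis corpus has been fetched once with nltk.download('genesis'):

    import nltk
    from nltk.collocations import (BigramAssocMeasures, BigramCollocationFinder,
                                   TrigramAssocMeasures, TrigramCollocationFinder)

    words = nltk.corpus.genesis.words('english-web.txt')

    bigram_measures = BigramAssocMeasures()
    trigram_measures = TrigramAssocMeasures()

    bi_finder = BigramCollocationFinder.from_words(words)
    print(bi_finder.nbest(bigram_measures.pmi, 10))         # top 10 bigrams by PMI
    print(bi_finder.score_ngrams(bigram_measures.pmi)[:5])  # (ngram, score) association pairs

    tri_finder = TrigramCollocationFinder.from_words(words)
    print(tri_finder.nbest(trigram_measures.pmi, 10))       # top 10 trigrams by PMI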

Frequency of ngrams (strings) in tokenized text

Submitted by て烟熏妆下的殇ゞ on 2019-12-20 03:45:06
Question: I have a set of unique n-grams (a list called ngramlist) and n-gram tokenized text (a list called ngrams). I want to construct a new vector, freqlist, where each element of freqlist is the fraction of ngrams that is equal to that element of ngramlist. I wrote the following code, which gives the correct output, but I wonder if there is a way to optimize it:

    freqlist = [
        sum(int(ngram == ngram_candidate)
            for ngram_candidate in ngrams) / len(ngrams)
        for ngram in ngramlist
    ]

I imagine there is a function
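One common optimization is collections.Counter, which tallies every n-gram in a single pass instead of rescanning ngrams once per entry of ngramlist; a minimal sketch, with toy data standing in for the question's lists:

    from collections import Counter

    # Toy stand-ins for the question's lists.
    ngrams = ["a_b", "b_c", "a_b", "c_d"]
    ngramlist = ["a_b", "b_c", "x_y"]

    # One pass over ngrams; each later lookup is O(1), and n-grams
    # that never occur simply count as 0.
    counts = Counter(ngrams)
    total = len(ngrams)
    freqlist = [counts[ngram] / total for ngram in ngramlist]
    print(freqlist)  # [0.5, 0.25, 0.0]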

Quicker way to detect n-grams in a string?

Submitted by 半世苍凉 on 2019-12-19 04:39:32
Question: I found this solution on SO to detect n-grams in a string (here: N-gram generation from a sentence):

    import java.util.*;

    public class Test {
        public static List<String> ngrams(int n, String str) {
            List<String> ngrams = new ArrayList<String>();
            String[] words = str.split(" ");
            for (int i = 0; i < words.length - n + 1; i++)
                ngrams.add(concat(words, i, i+n));
            return ngrams;
        }

        public static String concat(String[] words, int start, int end) {
            StringBuilder sb = new StringBuilder();
            for (int i =
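For comparison (in Python rather than the question's Java), a sketch of the same sliding-window idea written as a generator, so a search for one target n-gram can stop at the first match instead of materializing the whole list:

    def ngrams(words, n):
        # Lazily yield each window of n consecutive words.
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

    def contains_ngram(sentence, target):
        words = sentence.split(" ")
        n = len(target.split(" "))
        # any() short-circuits on the first hit.
        return any(gram == target for gram in ngrams(words, n))

    print(contains_ngram("this is a simple sentence", "a simple"))  # True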

Getting most likely documents of the query using phonetic filter in solr

Submitted by 妖精的绣舞 on 2019-12-18 07:15:20
Question: I am using Solr for spell checking / query correction. I have added solr.PhoneticFilterFactory and solr.NGramFilterFactory to a fieldType to perform spell checking. It is working fine, but the problem is that I am getting a large number of documents for the query. I need only the most likely words/documents, or in other words, the words/documents nearest to the query. Snippet of schema.xml:

    <fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
      <analyzer type=
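Since Solr already orders results by relevance score, one way to keep only the likeliest matches is to request the score field and cap rows; a minimal sketch using Python's requests library, where the core name mycore, the field, and the query term are all illustrative assumptions:

    import requests

    params = {
        "q": "textSpell:helo",  # misspelled input (illustrative)
        "fl": "id,score",       # return the relevance score with each doc
        "rows": 3,              # keep only the top 3 matches by score
        "wt": "json",
    }
    resp = requests.get("http://localhost:8983/solr/mycore/select", params=params)
    for doc in resp.json()["response"]["docs"]:
        print(doc["id"], doc["score"])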

Is there an alternate for the now removed module 'nltk.model.NGramModel'?

Submitted by  ̄綄美尐妖づ on 2019-12-18 03:34:27
Question: I've been trying to find an alternative for two straight days now, and couldn't find anything relevant. I'm basically trying to get a probabilistic score for a synthesized sentence (synthesized by replacing some words in an original sentence picked from the corpora). I tried Collocations, but the scores that I'm getting aren't very helpful. So I tried making use of the language model concept, only to find that the seemingly helpful module 'model' has been removed from NLTK because of
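Newer NLTK releases reintroduce language models under nltk.lm (available since NLTK 3.4), which can score words in context much like the removed module did. A minimal sketch of a bigram maximum-likelihood model; the toy corpus is illustrative:

    from nltk.lm import MLE
    from nltk.lm.preprocessing import padded_everygram_pipeline

    # Toy corpus: a list of tokenized sentences (illustrative).
    corpus = [["i", "love", "you"], ["you", "love", "me"]]

    n = 2
    train_data, vocab = padded_everygram_pipeline(n, corpus)
    lm = MLE(n)
    lm.fit(train_data, vocab)

    # Probability of "love" given the preceding word "i".
    print(lm.score("love", ["i"]))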

Fast/Optimize N-gram implementations in python

Submitted by 旧城冷巷雨未停 on 2019-12-17 06:50:10
Question: Which n-gram implementation is fastest in Python? I've tried to profile nltk's vs Scott's zip (http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/):

    from nltk.util import ngrams as nltkngram
    import this, time

    def zipngram(text, n=2):
        return zip(*[text.split()[i:] for i in range(n)])

    text = this.s

    start = time.time()
    nltkngram(text.split(), n=2)
    print time.time() - start

    start = time.time()
    zipngram(text, n=2)
    print time.time() - start

    [out]
    0.000213146209717
    6
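A sketch of a steadier comparison using timeit (Python 3 syntax), which repeats each call many times instead of trusting a single wall-clock reading; list(...) forces the generators so both versions do comparable work, and splitting once in zipngram avoids re-splitting the text n times:

    import this
    import timeit
    from nltk.util import ngrams as nltkngram

    def zipngram(text, n=2):
        words = text.split()  # split once, not once per shift
        return zip(*[words[i:] for i in range(n)])

    text = this.s  # the Zen of Python, as in the question

    print(timeit.timeit(lambda: list(nltkngram(text.split(), 2)), number=1000))
    print(timeit.timeit(lambda: list(zipngram(text, 2)), number=1000))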

Python: Reducing memory usage of dictionary

Submitted by 孤街醉人 on 2019-12-17 06:23:17
Question: I'm trying to load a couple of files into memory. The files have one of the following three formats:

    string TAB int
    string TAB float
    int TAB float

Indeed, they are n-gram statistics files, in case this helps with the solution. For instance:

    i_love TAB 10
    love_you TAB 12

Currently, the pseudocode of what I'm doing right now is:

    loadData(file):
        data = {}
        for line in file:
            first, second = line.split('\t')
            data[first] = int(second)  # or float(second)
        return data

Much to my surprise, while the total
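A stdlib-only sketch of one way to shrink the footprint: keep keys and values in two parallel lists sorted by key and binary-search with bisect, trading O(1) dict lookups for O(log n) ones with far less per-entry overhead (this assumes lookups only start after loading finishes):

    import bisect

    def load_data(path):
        keys, values = [], []
        with open(path) as f:
            for line in f:
                first, second = line.rstrip("\n").split("\t")
                keys.append(first)
                values.append(int(second))  # or float(second)
        # Sort both lists together by key so bisect can search them.
        order = sorted(range(len(keys)), key=keys.__getitem__)
        return [keys[i] for i in order], [values[i] for i in order]

    def lookup(keys, values, key):
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return values[i]
        raise KeyError(key)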

Selecting Random Item from List given probability of each item

Submitted by 与世无争的帅哥 on 2019-12-12 18:13:57
Question: Sorry about the badly phrased title... I have an object called NGram:

    class NGram
    {
        // other properties
        double Probability { get; set; }  // value between 0 and 1
    }

Now suppose I have a list of these objects such that...

    List<NGram> grams = GetNGrams();
    Debug.Assert(grams.Sum(x => x.Probability) == 1);

How can I select a random item from this list while factoring in the probability distribution? For instance, suppose grams[0].Probability == 0.5; then there should be a 50% chance of selecting grams[0]
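In Python rather than the question's C#, a sketch of the standard cumulative-sum technique, plus the stdlib shortcut random.choices (Python 3.6+); the items and weights are illustrative and the manual version assumes the probabilities sum to 1:

    import bisect
    import itertools
    import random

    items = ["a", "b", "c"]
    probs = [0.5, 0.3, 0.2]

    def weighted_pick(items, probs):
        # Cumulative sums [0.5, 0.8, 1.0]; a uniform draw in [0, 1)
        # lands in exactly one bucket, found by binary search.
        cumulative = list(itertools.accumulate(probs))
        r = random.random() * cumulative[-1]
        return items[bisect.bisect_right(cumulative, r)]

    print(weighted_pick(items, probs))
    # Stdlib equivalent:
    print(random.choices(items, weights=probs, k=1)[0])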