n-gram

ngram representation and distance matrix in R

Submitted by 走远了吗 on 2019-12-11 11:04:25
Question: Assume that we have this data: a <- c("ham","bamm","comb"). For 1-grams, this is the matrix representation of the above list:

# h a m b c o
# 1 1 1 0 0 0
# 0 1 2 1 0 0
# 0 0 1 1 1 1

I know that table(strsplit(a, split = "")[[i]]) for i in 1:length(a) will give the separate counts for each of them, but I don't know how to use rbind to combine them into one matrix, since the lengths and column names differ. After that, I want to use either Euclidean or Manhattan distance to find the similarity matrix for
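The question asks for R, but the underlying approach is language-agnostic: align every word's character counts to the union of all characters, then take pairwise distances between the rows. A minimal Python sketch of that idea (not the R/rbind answer itself; the column order here is simply alphabetical):

from collections import Counter
from math import sqrt

words = ["ham", "bamm", "comb"]

# Shared column set: the union of all characters that occur in any word.
alphabet = sorted(set("".join(words)))

# One row of character counts per word, aligned to the shared columns.
counts = [[Counter(w)[ch] for ch in alphabet] for w in words]

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Pairwise distance matrix over the three words (Manhattan shown;
# substitute euclidean for the L2 version).
dist = [[manhattan(u, v) for v in counts] for u in counts]
print(alphabet)
print(counts)
print(dist)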

Solr NGramTokenizerFactory and PatternReplaceCharFilterFactory - Analyzer results inconsistent with Query Results

Submitted by 大兔子大兔子 on 2019-12-11 07:59:17
Question: I am currently using what I (mistakenly) thought would be a fairly straightforward implementation of Solr's NGramTokenizerFactory, but I am getting strange results that are inconsistent between the admin analyzer and the actual query results, and I am hoping for some guidance. I am trying to get user input to match my NGram (minGramSize=2, maxGramSize=2) index. My schema for index and query time is below; in it I strip all non-alphanumeric characters using PatternReplaceCharFilter. I
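Not a fix for the Solr configuration itself, but when the admin analyzer and the query results disagree it can help to reproduce the intended chain outside Solr and compare the output against what the index actually holds. A rough Python approximation of the chain described above (strip non-alphanumerics, then emit 2-grams); this is an illustration, not Solr's actual tokenizer code:

import re

def two_grams(text):
    # Roughly mimic the analysis chain: strip non-alphanumeric characters
    # (as PatternReplaceCharFilter is configured to do here), lowercase,
    # then emit every 2-character gram, as an NGram tokenizer with
    # minGramSize=2 and maxGramSize=2 would.
    cleaned = re.sub(r"[^A-Za-z0-9]", "", text).lower()
    return [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]

print(two_grams("Foo-Bar 42"))   # ['fo', 'oo', 'ob', 'ba', 'ar', 'r4', '42']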

Detecting foreign words

Submitted by ℡╲_俬逩灬 on 2019-12-10 19:39:16
Question: I am writing a script to detect words from a language B in a language A. The two languages are very similar and may contain instances of the same words. The code is here if you are interested in what I have so far: https://github.com/arashsa/language-detection.git I will explain my method here: I create a list of bigrams in language B and a list of bigrams in language A (a small corpus for language B, a large corpus for language A). Then I remove all bigrams that are common to both. Then I go through the text in
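A minimal sketch of the bigram-difference idea as described above, assuming character bigrams and plain word lists for the two corpora (names like corpus_b and corpus_a are placeholders, not the repository's actual code):

def char_bigrams(word):
    # All adjacent two-character slices of a word.
    return {word[i:i + 2] for i in range(len(word) - 1)}

def corpus_bigrams(words):
    grams = set()
    for w in words:
        grams |= char_bigrams(w)
    return grams

def distinctive_bigrams(corpus_b, corpus_a):
    # Bigrams attested in language B but never seen in language A.
    return corpus_bigrams(corpus_b) - corpus_bigrams(corpus_a)

def looks_like_b(word, b_only_bigrams):
    # Flag a word from the text in A if it contains any B-only bigram.
    return bool(char_bigrams(word) & b_only_bigrams)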

R and tm package: create a term-document matrix with a dictionary of one or two words?

Submitted by 谁说我不能喝 on 2019-12-09 07:01:59
Question: Purpose: I want to create a term-document matrix using a dictionary which has compound words, or bigrams, as some of the keywords.

Web search: Being new to text mining and the tm package in R, I went to the web to figure out how to do this. Below are some relevant links that I found:

FAQs on the tm-package website
finding 2 & 3 word phrases using r tm package
counter ngram with tm package in r
findassocs for multiple terms in r

Background: Of these, I preferred the solution that uses
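The question is about R's tm package; purely to illustrate the shape of the intended result (one row per dictionary term, including two-word terms, one column per document), here is a small Python sketch with made-up documents and keywords:

docs = [
    "water quality in the river has improved",
    "the river bank erosion continues",
]
dictionary = ["water quality", "river", "erosion"]   # one- and two-word keywords

def count_term(term, doc):
    tokens = doc.lower().split()
    words = term.split()
    n = len(words)
    # Count the token windows of length n that match the (possibly compound) term.
    return sum(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

# Term-document matrix: one row per dictionary entry, one column per document.
tdm = [[count_term(term, doc) for doc in docs] for term in dictionary]
print(tdm)   # [[1, 0], [1, 1], [0, 1]]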

counting n-gram frequency in python nltk

Submitted by 你离开我真会死。 on 2019-12-08 22:46:44
Question: I have the following code. I know that I can use the apply_freq_filter function to filter out collocations below a frequency count. However, I don't know how to get the frequencies of all the n-gram tuples (in my case bigrams) in a document before I decide what frequency to set for filtering. As you can see, I am using the nltk collocations class.

import nltk
from nltk.collocations import *
line = ""
open_file = open('a_text_file','r')
for val in open_file:
    line += val
tokens = line
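One way to inspect the raw bigram counts before choosing a threshold is the frequency distribution the collocation finder builds internally (ngram_fd). A sketch along those lines; the tokenizer call is an assumption, since the question's own tokenization is cut off above:

import nltk
from nltk.collocations import BigramCollocationFinder

with open('a_text_file', 'r') as fh:
    tokens = nltk.wordpunct_tokenize(fh.read())

finder = BigramCollocationFinder.from_words(tokens)

# finder.ngram_fd is a FreqDist keyed by bigram tuples; looking at it
# shows the counts before any apply_freq_filter call.
for bigram, count in finder.ngram_fd.most_common(20):
    print(bigram, count)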

n-grams from text in python

Submitted by 浪子不回头ぞ on 2019-12-07 20:40:24
Question: An update to my previous post, with some changes: Say that I have 100 tweets. In those tweets, I need to extract: 1) food names, and 2) beverage names. I also need to attach the type (drink or food) and an id number (each item has a unique id) to each extraction. I already have a lexicon with names, types and id numbers:

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id'
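A minimal sketch of the lookup step, assuming whitespace tokenization and that two-word matches should take precedence over one-word matches contained in them; the banana id below is a placeholder, since it is cut off above:

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola':      {'type': 'drink', 'id': 'd_345'},
    'banana':    {'type': 'food',  'id': 'f_000'},  # placeholder id
}

def extract_items(tweet, lexicon):
    tokens = tweet.lower().split()
    found, used = [], set()
    # Try word 2-grams before 1-grams so 'coca cola' is preferred to 'cola'.
    for n in (2, 1):
        for i in range(len(tokens) - n + 1):
            span = set(range(i, i + n))
            if span & used:
                continue  # tokens already covered by a longer match
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                entry = lexicon[phrase]
                found.append({'name': phrase, 'type': entry['type'], 'id': entry['id']})
                used |= span
    return found

print(extract_items("Had a coca cola and a banana for lunch", lexicon))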

From n-gram Chinese text error correction to dependency-tree Chinese grammar correction and synonym finding

Submitted by 有些话、适合烂在心里 on 2019-12-07 20:00:23
Preface: This article briefly explains how to use an n-gram model combined with Chinese pinyin to correct Chinese typos, then introduces the application of minimum edit distance to correcting Chinese search queries. Finally, starting from dependency trees, it explains how to do long-distance text correction (grammar correction), and draws one insight from that method: using the properties of dependency trees together with the ESA algorithm to find synonyms.

The n-gram model: In the Chinese typo-detection setting, we can judge whether a sentence is well formed by computing its probability. Assume a sentence S = {w1, w2, ..., wn}; the problem can then be converted into the following form: P(S) is called the language model, that is, the model used to compute the probability that a sentence is well formed. However, using that formula raises many problems: the parameter space is too large and the information matrix is severely sparse. This is where the n-gram model comes in. It rests on the Markov assumption that the probability of a word depends only on the previous word or the previous few words, so that (1) a word depends only on the 1 preceding word, i.e. the bigram (2-gram); (2) a word depends only on the 2 preceding words, i.e. the trigram (3-gram). The larger the n of the n-gram, the stronger the constraint on the next word, because more information is provided, but the model also becomes more complex and brings more problems, so bigrams or trigrams are generally used. A simple example follows to illustrate the concrete use of n-grams. The n-gram model builds the language model by computing the maximum likelihood estimate (MLE), which is the best estimate given the training data; for the bigram, the formula is as follows:
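The formulas referred to above were embedded as images in the source page and did not survive extraction; the following is a reconstruction from the standard n-gram definitions the text describes, not the article's original rendering:

% Chain rule for the sentence probability (the "following form" mentioned above):
P(S) = P(w_1)\,P(w_2 \mid w_1)\cdots P(w_n \mid w_1,\dots,w_{n-1})

% (1) Bigram (2-gram) approximation:
P(S) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})

% (2) Trigram (3-gram) approximation:
P(S) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})

% Maximum likelihood estimate for the bigram probabilities:
P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}\, w_i)}{\mathrm{count}(w_{i-1})}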

Rails sunspot-solr - words with hyphen

Submitted by 爷,独闯天下 on 2019-12-07 13:59:57
Question: I'm using the sunspot_rails gem and everything has worked perfectly so far, but I'm not getting any search results for words with a hyphen. Example: the string "tron" returns a lot of results (the word mentioned in all articles is e-tron), while the string "e-tron" returns 0 results even though this is the correct word mentioned in all my articles. My current schema.xml config:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr

2-gram and 3-gram instead of 1-gram using RWeka

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-07 13:48:17
Question: I am trying to extract 1-grams, 2-grams and 3-grams from the training corpus, using the RWeka NGramTokenizer function. Unfortunately, I am getting only 1-grams. Here is my code:

train_corpus

# clean-up
cleanset1 <- tm_map(train_corpus, tolower)
cleanset2 <- tm_map(cleanset1, removeNumbers)
cleanset3 <- tm_map(cleanset2, removeWords, stopwords("english"))
cleanset4 <- tm_map(cleanset3, removePunctuation)
cleanset5 <- tm_map(cleanset4, stemDocument, language="english")
cleanset6 <- tm_map(cleanset5,

Is there a more efficient way to find most common n-grams?

Submitted by 北慕城南 on 2019-12-07 13:03:01
Question: I'm trying to find the k most common n-grams in a large corpus. I've seen lots of places suggesting the naïve approach: simply scan through the entire corpus and keep a dictionary of the counts of all n-grams. Is there a better way to do this?

Answer 1: In Python, using NLTK:

$ wget http://norvig.com/big.txt
$ python
>>> from collections import Counter
>>> from nltk import ngrams
>>> bigtxt = open('big.txt').read()
>>> ngram_counts = Counter(ngrams(bigtxt.split(), 2))
>>> ngram_counts.most_common(10)
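If the corpus is too large to read into memory in one go, the same Counter can be fed incrementally. A sketch under the assumption that n-grams are allowed to break at line boundaries (bigrams spanning a line break are simply not counted):

from collections import Counter
from nltk import ngrams

ngram_counts = Counter()
with open('big.txt') as fh:
    for line in fh:
        # Update counts line by line so the whole file never sits in memory.
        ngram_counts.update(ngrams(line.split(), 2))

print(ngram_counts.most_common(10))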