n-gram

ngram representation and distance matrix in R

Submitted by 走远了吗 on 2019-12-11 11:04:25
Question: Assume that we have this data: a <- c("ham","bamm","comb"). For 1-grams, this is the matrix representation of the above list:

# h a m b c o
# 1 1 1 0 0 0
# 0 1 2 1 0 0
# 0 0 1 1 1 1

I know that table(strsplit(a, split = "")[[i]]) for i in 1:length(a) will give the separate counts for each of them, but I don't know how to use rbind to combine them into one matrix, since the lengths and column names differ. After that, I want to use either Euclidean or Manhattan distance to find the similarity matrix for
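The question asks for R, but the underlying approach is language-agnostic: align every word's character counts to the union of all characters, then take pairwise distances between the rows. A minimal Python sketch of that idea (not the R/rbind answer itself; the column order here is simply alphabetical):

from collections import Counter
from math import sqrt

words = ["ham", "bamm", "comb"]

# Shared column set: the union of all characters that occur in any word.
alphabet = sorted(set("".join(words)))

# One row of character counts per word, aligned to the shared columns.
counts = [[Counter(w)[ch] for ch in alphabet] for w in words]

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Pairwise distance matrix over the three words (Manhattan shown;
# substitute euclidean for the L2 version).
dist = [[manhattan(u, v) for v in counts] for u in counts]
print(alphabet)
print(counts)
print(dist)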

Solr NGramTokenizerFactory and PatternReplaceCharFilterFactory - Analyzer results inconsistent with Query Results

Submitted by 大兔子大兔子 on 2019-12-11 07:59:17
Question: I am currently using what I (mistakenly) thought would be a fairly straightforward implementation of Solr's NGramTokenizerFactory, but I am getting strange results that are inconsistent between the admin analyzer and the actual query results, and I am hoping for some guidance. I am trying to get user input to match my NGram (minGramSize=2, maxGramSize=2) index. My schema for index and query time is below; in it I strip all non-alphanumeric characters using PatternReplaceCharFilter. I
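Not a fix for the Solr configuration itself, but when the admin analyzer and the query results disagree it can help to reproduce the intended chain outside Solr and compare the output against what the index actually holds. A rough Python approximation of the chain described above (strip non-alphanumerics, then emit 2-grams); this is an illustration, not Solr's actual tokenizer code:

import re

def two_grams(text):
    # Roughly mimic the analysis chain: strip non-alphanumeric characters
    # (as PatternReplaceCharFilter is configured to do here), lowercase,
    # then emit every 2-character gram, as an NGram tokenizer with
    # minGramSize=2 and maxGramSize=2 would.
    cleaned = re.sub(r"[^A-Za-z0-9]", "", text).lower()
    return [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]

print(two_grams("Foo-Bar 42"))   # ['fo', 'oo', 'ob', 'ba', 'ar', 'r4', '42']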

Detecting foreign words

Submitted by ℡╲_俬逩灬 on 2019-12-10 19:39:16
Question: I am writing a script to detect words from a language B in a language A. The two languages are very similar and may contain instances of the same words. The code is here if you are interested in what I have so far: https://github.com/arashsa/language-detection.git I will explain my method here: I create a list of bigrams in language B and a list of bigrams in language A (a small corpus for language B, a large corpus for language A). Then I remove all bigrams that are common to both. Then I go through the text in
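A minimal sketch of the bigram-difference idea as described above, assuming character bigrams and plain word lists for the two corpora (names like corpus_b and corpus_a are placeholders, not the repository's actual code):

def char_bigrams(word):
    # All adjacent two-character slices of a word.
    return {word[i:i + 2] for i in range(len(word) - 1)}

def corpus_bigrams(words):
    grams = set()
    for w in words:
        grams |= char_bigrams(w)
    return grams

def distinctive_bigrams(corpus_b, corpus_a):
    # Bigrams attested in language B but never seen in language A.
    return corpus_bigrams(corpus_b) - corpus_bigrams(corpus_a)

def looks_like_b(word, b_only_bigrams):
    # Flag a word from the text in A if it contains any B-only bigram.
    return bool(char_bigrams(word) & b_only_bigrams)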

R and tm package: create a term-document matrix with a dictionary of one or two words?

Submitted by 谁说我不能喝 on 2019-12-09 07:01:59
Question: Purpose: I want to create a term-document matrix using a dictionary which has compound words, or bigrams, as some of the keywords.

Web search: Being new to text mining and the tm package in R, I went to the web to figure out how to do this. Below are some relevant links that I found:

FAQs on the tm-package website
finding 2 & 3 word phrases using r tm package
counter ngram with tm package in r
findassocs for multiple terms in r

Background: Of these, I preferred the solution that uses
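The question is about R's tm package; purely to illustrate the shape of the intended result (one row per dictionary term, including two-word terms, one column per document), here is a small Python sketch with made-up documents and keywords:

docs = [
    "water quality in the river has improved",
    "the river bank erosion continues",
]
dictionary = ["water quality", "river", "erosion"]   # one- and two-word keywords

def count_term(term, doc):
    tokens = doc.lower().split()
    words = term.split()
    n = len(words)
    # Count the token windows of length n that match the (possibly compound) term.
    return sum(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

# Term-document matrix: one row per dictionary entry, one column per document.
tdm = [[count_term(term, doc) for doc in docs] for term in dictionary]
print(tdm)   # [[1, 0], [1, 1], [0, 1]]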

counting n-gram frequency in python nltk

Submitted by 你离开我真会死。 on 2019-12-08 22:46:44
Question: I have the following code. I know that I can use the apply_freq_filter function to filter out collocations below a frequency count. However, I don't know how to get the frequencies of all the n-gram tuples (in my case bigrams) in a document before I decide what frequency to set for filtering. As you can see, I am using the nltk collocations class.

import nltk
from nltk.collocations import *
line = ""
open_file = open('a_text_file','r')
for val in open_file:
    line += val
tokens = line
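One way to inspect the raw bigram counts before choosing a threshold is the frequency distribution the collocation finder builds internally (ngram_fd). A sketch along those lines; the tokenizer call is an assumption, since the question's own tokenization is cut off above:

import nltk
from nltk.collocations import BigramCollocationFinder

with open('a_text_file', 'r') as fh:
    tokens = nltk.wordpunct_tokenize(fh.read())

finder = BigramCollocationFinder.from_words(tokens)

# finder.ngram_fd is a FreqDist keyed by bigram tuples; looking at it
# shows the counts before any apply_freq_filter call.
for bigram, count in finder.ngram_fd.most_common(20):
    print(bigram, count)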

n-grams from text in python

Submitted by 浪子不回头ぞ on 2019-12-07 20:40:24
Question: An update to my previous post, with some changes: Say that I have 100 tweets. In those tweets, I need to extract: 1) food names, and 2) beverage names. I also need to attach the type (drink or food) and an id number (each item has a unique id) to each extraction. I already have a lexicon with names, types and id numbers:

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id'
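A minimal sketch of the lookup step, assuming whitespace tokenization and that two-word matches should take precedence over one-word matches contained in them; the banana id below is a placeholder, since it is cut off above:

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola':      {'type': 'drink', 'id': 'd_345'},
    'banana':    {'type': 'food',  'id': 'f_000'},  # placeholder id
}

def extract_items(tweet, lexicon):
    tokens = tweet.lower().split()
    found, used = [], set()
    # Try word 2-grams before 1-grams so 'coca cola' is preferred to 'cola'.
    for n in (2, 1):
        for i in range(len(tokens) - n + 1):
            span = set(range(i, i + n))
            if span & used:
                continue  # tokens already covered by a longer match
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                entry = lexicon[phrase]
                found.append({'name': phrase, 'type': entry['type'], 'id': entry['id']})
                used |= span
    return found

print(extract_items("Had a coca cola and a banana for lunch", lexicon))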

From n-gram Chinese text error correction to dependency-tree Chinese grammar correction and synonym finding

Submitted by 有些话、适合烂在心里 on 2019-12-07 20:00:23
Preface: This article briefly explains how to use an n-gram model combined with Chinese pinyin to correct Chinese typos, then introduces the application of minimum edit distance to correcting Chinese search queries. Finally, starting from dependency trees, it explains how to do long-distance text correction (grammar correction), and draws one insight from that method: using the properties of dependency trees together with the ESA algorithm to find synonyms.

The n-gram model: In the Chinese typo-detection setting, we can judge whether a sentence is well formed by computing its probability. Assume a sentence S = {w1, w2, ..., wn}; the problem can then be converted into the following form: P(S) is called the language model, that is, the model used to compute the probability that a sentence is well formed. However, using that formula raises many problems: the parameter space is too large and the information matrix is severely sparse. This is where the n-gram model comes in. It rests on the Markov assumption that the probability of a word depends only on the previous word or the previous few words, so that (1) a word depends only on the 1 preceding word, i.e. the bigram (2-gram); (2) a word depends only on the 2 preceding words, i.e. the trigram (3-gram). The larger the n of the n-gram, the stronger the constraint on the next word, because more information is provided, but the model also becomes more complex and brings more problems, so bigrams or trigrams are generally used. A simple example follows to illustrate the concrete use of n-grams. The n-gram model builds the language model by computing the maximum likelihood estimate (MLE), which is the best estimate given the training data; for the bigram, the formula is as follows:
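The formulas referred to above were embedded as images in the source page and did not survive extraction; the following is a reconstruction from the standard n-gram definitions the text describes, not the article's original rendering:

% Chain rule for the sentence probability (the "following form" mentioned above):
P(S) = P(w_1)\,P(w_2 \mid w_1)\cdots P(w_n \mid w_1,\dots,w_{n-1})

% (1) Bigram (2-gram) approximation:
P(S) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})

% (2) Trigram (3-gram) approximation:
P(S) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})

% Maximum likelihood estimate for the bigram probabilities:
P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}\, w_i)}{\mathrm{count}(w_{i-1})}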

Rails sunspot-solr - words with hyphen

Submitted by 爷,独闯天下 on 2019-12-07 13:59:57
Question: I'm using the sunspot_rails gem and everything has worked perfectly so far, but I'm not getting any search results for words with a hyphen. Example: the string "tron" returns a lot of results (the word mentioned in all articles is e-tron), while the string "e-tron" returns 0 results even though this is the correct word mentioned in all my articles. My current schema.xml config:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr

2-gram and 3-gram instead of 1-gram using RWeka

Submitted by 杀马特。学长 韩版系。学妹 on 2019-12-07 13:48:17
Question: I am trying to extract 1-grams, 2-grams and 3-grams from the training corpus, using the RWeka NGramTokenizer function. Unfortunately, I am getting only 1-grams. Here is my code:

train_corpus

# clean-up
cleanset1 <- tm_map(train_corpus, tolower)
cleanset2 <- tm_map(cleanset1, removeNumbers)
cleanset3 <- tm_map(cleanset2, removeWords, stopwords("english"))
cleanset4 <- tm_map(cleanset3, removePunctuation)
cleanset5 <- tm_map(cleanset4, stemDocument, language="english")
cleanset6 <- tm_map(cleanset5,

Is there a more efficient way to find most common n-grams?

Submitted by 北慕城南 on 2019-12-07 13:03:01
Question: I'm trying to find the k most common n-grams in a large corpus. I've seen lots of places suggesting the naïve approach: simply scan through the entire corpus and keep a dictionary of the counts of all n-grams. Is there a better way to do this?

Answer 1: In Python, using NLTK:

$ wget http://norvig.com/big.txt
$ python
>>> from collections import Counter
>>> from nltk import ngrams
>>> bigtxt = open('big.txt').read()
>>> ngram_counts = Counter(ngrams(bigtxt.split(), 2))
>>> ngram_counts.most_common(10)
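If the corpus is too large to read into memory in one go, the same Counter can be fed incrementally. A sketch under the assumption that n-grams are allowed to break at line boundaries (bigrams spanning a line break are simply not counted):

from collections import Counter
from nltk import ngrams

ngram_counts = Counter()
with open('big.txt') as fh:
    for line in fh:
        # Update counts line by line so the whole file never sits in memory.
        ngram_counts.update(ngrams(line.split(), 2))

print(ngram_counts.most_common(10))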