n-gram

What are the most feasible options for processing the Google Books n-gram dataset using modest resources?

假装没事ソ submitted on 2019-12-24 20:23:29
Question: I need to calculate word co-occurrence statistics for some 10,000 target words and a few hundred context words per target word, from the n-gram corpus of Google Books. Here is the link to the full dataset: Google Ngram Viewer. As is evident, the database is approximately 2.2 TB and contains a few hundred billion rows. To compute the co-occurrence statistics I need to process the whole dataset for each possible pair of target and context word. I am currently considering using Hadoop with Hive for …
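A minimal sketch of the counting step the question describes, assuming the n-gram files have already been downloaded as tab-separated lines of the form ngram, year, match_count; the file name and the tiny target/context sets below are placeholders, not part of the original question:

from collections import defaultdict

targets = {"bank", "interest"}           # stands in for the ~10,000 target words
contexts = {"rate", "river", "money"}    # stands in for the few hundred context words

cooccur = defaultdict(int)
with open("googlebooks-5gram-sample.tsv", encoding="utf-8") as fh:  # hypothetical local extract
    for line in fh:
        ngram, year, match_count = line.rstrip("\n").split("\t")[:3]
        words = ngram.lower().split()
        weight = int(match_count)
        # credit every target/context pair that co-occurs inside this n-gram,
        # weighted by how often the n-gram was observed
        for t in set(words) & targets:
            for c in set(words) & contexts:
                if c != t:
                    cooccur[(t, c)] += weight

print(dict(cooccur))

The same per-row logic maps naturally onto a Hive query or a Hadoop mapper, which is presumably why the asker is considering that stack.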

How to extract n-gram word sequences from text in Postgres

大憨熊 submitted on 2019-12-24 04:57:06
Question: I am hoping to use Postgres to extract sequences of words from text. For example, the whole-word trigrams for the sentence "ed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium" would be "ed ut perspiciatis", "ut perspiciatis unde", "perspiciatis unde omnis", and so on. I have been doing this with R, but I am hoping Postgres can handle it more efficiently. I have seen a similar question asked here, n-grams from text in PostgreSQL, but I don't understand how to …
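For reference, the sliding-window logic the question is after, shown as a small Python sketch rather than SQL, since the Postgres-side details are not given in the excerpt:

import string

def word_ngrams(text, n=3):
    # Drop punctuation, split on whitespace, then slide a window of n words across the list.
    words = text.translate(str.maketrans("", "", string.punctuation)).split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(word_ngrams("ed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium"))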

Form bigrams without stopwords in R

末鹿安然 submitted on 2019-12-24 01:59:15
Question: I have been having some trouble with bigrams in text mining with R recently. The purpose is to find meaningful keywords in news text, for example "smart car" and "data mining". Say I have a string as follows: "IBM have a great success in the computer industry for the past decades..." After removing the stopwords ("have", "a", "in", "the", "for"), it becomes "IBM great success computer industry past decades...". As a result, bigrams like "success computer" or "industry past" will occur. But what I really need is …
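A small sketch of the behaviour the question describes (in Python rather than R; the stopword list and sentence are taken from the excerpt), showing how naive bigram formation after stopword removal produces pairs that were never adjacent in the original text:

stopwords = {"have", "a", "in", "the", "for"}
text = "IBM have a great success in the computer industry for the past decades"

kept = [w for w in text.lower().split() if w not in stopwords]
bigrams = list(zip(kept, kept[1:]))  # pairs of words that are adjacent only after removal
print(bigrams)  # includes ('success', 'computer') and ('industry', 'past')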

Return only results that match enough NGrams with Solr

不问归期 submitted on 2019-12-24 01:25:12
Question: To achieve some degree of fault tolerance with Solr I have started to use the NGramFilterFactory. Here are the interesting bits from the schema.xml: <field name="text" type="text" indexed="true" stored="true"/> <copyField source="text" dest="text_ngram" /> <field name="text_ngram" type="text_ngram" indexed="true" stored="false"/> <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.KeywordTokenizerFactory" /> <filter class="solr…

Generate ngrams with Julia

你说的曾经没有我的故事 submitted on 2019-12-23 12:42:57
Question: To generate word bigrams in Julia, I could simply zip the original list with a copy that drops the first element, e.g.: julia> s = split("the lazy fox jumps over the brown dog") 8-element Array{SubString{String},1}: "the" "lazy" "fox" "jumps" "over" "the" "brown" "dog" julia> collect(zip(s, drop(s,1))) 7-element Array{Tuple{SubString{String},SubString{String}},1}: ("the","lazy") ("lazy","fox") ("fox","jumps") ("jumps","over") ("over","the") ("the","brown") ("brown","dog") To generate a …

Counting bigrams real fast (with or without multiprocessing) - python

泪湿孤枕 submitted on 2019-12-23 07:47:27
Question: Given big.txt from norvig.com/big.txt, the goal is to count the bigrams really fast (imagine that I have to repeat this counting 100,000 times). According to Fast/Optimize N-gram implementations in python, extracting bigrams like this would be the most optimal: _bigrams = zip(*[text[i:] for i in range(2)]) And if I'm using Python 3, the generator won't be evaluated until I materialize it with list(_bigrams) or some other function that does the same. import io from collections import …
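A runnable sketch of that zip-based extraction combined with a Counter; it assumes the bigrams are counted over whitespace tokens, since the excerpt does not show whether the asker works at the word or character level:

from collections import Counter

with open("big.txt", encoding="utf-8") as fh:  # norvig.com/big.txt saved locally
    tokens = fh.read().split()

# zip the token list against itself shifted by one position and count the pairs
bigram_counts = Counter(zip(*[tokens[i:] for i in range(2)]))
print(bigram_counts.most_common(5))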

Favor exact matches over nGram in elasticsearch

百般思念 submitted on 2019-12-22 10:23:19
Question: I am trying to map a field as nGram and 'exact' match, and make the exact matches appear first in the search results. This is an answer to a similar question, but I am struggling to make it work. No matter what boost value I specify for the 'exact' field, I get the same result order each time. This is how my field mapping looks: "name" : { "type" : "multi_field", "fields" : { "name" : { "type" : "string", "boost" : 2.0, "analyzer" : "ngram" }, "exact" : { "type" : "string", "boost" : 4.0, …

Creating a dictionary for each word in a file and counting the frequency of words that follow it

我是研究僧i submitted on 2019-12-22 04:03:49
Question: I am trying to solve a difficult problem and am getting lost. Here's what I'm supposed to do: INPUT: file. OUTPUT: dictionary. Return a dictionary whose keys are all the words in the file (split on whitespace). The value for each word is a dictionary containing each word that can follow the key and a count of the number of times it follows it. You should lowercase everything. Use strip and string.punctuation to strip the punctuation from the words. Example: >>> # example.txt is a file …
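One way the stated spec could be implemented; this is a sketch, not the asker's code, and the function name is a placeholder:

import string

def follower_counts(path):
    # Read the file, lowercase it, and strip punctuation from each whitespace-separated word.
    with open(path, encoding="utf-8") as fh:
        words = [w.strip(string.punctuation) for w in fh.read().lower().split()]
    result = {}
    # For each word, count how many times each following word appears directly after it.
    for current, nxt in zip(words, words[1:]):
        result.setdefault(current, {})
        result[current][nxt] = result[current].get(nxt, 0) + 1
    if words:
        result.setdefault(words[-1], {})  # the last word has no follower but still gets a key
    return result

print(follower_counts("example.txt"))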

Compute ngrams for each row of text data in R

不羁岁月 submitted on 2019-12-21 21:43:24
Question: I have a data column of the following format: Text Hello world Hello How are you today I love stackoverflow blah blah blahdy I would like to compute the 3-grams for each row in this dataset, perhaps using the tau package's textcnt() function. However, when I tried it, it gave me one numeric vector with the n-grams for the entire column. How can I apply this function to each observation in my data separately? Answer 1: Is this what you're after? library("RWeka") library("tm") TrigramTokenizer <- …
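The per-row idea, expressed as a Python sketch since the R answer above is truncated; the row boundaries below are a guess at how the flattened sample data splits:

def trigrams(row):
    # Rows shorter than three words simply yield no trigrams.
    words = row.split()
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

rows = ["Hello world", "Hello How are you today", "I love stackoverflow", "blah blah blahdy"]
# Apply the tokenizer to each row separately instead of to the concatenated column.
per_row = [trigrams(r) for r in rows]
print(per_row)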

Elasticsearch - EdgeNgram + highlight + term_vector = bad highlights

一曲冷凌霜 submitted on 2019-12-21 12:39:55
Question: When I use an analyzer with edgengram (min=3, max=7, front) plus term_vector=with_positions_offsets, with a document having text = "CouchDB", and I search for "couc", my highlight is on "cou" and not "couc". It seems my highlight is only on the minimum matching token "cou", while I would expect it to be on the exact token (if possible) or at least the longest token found. It works fine without analyzing the text with term_vector=with_positions_offsets. What's the impact of removing the term_vector=with…