phrase

Solr: Phrase search when indexed phrase is shorter than the query

北城以北 posted on 2019-12-03 20:39:20
Is it possible to find a document if the indexed field is a part of the queried phrase? It is easy to find "Quick brown fox jumps over the lazy dog" when searching for "brown fox" or "lazy dog". But what if I need to do it the other way round? Here's my situation: I have a short field in a document like "brown fox" or "lazy dog", and I want this document to be found by searching for longer phrases like "Quick brown fox" or "jump over lazy dog". Note: it should be a phrase match, so making all the terms optional in the query wouldn't work. A query like "brown dog" SHOULD NOT match. How
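The requirement can be illustrated outside Solr with a small Python sketch. It is not a Solr configuration: it only shows the matching rule the question describes, namely that a document matches when its whole indexed field occurs as a contiguous run of tokens inside the query. The function names and the whitespace tokenizer are assumptions made for this example.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()


def contains_phrase(query_tokens, phrase_tokens):
    """Return True if phrase_tokens occurs contiguously inside query_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        query_tokens[i:i + n] == phrase_tokens
        for i in range(len(query_tokens) - n + 1)
    )


def find_matching_docs(query, indexed_fields):
    """Return ids of documents whose whole field is a sub-phrase of the query."""
    query_tokens = tokenize(query)
    return [
        doc_id
        for doc_id, field in indexed_fields.items()
        if contains_phrase(query_tokens, tokenize(field))
    ]


# Hypothetical documents taken from the examples in the question.
docs = {1: "brown fox", 2: "lazy dog"}

print(find_matching_docs("Quick brown fox", docs))     # [1]
print(find_matching_docs("jump over lazy dog", docs))  # [2]
print(find_matching_docs("brown dog", docs))           # []
```

The last call returns no documents because "brown dog" contains neither indexed phrase as a contiguous run, which is the behaviour the question asks for.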

How to search phrase queries in inverted index structure?

我只是一个虾纸丫 posted on 2019-12-03 15:14:48
If we want to search for a phrase query like "t1 t2 t3" (t1, t2 and t3 must appear in sequence) in an inverted index structure, which approach should we use?
1. First we look up the term "t1" and find all documents that contain "t1", then do the same for "t2" and then "t3". Then we find the documents in which the positions of "t1", "t2" and "t3" are next to each other.
2. First we look up the term "t1" and find all documents that contain "t1"; then, within those documents, we search for "t2"; then, within that result, we find the documents that contain "t3".
I have a full inverted index. I want to know which
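One common way to combine the two ideas is to intersect the document sets of all terms first and then use stored term positions to check adjacency. The sketch below assumes a positional inverted index of the form term -> {doc_id: [positions]}; the index layout, function names and sample documents are assumptions made for illustration, not something stated in the question.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted ids of documents containing the terms consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []

    # Step 1: keep only documents that contain every term.
    candidates = set(index[terms[0]])
    for term in terms[1:]:
        candidates &= set(index[term])

    # Step 2: for each start position of the first term, require the
    # k-th following term at position start + k.
    hits = []
    for doc_id in sorted(candidates):
        later = [set(index[t][doc_id]) for t in terms[1:]]
        starts = index[terms[0]][doc_id]
        if any(all(s + k + 1 in pos for k, pos in enumerate(later))
               for s in starts):
            hits.append(doc_id)
    return hits


# Hypothetical documents for the demonstration.
docs = {1: "t1 t2 t3", 2: "t1 t3 t2", 3: "x t1 t2 t3 y", 4: "t1 t2"}
index = build_positional_index(docs)
print(phrase_search(index, "t1 t2 t3"))  # [1, 3]
```

Document 2 contains all three terms but not in adjacent order, so it survives the intersection and is rejected by the position check; document 4 is already dropped in step 1 because it lacks "t3".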

Why can't I set the font of a Phrase (or Paragraph)?

落爺英雄遲暮 posted on 2019-12-02 17:20:00
Question: I'm starting with iTextSharp and wondering whether there is any reason why setting the font of a Phrase after construction doesn't work. Is there a reason, or am I missing something?
iTextSharp.text.Font f = PdfFontFactory.GetComic();
f.SetStyle(PdfFontStyle.BOLD);
Color c = Color.DarkRed;
f.SetColor(c.R,c.G,c.B);
f.Size = 20;
Document document = new Document();
try {
    PdfWriter.GetInstance(document, new System.IO.FileStream("PhraseTest.pdf", FileMode.Create));
    document.SetPageSize(PageSize.A4)

Why can't I set the font of a Phrase (or Paragraph)?

笑着哭i posted on 2019-12-02 11:26:48
I'm starting with iTextSharp and wondering whether there is any reason why setting the font of a Phrase after construction doesn't work. Is there a reason, or am I missing something?
iTextSharp.text.Font f = PdfFontFactory.GetComic();
f.SetStyle(PdfFontStyle.BOLD);
Color c = Color.DarkRed;
f.SetColor(c.R,c.G,c.B);
f.Size = 20;
Document document = new Document();
try {
    PdfWriter.GetInstance(document, new System.IO.FileStream("PhraseTest.pdf", FileMode.Create));
    document.SetPageSize(PageSize.A4);
    document.Open();
    Phrase titreFormules = new Phrase("Nos formules",f); //THIS WORKS
    // titreFormules

Counting phrase frequency in Python 3.3.2

依然范特西╮ posted on 2019-12-01 04:27:02
I have been examining different sources on the web and have tried various methods, but I could only find how to count the frequency of unique words, not unique phrases. The code I have so far is as follows:
import collections
import re
wanted = set(['inflation', 'gold', 'bank'])
cnt = collections.Counter()
words = re.findall(r'\w+', open('02.2003.BenBernanke.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)
If possible, I would also like to count the number of times the phrases 'central bank' and 'high inflation' are used in this text. I appreciate any
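One possible way to count multi-word phrases alongside single words is to search the lowercased text with a word-boundary regular expression per term instead of splitting it into words first. The following is a minimal sketch under that assumption; `count_terms` and the sample text are made up for the example, and the sample stands in for the contents of the file named in the question.

```python
import collections
import re


def count_terms(text, wanted):
    """Count whole-word occurrences of each word or phrase in text."""
    text = text.lower()
    counts = collections.Counter()
    for term in wanted:
        # Allow any run of whitespace between the words of a phrase and
        # anchor both ends on word boundaries.
        words = (re.escape(w) for w in term.lower().split())
        pattern = r'\b' + r'\s+'.join(words) + r'\b'
        counts[term] = len(re.findall(pattern, text))
    return counts


# Hypothetical sample text; replace with the contents of the real file.
sample = ("The central bank watches inflation. High inflation worries "
          "the Central  Bank and every other bank.")
wanted = ['inflation', 'gold', 'bank', 'central bank', 'high inflation']

cnt = count_terms(sample, wanted)
for term in wanted:
    print(term, cnt[term])
```

Note that single words are counted independently of phrases, so an occurrence of "central bank" also counts towards "bank".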

R Text Mining: Counting the number of times a specific word appears in a corpus?

独自空忆成欢 posted on 2019-11-30 15:40:46
I have seen this question answered in other languages but not in R. [Specifically for R text mining] I have a set of frequent phrases obtained from a corpus. Now I would like to count the number of times these phrases appear in another corpus. Is there a way to do this with the tm package (or another related package)? For example, say I have an array of phrases, "tags", obtained from CorpusA, and another corpus, CorpusB, of a couple of thousand sub-texts. I want to find out how many times each phrase in tags appears in CorpusB. As always, I appreciate all your help! Ain't perfect

How to prevent splitting specific words or phrases and numbers in NLTK?

久未见 posted on 2019-11-30 09:43:02
Question: I have a problem with text matching: when I tokenize text, specific words, dates and numbers get split. How can I prevent phrases like "run in my family", "30 minute walk" or "4x a day" from being split when tokenizing words in NLTK? They should not result in: ['runs','in','my','family','4x','a','day'] For example: Yes 20-30 minutes a day on my bike, it works great!! gives: ['yes','20-30','minutes','a','day','on','my','bike',',','it','works','great'] I want '20-30 minutes' to be

Counting phrase frequency in Python 3.3.2

ε祈祈猫儿з posted on 2019-11-30 05:03:17
Question: I have been examining different sources on the web and have tried various methods, but I could only find how to count the frequency of unique words, not unique phrases. The code I have so far is as follows:
import collections
import re
wanted = set(['inflation', 'gold', 'bank'])
cnt = collections.Counter()
words = re.findall(r'\w+', open('02.2003.BenBernanke.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)
If possible, I would also like to count the

R Text Mining: Counting the number of times a specific word appears in a corpus?

╄→гoц情女王★ posted on 2019-11-29 23:27:01
Question: I have seen this question answered in other languages but not in R. [Specifically for R text mining] I have a set of frequent phrases obtained from a corpus. Now I would like to count the number of times these phrases appear in another corpus. Is there a way to do this with the tm package (or another related package)? For example, say I have an array of phrases, "tags", obtained from CorpusA, and another corpus, CorpusB, of a couple of thousand sub-texts. I want to find out how many

How to prevent splitting specific words or phrases and numbers in NLTK?

99封情书 posted on 2019-11-29 16:15:06
I have a problem with text matching: when I tokenize text, specific words, dates and numbers get split. How can I prevent phrases like "run in my family", "30 minute walk" or "4x a day" from being split when tokenizing words in NLTK? They should not result in: ['runs','in','my','family','4x','a','day'] For example: Yes 20-30 minutes a day on my bike, it works great!! gives: ['yes','20-30','minutes','a','day','on','my','bike',',','it','works','great'] I want '20-30 minutes' to be treated as a single word. How can I get this behavior? You will be hard pressed to preserve n-grams of
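One workaround, when the phrases to protect are known in advance, is to tokenize as usual and then merge the protected phrases back into single tokens. The sketch below is plain Python and does not call NLTK; `merge_phrases` and the phrase list are assumptions made for this example. NLTK's `MWETokenizer` follows a similar merge-after-tokenizing idea, although this sketch does not reproduce its behaviour.

```python
def merge_phrases(tokens, phrases):
    """Greedily merge known multi-word phrases into single tokens.

    Longer phrases are tried first, and matching is case-insensitive.
    """
    protected = sorted(
        (p.lower().split() for p in phrases if p.strip()),
        key=len,
        reverse=True,
    )
    merged = []
    i = 0
    while i < len(tokens):
        for phrase in protected:
            window = tokens[i:i + len(phrase)]
            if [t.lower() for t in window] == phrase:
                merged.append(" ".join(window))
                i += len(phrase)
                break
        else:
            # No protected phrase starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


# Token list taken from the example in the question.
tokens = ['yes', '20-30', 'minutes', 'a', 'day', 'on', 'my', 'bike',
          ',', 'it', 'works', 'great']
result = merge_phrases(tokens, ["20-30 minutes", "4x a day"])
print(result)
# ['yes', '20-30 minutes', 'a', 'day', 'on', 'my', 'bike', ',', 'it', 'works', 'great']
```

This only handles phrases listed explicitly; open-ended patterns such as arbitrary number ranges followed by a unit would need a rule-based or regular-expression step instead.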