information-retrieval

How to parse the data from Google Alerts?

自闭症网瘾萝莉.ら submitted on 2019-12-17 07:13:14
Question: Firstly, how would you get Google Alerts information into a database, other than by parsing the text of the email message that Google sends you? It seems that there is no Google Alerts API. If you must parse text, how would you go about parsing out the relevant pieces of the email message?

Answer 1: When you create the alert, set "Deliver to" to "Feed"; you can then consume the feed XML as you would any other feed. This is much easier to parse and digest into a database.

Answer 2: class
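A minimal sketch of Answer 1's feed approach in Python, assuming the third-party feedparser library and a placeholder feed URL:

    import feedparser  # third-party: pip install feedparser

    # Hypothetical URL; copy the real one from the alert's "Deliver to: Feed" setting.
    FEED_URL = "https://www.google.com/alerts/feeds/EXAMPLE/EXAMPLE"

    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        # Each entry is one alert hit; these fields map naturally to table columns.
        print(entry.title, entry.link, entry.get("published", ""))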

Fast/Optimize N-gram implementations in python

旧城冷巷雨未停 submitted on 2019-12-17 06:50:10
Question: Which n-gram implementation is fastest in Python? I've tried profiling nltk's ngrams against Scott's zip approach (http://locallyoptimal.com/blog/2013/01/20/elegant-n-gram-generation-in-python/):

    from nltk.util import ngrams as nltkngram
    import this, time

    def zipngram(text, n=2):
        return zip(*[text.split()[i:] for i in range(n)])

    text = this.s

    start = time.time()
    nltkngram(text.split(), n=2)
    print time.time() - start

    start = time.time()
    zipngram(text, n=2)
    print time.time() - start

    [out]
    0.000213146209717
    6
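A fairer micro-benchmark would use timeit and force full evaluation of the lazy generators; a sketch in Python 3, with an illustrative corpus (nltk assumed installed):

    import timeit
    from nltk.util import ngrams as nltk_ngrams  # assumes nltk is installed

    def zip_ngrams(tokens, n=2):
        # zip over n shifted views of the token list; stops at the shortest view.
        return list(zip(*[tokens[i:] for i in range(n)]))

    tokens = ("the quick brown fox jumps over the lazy dog " * 1000).split()

    # list() forces evaluation; both approaches are lazy generators otherwise,
    # so timing them without it measures almost nothing.
    print(timeit.timeit(lambda: list(nltk_ngrams(tokens, 2)), number=10))
    print(timeit.timeit(lambda: zip_ngrams(tokens, 2), number=10))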

Python: tf-idf-cosine: to find document similarity

梦想与她 submitted on 2019-12-17 03:22:23
Question: I was following a tutorial available at Part 1 & Part 2. Unfortunately the author didn't have time for the final section, which involved using cosine similarity to actually find the distance between two documents. I followed the examples in the article with the help of the following link from Stack Overflow; the code mentioned in that link is included below (just to make life easier):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction
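A minimal sketch of the missing final step, assuming scikit-learn and a toy corpus:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "information retrieval with tf-idf",
    ]

    # Each row of the matrix is an L2-normalized tf-idf vector, one per document.
    tfidf = TfidfVectorizer().fit_transform(docs)

    # Cosine similarity of the first document against all documents.
    print(cosine_similarity(tfidf[0], tfidf))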

Writing a program to scrape forums

自古美人都是妖i submitted on 2019-12-13 12:09:48
Question: I need to write a program to scrape forums. Should I write it in Python using the Scrapy framework, or should I use PHP cURL? Also, is there a PHP equivalent to Scrapy? Thanks

Answer 1: I would choose Python because of its superior libxml2 bindings, specifically things like lxml.html and pyQuery. Scrapy has its own libxml2 bindings; I haven't looked at them to test them, though skimming the Scrapy documentation didn't leave me very impressed (I've done lots of scraping just using these parsers and
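A tiny sketch of the lxml.html style the answer mentions; the URL and XPath are placeholders for a real forum's markup, and requests is an assumption here since the answer names no HTTP client:

    import requests
    import lxml.html

    # Hypothetical thread URL; substitute a real one.
    html = requests.get("https://forum.example.com/thread/123").text
    tree = lxml.html.fromstring(html)

    # The XPath is illustrative; inspect the target forum to find its post container.
    for post in tree.xpath('//div[@class="post"]'):
        print(post.text_content().strip()[:80])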

Try to answer some boolean queries using Term-Document-Incidence-Matrix [closed]

旧城冷巷雨未停 submitted on 2019-12-13 11:24:18
Question: I'm trying to answer some simple boolean queries in these forms: NOT x NOT y NOT z, also x AND y AND z, and also x OR y OR z. Here x, y, and z are words, and any of them may belong to a different file.txt, or maybe all of them
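A compact sketch of answering such queries against a term-document incidence matrix, with toy .txt contents and numpy (an assumption) for the bitwise operations:

    import numpy as np

    docs = {"a.txt": "x y", "b.txt": "y z", "c.txt": "x z w"}
    vocab = sorted({t for text in docs.values() for t in text.split()})
    names = list(docs)

    # incidence[i, j] is True iff vocab[i] occurs in document j.
    incidence = np.array([[t in docs[d].split() for d in names] for t in vocab])
    row = {t: incidence[i] for i, t in enumerate(vocab)}

    # x AND y AND z, x OR y OR z, NOT x AND NOT y AND NOT z
    print([names[j] for j in np.where(row["x"] & row["y"] & row["z"])[0]])
    print([names[j] for j in np.where(row["x"] | row["y"] | row["z"])[0]])
    print([names[j] for j in np.where(~row["x"] & ~row["y"] & ~row["z"])[0]])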

Vector Space Model - query vector [0, 0.707, 0.707] calculated

浪子不回头ぞ submitted on 2019-12-13 07:53:43
Question: I'm reading the book "Introduction to Information Retrieval" (Christopher Manning) and I'm stuck on chapter 6, where it introduces the query "jealous gossip", for which it indicates that the associated unit vector is [0, 0.707, 0.707] (https://nlp.stanford.edu/IR-book/html/htmledition/queries-as-vectors-1.html), considering the terms affect, jealous, and gossip. I tried to calculate it by computing the tf-idf, assuming that: tf is equal to 1 for jealous and gossip, and idf is always equal to 0
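For reference, the 0.707 entries come from plain length normalization of the raw term frequencies [0, 1, 1] over (affect, jealous, gossip): dividing by the vector's length sqrt(0^2 + 1^2 + 1^2) = sqrt(2) gives [0, 1/sqrt(2), 1/sqrt(2)] ≈ [0, 0.707, 0.707]. A two-line check, with numpy assumed:

    import numpy as np

    q = np.array([0.0, 1.0, 1.0])   # raw tf for (affect, jealous, gossip)
    print(q / np.linalg.norm(q))    # -> [0. 0.70710678 0.70710678]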

Confusion about precision-recall curve and average precision

放肆的年华 submitted on 2019-12-13 07:27:14
Question: I'm reading a lot about precision-recall curves in order to evaluate my image retrieval system. In particular, I'm reading this article about feature extractors in VLFeat and the Wikipedia page about precision and recall. I understand that this curve is useful to evaluate our system's performance with respect to the number of elements retrieved. So we repeatedly compute precision and recall, retrieving the top element, then the top 2, the top 3, and so on... but my question is: when do we stop? My intuition is: we stop
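A small sketch of the sweep being described: precision and recall at each cutoff k over a ranked result list, with toy relevance labels:

    # 1 marks a relevant result, 0 an irrelevant one, in ranked order.
    ranked = [1, 0, 1, 1, 0, 0, 1, 0]
    total_relevant = sum(ranked)  # in practice: all relevant items in the collection

    hits = 0
    for k, rel in enumerate(ranked, start=1):
        hits += rel
        precision = hits / k
        recall = hits / total_relevant
        print(f"top-{k}: precision={precision:.2f} recall={recall:.2f}")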

Why is solr returning result with only exact search?

怎甘沉沦 submitted on 2019-12-13 06:00:01
Question: I have created a core, secondCore {id, resid, title, name, cat, role, exp}. Consider a sample document:

    {"id": "11", "resid": 384, "title": "perl and java developer", "name": "appnede new name", "cat": "22,11", "role": "new role", "exp": 1}

When I search for title:perl, I get 0 results. I get the above document only if I search for title:"perl and java developer" or title:perl*. Response:

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">2</int>
        <lst name=
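To reproduce the two queries from Python, a sketch using the third-party pysolr client against a local Solr; the host, port, and the suspected cause (title defined as an untokenized string field rather than an analyzed text type) are all assumptions, not confirmed by the question:

    import pysolr  # third-party: pip install pysolr

    solr = pysolr.Solr("http://localhost:8983/solr/secondCore")

    # 0 hits here would be consistent with title being indexed verbatim (a string
    # field) rather than analyzed into tokens (e.g. a text_general field).
    print(solr.search("title:perl").hits)
    print(solr.search('title:"perl and java developer"').hits)  # exact value matches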

Computing the Dot Product for calculating proximity

廉价感情. submitted on 2019-12-11 20:05:57
Question: I have already asked a similar question at Calculating Word Proximity in an inverted Index. However, I felt that the question was too general and not refined enough, so here goes. I have a list which contains the locations of tokens in a document; for each token it goes as

    public List<int> hitLocation;

Let's say the document is: "Java programming language has a name similar to java island in Indonesia however local language in java bears no resemblance to the programming language called java."
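One common way to turn two such position lists into a proximity measure is the smallest pairwise gap, found with a linear merge over the sorted lists; a sketch in Python, where the positions are hypothetical hits for two words in the sample sentence:

    def min_gap(a, b):
        """Smallest |a[i] - b[j]| between two sorted position lists, in O(len(a) + len(b))."""
        i = j = 0
        best = float("inf")
        while i < len(a) and j < len(b):
            best = min(best, abs(a[i] - b[j]))
            # Advance the pointer that lags behind; the gap can only shrink that way.
            if a[i] < b[j]:
                i += 1
            else:
                j += 1
        return best

    # Hypothetical hit locations for "java" and "language" in the sample sentence.
    print(min_gap([0, 9, 16, 22, 28], [2, 18, 27]))  # -> 1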

Calculating Word Proximity in an inverted Index

一曲冷凌霜 submitted on 2019-12-11 19:43:33
Question: As part of a search engine I have developed an inverted index, so I have a list which contains elements of the following type:

    public struct ForwardBarrelRecord
    {
        public string DocId;
        public int hits { get; set; }
        public List<int> hitLocation;
    }

Now this record is against a single word. The hitLocation list contains the locations where a particular word has been found in a document. Now what I want is to calculate the closeness of the elements in one List<int> hitLocation to another List<int> hitLocation and
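A sketch of the same record and one possible closeness score in Python; the record is a direct mirror of the C# struct above, and the scoring rule 1 / (1 + smallest gap) is an illustrative choice, not something the question specifies:

    from dataclasses import dataclass, field

    @dataclass
    class ForwardBarrelRecord:          # Python mirror of the C# struct
        doc_id: str
        hits: int = 0
        hit_locations: list = field(default_factory=list)

    def closeness(r1, r2):
        # Smallest pairwise gap between the two position lists (O(n*m) here for
        # brevity; a sorted two-pointer merge does the same in linear time).
        best = min(abs(p - q) for p in r1.hit_locations for q in r2.hit_locations)
        return 1.0 / (1.0 + best)

    java = ForwardBarrelRecord("doc1", 5, [0, 9, 16, 22, 28])
    language = ForwardBarrelRecord("doc1", 3, [2, 18, 27])
    print(closeness(java, language))    # -> 0.5 (smallest gap is 1)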