information-retrieval

Function Score Query elasticsearch parsing error

烂漫一生 submitted on 2019-12-10 10:19:51
Question: I am trying to run a straightforward function score query in Elasticsearch as:

    {
      "function_score": {
        "query": {
          "term": { "timestamp": { "value": 1396361509, "boost": 0.05 } }
        },
        "script_score": {
          "script": "abs(1396361509 - doc['timestamp'].value)"
        }
      }
    }

but I keep getting an error saying that there is no parser for "function_score": SearchParseException[[test_index][4]: from[-1],size[-1]: Parse Failure [No parser for element [function_score]]]; }{[PKoYz4OLTbOWb6ziP8AIaQ][test_index][1]:
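The parse failure typically happens because function_score is sent at the root of the request body instead of nested under a top-level "query" element. A minimal sketch of the corrected body, built in Python with the field names from the question:

```python
import json

# The function_score clause must be wrapped in a top-level "query" object;
# sending it bare at the root produces "No parser for element [function_score]".
body = {
    "query": {
        "function_score": {
            "query": {
                "term": {"timestamp": {"value": 1396361509, "boost": 0.05}}
            },
            "script_score": {
                "script": "abs(1396361509 - doc['timestamp'].value)"
            },
        }
    }
}

print(json.dumps(body, indent=2))
```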

Building a fast semantic MySQL search engine for private articles from scratch

感情迁移 submitted on 2019-12-09 04:59:11
Question: I am working on a project that will involve full-text and semantic search of articles within the site (if it is not possible to combine them, the user can select either option). These articles are subscription-based and can only be searched after logging in, so they are not accessible to external search engines or their APIs. I have read about Sphinx for full-text keyword search (and I intend to implement it for that aspect), but I am not sure how to go about building a semantic search engine out
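One starting point is vector-space ranking built by hand; note this sketch uses only TF-IDF keyword overlap, whereas a genuinely semantic engine would normally swap in embeddings (word2vec or a sentence encoder). The sample articles, tokenisation, and query here are all made up for illustration; in practice the documents would come from the MySQL table after the login check:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Simple TF-IDF vectors for a list of tokenised documents."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
            for doc in docs]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

articles = [
    "the stock market rallied today".split(),
    "new vaccine shows promise in trials".split(),
    "markets fell after the stock report".split(),
]
query = "stock market report".split()
# Vectorise the query alongside the corpus so idf values are shared.
vecs = tfidf_vectors(articles + [query])
qvec, dvecs = vecs[-1], vecs[:-1]
ranked = sorted(range(len(articles)),
                key=lambda i: cosine(qvec, dvecs[i]), reverse=True)
print(ranked)  # most similar article first
```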

Combine solr's document score with a static, indexed score

前提是你 submitted on 2019-12-08 10:37:56
Question: I have people indexed into Solr based on documents that they have authored. For simplicity's sake, let's say they have three fields: an integer ID, a text field, and a floating-point 'SpecialRank' (a value between 0 and 1 indicating how great the person is). Relevance matching in Solr is all done through the text field. However, I want my final result list to be a combination of relevance to the query as provided by Solr and my own SpecialRank. Namely, I need to re-rank the results based on
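Server-side, this is usually done with a Solr boost function on the indexed field (for example edismax's boost parameter referencing SpecialRank), which avoids re-ranking in the client. A hedged client-side sketch of the same blend, with made-up result rows and an assumed linear combination of the two scores:

```python
# Client-side re-ranking sketch: blend Solr's relevance score (normalised
# to [0, 1]) with the per-document SpecialRank. alpha weights relevance;
# the sample rows below are invented for illustration.

def rerank(results, alpha=0.7):
    """Return results sorted by alpha*relevance + (1-alpha)*SpecialRank."""
    max_score = max(r["score"] for r in results)
    for r in results:
        r["final"] = alpha * (r["score"] / max_score) + (1 - alpha) * r["special_rank"]
    return sorted(results, key=lambda r: r["final"], reverse=True)

results = [
    {"id": 1, "score": 4.2, "special_rank": 0.1},
    {"id": 2, "score": 3.9, "special_rank": 0.9},
    {"id": 3, "score": 1.0, "special_rank": 0.5},
]
for r in rerank(results):
    print(r["id"], round(r["final"], 3))
```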

Language Model through Whoosh in Information Retrieval

浪尽此生 submitted on 2019-12-08 08:06:00
Question: I am working in IR. Can anyone guide me on how to implement a language model in Whoosh? I have already applied TF-IDF and BM25. I am new to IR. As an example, the simplest form of language model simply throws away all conditioning context and estimates each term independently. Such a model is called a unigram language model: P_{uni}(t_1 t_2 t_3 t_4) = P(t_1) P(t_2) P(t_3) P(t_4). There are many more complex kinds of language models, such as bigram language models, which condition on the previous
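The unigram formula above turns into a ranking function by scoring each document with the query likelihood under that document's language model, usually with smoothing against the collection so unseen terms do not zero out the product. A minimal sketch (independent of Whoosh's API, with a toy corpus) using Jelinek-Mercer smoothing:

```python
import math
from collections import Counter

# Query-likelihood ranking with a unigram language model and Jelinek-Mercer
# smoothing: P(t|d) = (1 - lam) * tf(t,d)/|d| + lam * cf(t)/|C|.
docs = [
    "click go the shears boys click click click".split(),
    "click click".split(),
    "metal here".split(),
]
collection = Counter(t for d in docs for t in d)
coll_len = sum(collection.values())

def score(query, doc, lam=0.5):
    """Log query likelihood of the query terms under doc's smoothed model."""
    tf = Counter(doc)
    s = 0.0
    for t in query:
        p = (1 - lam) * tf[t] / len(doc) + lam * collection[t] / coll_len
        if p == 0:
            return float("-inf")  # term unseen in the whole collection
        s += math.log(p)
    return s

query = "click shears".split()
ranked = sorted(range(len(docs)), key=lambda i: score(query, docs[i]), reverse=True)
print(ranked)
```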

Positives/negatives proportion in train set

半城伤御伤魂 submitted on 2019-12-08 07:59:57
Question: I'm trying to get the Rocchio algorithm for relevance feedback to work. I have a query and a few documents marked as positive or negative. For example, I have 60 positives and 337 negatives. I want to train my model (in this case, adjust the query) using part of this dataset and test it on the other part. But with this kind of imbalanced dataset, I'm not sure how many negatives and how many positives to take into the training set. Another problem is that depending on the positives/negatives
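One mitigating observation: in the standard Rocchio update the positive and negative contributions are each averaged over their own set (centroids), so the 60-vs-337 imbalance does not directly skew the weights, and gamma is conventionally kept small anyway. A sketch with made-up term-weight vectors:

```python
# Rocchio update sketch: q' = alpha*q + beta*centroid(pos) - gamma*centroid(neg).
# Because each centroid divides by its own set size, the raw positive/negative
# counts matter less than the beta/gamma choice (0.75 / 0.15 are common).

def rocchio(query, positives, negatives, alpha=1.0, beta=0.75, gamma=0.15):
    terms = (set(query)
             | {t for d in positives for t in d}
             | {t for d in negatives for t in d})
    new_q = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in positives) / len(positives) if positives else 0.0
        neg = sum(d.get(t, 0.0) for d in negatives) / len(negatives) if negatives else 0.0
        w = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
        new_q[t] = max(w, 0.0)  # negative weights are conventionally clipped to zero
    return new_q

q = {"ebola": 1.0}
pos = [{"ebola": 0.8, "africa": 0.6}, {"ebola": 0.5, "outbreak": 0.9}]
neg = [{"football": 1.0}]
print(rocchio(q, pos, neg))
```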

Is there an algorithm for determining the relevance of a text to a theme?

浪尽此生 submitted on 2019-12-08 03:11:50
Question: I want to know what can be used to determine the relevance of a page to a theme like games, movies, etc. Is there research in this area, or is it just a matter of counting how many times relevant words appear? Answer 1: The common choice is supervised document classification on bag-of-words (or bag-of-n-grams) features, preferably with tf-idf weighting. Popular algorithms include Naive Bayes and (linear) SVMs. For this approach, you'll need labeled training data, i.e. documents annotated with
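The baseline the answer recommends can be shown end to end in a few lines; the tiny labeled corpus here is invented, and a real system would add tf-idf weighting and far more training data:

```python
import math
from collections import Counter, defaultdict

# Minimal multinomial Naive Bayes over bag-of-words features with add-one
# (Laplace) smoothing -- the baseline for theme classification.
train = [
    ("new game console release review", "games"),
    ("best rpg games of the year", "games"),
    ("oscar nominated movies this season", "movies"),
    ("film director interview movie premiere", "movies"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for text, label in train:
    word_counts[label].update(text.split())
vocab = {w for c in word_counts.values() for w in c}

def predict(text):
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / len(train))  # class prior
        total = sum(word_counts[label].values())
        for w in text.split():
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(predict("a great movie about film awards"))
```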

Which open-source search engine should be used? [closed]

拟墨画扇 submitted on 2019-12-08 01:32:10
Question: Closed. This question is opinion-based. It is not currently accepting answers. Closed 6 years ago. My aim is to build an aggregator of news feeds and blog feeds so as to make searching/tracking of entities in it easy. I have been looking at many solutions out there, like Terrier, Lucene, SWISH-E, etc. Basically, I could find only 2 sources of comparison studies done on

Relevance of document to multiple keywords

你。 submitted on 2019-12-07 16:26:15
Question: Suppose D is a textual document, and K = <k1, ..., kN> represents a set of terms contained in the document. For instance: D = "What a wonderful day, isn't it?", K = <"wonderful", "day">. My objective is to see whether document D talks about all the words in K as a whole. For instance: D = "The Ebola in Africa is spreading at high speed", K = <"Ebola", "Africa"> is a case in which D is strongly related to K, while: D = "NEWS 1: Ebola is a dangerous disease that is causing thousands of deaths. Many
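One way to capture "all of K as a whole" is to combine per-term scores multiplicatively (a geometric mean) rather than additively, so a document that ignores any one keyword is dragged toward zero. A toy sketch under that assumption, using plain term frequencies as the per-term score:

```python
import math
from collections import Counter

# Sketch: geometric mean of smoothed per-term frequencies. A document that
# lacks even one keyword scores near zero, unlike an additive combination.

def whole_set_score(doc, keywords, eps=1e-6):
    tokens = doc.lower().split()
    tf = Counter(tokens)
    per_term = [tf[k.lower()] / len(tokens) for k in keywords]
    return math.exp(sum(math.log(p + eps) for p in per_term) / len(per_term))

d1 = "The Ebola in Africa is spreading at high speed"
d2 = "Ebola is a dangerous disease that is causing thousands of deaths"
K = ["Ebola", "Africa"]
print(whole_set_score(d1, K) > whole_set_score(d2, K))  # d1 covers both terms
```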

How to handle huge sparse matrices construction using Scipy?

♀尐吖头ヾ submitted on 2019-12-07 15:29:34
Question: So, I am working on a Wikipedia dump to compute the PageRanks of around 5,700,000 pages, give or take. The files are preprocessed and hence are not in XML. They are taken from http://haselgrove.id.au/wikipedia.htm and the format is:

    from_page(1): to(12) to(13) to(14) ...
    from_page(2): to(21) to(22) ...
    ...
    from_page(5,700,000): to(xy) to(xz)

and so on. So basically it's a construction of a [5,700,000 x 5,700,000] matrix, which would just break my 4 GB of RAM. Since it is very, very sparse, that
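With scipy.sparse, only the nonzero links need to be stored: build the matrix as COO triplets and convert to CSR, so memory scales with the number of links rather than N^2. A sketch with a hypothetical 4-page graph standing in for the 5.7M-page dump:

```python
import numpy as np
from scipy.sparse import csr_matrix

# Build the column-stochastic link matrix from an adjacency dict via COO
# triplets; entry (dst, src) = 1/outdegree(src). Tiny invented graph.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = 4

rows, cols, data = [], [], []
for src, outs in links.items():
    for dst in outs:
        rows.append(dst)
        cols.append(src)
        data.append(1.0 / len(outs))
M = csr_matrix((data, (rows, cols)), shape=(n, n))

# Power iteration for PageRank with damping d = 0.85.
d, v = 0.85, np.full(n, 1.0 / n)
for _ in range(100):
    v = (1 - d) / n + d * (M @ v)
print(np.round(v, 3))
```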

Return all tweets from my timeline

。_饼干妹妹 submitted on 2019-12-07 11:21:16
Question: I wish to return ALL the tweets I have ever posted on my timeline. I am using the LINQ to Twitter library like so:

    var statusTweets =
        from tweet in twitterCtx.Status
        where tweet.Type == StatusType.User &&
              tweet.UserID == MyUserID &&
              tweet.Count == 200
        select tweet;

    statusTweets.ToList().ForEach(
        tweet => Console.WriteLine(
            "Name: {0}, Tweet: {1}\n", tweet.User.Name, tweet.Text));

This works fine and brings back the first 200. However, the first 200 seems to be the maximum I can retrieve, as
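The 200 cap is per request; older tweets are fetched page by page by passing a MaxID of the lowest status ID seen minus one (in LINQ to Twitter this corresponds to setting MaxID in the query). A language-neutral sketch of that pagination loop in Python, where fetch_page is a stand-in for the real API call:

```python
# MaxID pagination sketch: each request returns at most 200 tweets,
# newest first; the next request asks for ids at or below (oldest seen - 1).
# ALL_TWEETS and fetch_page are stand-ins for the real timeline and API call.

ALL_TWEETS = [{"id": i, "text": f"tweet {i}"} for i in range(1000, 0, -1)]

def fetch_page(max_id=None, count=200):
    """Simulate one API request: newest-first tweets with id <= max_id."""
    page = [t for t in ALL_TWEETS if max_id is None or t["id"] <= max_id]
    return page[:count]

def fetch_all():
    tweets, max_id = [], None
    while True:
        page = fetch_page(max_id)
        if not page:
            break
        tweets.extend(page)
        max_id = page[-1]["id"] - 1  # continue below the oldest tweet seen
    return tweets

all_tweets = fetch_all()
print(len(all_tweets))
```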