information-retrieval

How to detect duplicates among text documents and return the duplicates' similarity?

荒凉一梦 submitted on 2019-12-28 03:10:09
Question: I'm writing a crawler to get content from some websites, but that content can be duplicated, and I want to avoid duplicates. So I need a function that returns the similarity percentage between two texts, to detect whether two pieces of content may be duplicates. Example: Text 1: "I'm writing a crawler to" Text 2: "I'm writing a some text crawler to get" The compare function should report that text 2 matches text 1 by 5/8 (where 5 is the number of text 2's words that also occur in text 1, compared in word order, and 8 is the total number of words in text 2). If we remove the "some…
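A minimal sketch of such a comparison in Python, using the standard library's difflib to count the words of text 2 that match text 1 in order; the function name and the exact scoring rule are illustrative assumptions, not from the original question:

    import difflib

    def overlap_ratio(text1: str, text2: str) -> float:
        """Fraction of text2's words that match text1 in order (illustrative scoring)."""
        words1, words2 = text1.split(), text2.split()
        matcher = difflib.SequenceMatcher(None, words1, words2)
        # get_matching_blocks() yields order-preserving common runs of words
        matched = sum(block.size for block in matcher.get_matching_blocks())
        return matched / len(words2) if words2 else 0.0

    print(overlap_ratio("I'm writing a crawler to",
                        "I'm writing a some text crawler to get"))  # 5/8 = 0.625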

Find input image (ID, passport) in imagesDB based on similarity

帅比萌擦擦* submitted on 2019-12-25 18:53:49
Question: I would like to decide whether an image is present in a list stored in a DB (e.g. pictures of IDs, passports, student cards, etc). I thought about using a KNN algorithm that returns the K closest images. Options for the distance metric (see the sketch below):
- sum of Euclidean distances between corresponding pixels (img1[pixel_i], img2[pixel_i])
- sum of Euclidean distances between each pixel and every other pixel, multiplied by some factor decreasing with distance (pixel to pixel)
- same as above, but with Manhattan…
Do you know/think of…
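A rough sketch of the first option (sum of Euclidean distances over corresponding pixels) combined with a KNN lookup, assuming all images are grayscale, equally sized, and flattened into rows of a NumPy array; the function name and array layout are illustrative assumptions:

    import numpy as np

    def knn_search(query: np.ndarray, db_images: np.ndarray, k: int = 3):
        """Return indices and distances of the k DB images closest to `query`.

        query:     1-D array, the flattened candidate image
        db_images: 2-D array, one flattened image per row (same length as query)
        """
        # Euclidean distance over corresponding pixels, one value per DB image
        dists = np.linalg.norm(db_images - query, axis=1)
        nearest = np.argsort(dists)[:k]
        return nearest, dists[nearest]

Raw pixel distances are brittle to crops, rotation, and lighting changes; perceptual hashes or learned embeddings are usually more robust for ID/passport matching.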

Query about SVM mapping of the input vector, and the SVM optimization equation

ⅰ亾dé卋堺 submitted on 2019-12-24 09:58:44
Question: I have read through a lot of papers and understand the basic concept of a support vector machine at a very high level. You give it a training input vector that has a set of features, and based on how the "optimization function" evaluates this input vector (call it x), the text associated with x is classified into one of two pre-defined classes (let's say we're talking about text classification); this applies only in the case of binary classification. So my first question is…
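To make the mapping concrete, here is a minimal binary text-classification sketch with a linear SVM in scikit-learn; the toy documents and labels are invented for illustration. The vectorizer turns each text into the feature vector x, and the SVM's optimization finds a separating hyperplane between the two classes:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    texts = ["win a free prize now", "cheap pills online",
             "meeting rescheduled to noon", "quarterly report attached"]
    labels = [1, 1, 0, 0]  # toy labels: 1 = spam, 0 = not spam

    # TfidfVectorizer maps each text to a feature vector x;
    # LinearSVC learns the separating hyperplane between the two classes.
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    model.fit(texts, labels)
    print(model.predict(["free prize meeting"]))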

Extract statistical information from Wikipedia article

十年热恋 submitted on 2019-12-24 09:57:56
Question: I'm currently extracting data from DBpedia articles using SPARQLWrapper for Python, but I can't seem to find how to extract the number of watchers (and other statistical information) for a given article. Is there an easy way to achieve this? I don't mind whether it goes through DBpedia or directly through Wikipedia (using wget, for example). Thanks for any advice.
Answer 1: Getting the number of watchers for an arbitrary article is deliberately restricted, as it is considered a security leak if…
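If going directly through Wikipedia rather than DBpedia, the MediaWiki API exposes watcher counts via prop=info with inprop=watchers; below is a sketch using the requests library. Consistent with the security concern in the answer above, the API omits the count for pages with few watchers unless the account has special rights:

    import requests

    API = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "prop": "info",
        "inprop": "watchers",
        "titles": "Python (programming language)",
        "format": "json",
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    for page in pages.values():
        # "watchers" is omitted when the count is below the privacy threshold
        print(page["title"], page.get("watchers", "hidden"))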

Apache Lucene doesn't filter stop words despite the usage of StopAnalyzer and StopFilter

断了今生、忘了曾经 submitted on 2019-12-21 21:34:10
Question: I have a module based on Apache Lucene 5.5 / 6.0 which retrieves keywords. Everything is working fine except one thing: Lucene doesn't filter stop words. I tried to enable stop-word filtering with two different approaches.
Approach #1:
tokenStream = new StopFilter(
    new ASCIIFoldingFilter(new ClassicFilter(new LowerCaseFilter(stdToken))),
    EnglishAnalyzer.getDefaultStopSet());
tokenStream.reset();
Approach #2:
tokenStream = new StopFilter(
    new ClassicFilter(new LowerCaseFilter(stdToken)), …

SPARQL query against DBpedia to get all property-value pairs of an item

允我心安 submitted on 2019-12-20 07:25:54
Question: I am a novice in the Semantic Web and I would like to retrieve all property-value pairs for "apple" from DBpedia using a SPARQL query. Below is the query I wrote in the http://dbpedia.org/sparql editor, but it returns no results. Could you tell me where my mistake is, please?
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX prov: <http:/…
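A working pattern for this kind of query from Python, using SPARQLWrapper against the public DBpedia endpoint. The resource URI for "apple" is assumed to be http://dbpedia.org/resource/Apple; adjust it if a different entity is intended:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    sparql.setQuery("""
        SELECT ?property ?value
        WHERE { <http://dbpedia.org/resource/Apple> ?property ?value }
    """)
    sparql.setReturnFormat(JSON)

    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["property"]["value"], "=", row["value"]["value"])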

Document search on partial words

感情迁移 submitted on 2019-12-18 15:16:24
Question: I am looking for a document search engine (like Xapian, Whoosh, Lucene, Solr, Sphinx, or others) which is capable of searching partial terms. For example, when searching for the term "brit", the search engine should return documents containing either "britney" or "britain", or in general any document containing a word matching the pattern *brit*. Tangentially, I noticed most engines use TF-IDF (term frequency-inverse document frequency) or its derivatives, which are based on full terms, not partial terms.
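Lucene-family engines generally support wildcard queries of the form *brit* (at some performance cost), and n-gram tokenization is the usual index-time alternative. To illustrate the n-gram idea, here is a tiny trigram index in pure Python; the class and method names are invented for this sketch:

    from collections import defaultdict

    def trigrams(word, n=3):
        return {word[i:i + n] for i in range(len(word) - n + 1)}

    class TrigramIndex:
        def __init__(self):
            self.index = defaultdict(set)  # trigram -> set of doc ids
            self.docs = []

        def add(self, text):
            doc_id = len(self.docs)
            self.docs.append(text)
            for word in text.lower().split():
                for gram in trigrams(word):
                    self.index[gram].add(doc_id)

        def search(self, term):
            # Candidate docs share all trigrams of the term; verify by substring test.
            grams = trigrams(term.lower())
            if not grams:
                return []
            candidates = set.intersection(*(self.index[g] for g in grams))
            return [self.docs[d] for d in candidates
                    if term.lower() in self.docs[d].lower()]

    idx = TrigramIndex()
    idx.add("Britney released a new single")
    idx.add("Travel guide to Britain")
    print(idx.search("brit"))  # matches both documents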

TF-IDF implementations in python

狂风中的少年 submitted on 2019-12-18 11:26:56
Question: What are the standard tf-idf implementations/APIs available in Python? I've come across the one in NLTK. I want to know what other libraries provide this feature.
Answer 1: There is a package called scikit-learn which calculates tf-idf scores. You can refer to my answer to the question Python: tf-idf-cosine: to find document similarity, and also see the code from that question. Thanks.
Answer 2: Try these libraries, which implement the TF-IDF algorithm in Python: http://code.google.com/p/tfidf/ https://github…
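Beyond NLTK, the most common choice today is scikit-learn's TfidfVectorizer; a minimal usage sketch with toy documents:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat",
            "the dog sat on the log"]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)      # sparse matrix: docs x vocabulary
    print(vectorizer.get_feature_names_out())   # the learned vocabulary
    print(tfidf.toarray().round(2))             # tf-idf weights per document

Gensim also provides a TfidfModel for corpus-scale streaming pipelines.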

Confusion about (Mean) Average Precision

我怕爱的太早我们不能终老 submitted on 2019-12-18 04:57:06
Question: In this question I asked for clarification about the precision-recall curve. In particular, I asked whether we have to consider a fixed number of rankings to draw the curve or whether we can reasonably choose it ourselves. According to the answer, the second option is correct. However, I now have a big doubt about the Average Precision (AP) value: AP is used to estimate numerically how good our algorithm is for a given query. Mean Average Precision (MAP) is the average precision over multiple queries. My doubt is: if…
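As a concrete reference, AP for a single ranked result list and MAP over several queries can be computed as in the sketch below (a standard formulation, assumed here rather than taken from the linked question); relevance is a list of 0/1 judgments in rank order:

    def average_precision(relevance):
        """AP for one query: mean of precision@k over the ranks k of relevant hits."""
        hits, precision_sum = 0, 0.0
        for k, rel in enumerate(relevance, start=1):
            if rel:
                hits += 1
                precision_sum += hits / k
        return precision_sum / hits if hits else 0.0

    # MAP is simply the mean of AP across queries
    queries = [[1, 0, 1, 0, 0], [0, 1, 1]]
    print(sum(average_precision(q) for q in queries) / len(queries))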

What is the default list of stopwords used in Lucene's StopFilter?

大兔子大兔子 submitted on 2019-12-17 09:33:25
Question: Lucene has a default StopFilter (http://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/core/StopFilter.html); does anyone know which words are in the list?
Answer 1: The default stop word set in StandardAnalyzer and EnglishAnalyzer comes from StopAnalyzer.ENGLISH_STOP_WORDS_SET, and contains: "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with".