information-retrieval

Information Retrieval: URL hits in a time frame

心已入冬 submitted on 2019-12-11 18:58:46
Question: Algorithm challenge. Problem statement: how would you design a logging system for something like Google, such that you can query the number of times a URL was opened within a given time frame? Input: start_time, end_time, URL1. Output: the number of times URL1 was opened between the start and end time. Some specs: a database is not an optimal solution; a URL might have been opened multiple times at a given timestamp; a URL might have been opened a very large number of times between two timestamps.
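Not the asker's solution, just a minimal in-memory sketch of the interface being described: one sorted list of hit timestamps per URL, with range queries answered by binary search. The HitLog class and its method names are illustrative; a system at this scale would bucket counts (per minute or hour) and shard them by URL rather than keep raw timestamps in one process.

```python
# Minimal sketch (assumed class/method names): per-URL sorted timestamps + binary search.
import bisect
from collections import defaultdict

class HitLog:
    def __init__(self):
        self._hits = defaultdict(list)          # url -> sorted list of epoch timestamps

    def record(self, url, ts):
        bisect.insort(self._hits[url], ts)      # keep each URL's timestamp list sorted

    def count(self, url, start_time, end_time):
        hits = self._hits[url]
        lo = bisect.bisect_left(hits, start_time)    # first hit >= start_time
        hi = bisect.bisect_right(hits, end_time)     # first hit > end_time
        return hi - lo                               # hits with start_time <= ts <= end_time

log = HitLog()
for ts in (100, 105, 105, 250):                 # duplicate timestamps are allowed
    log.record("URL1", ts)
print(log.count("URL1", 100, 200))              # -> 3
```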

Create a Corpus Containing the Vocabulary of Words

强颜欢笑 submitted on 2019-12-11 17:28:41
Question: I am calculating inverse_document_frequency for all the words in my documents dictionary, and I have to show the top 5 documents ranked by that score for a query. But I am stuck in loops while creating the corpus containing the vocabulary of words in the documents. Please help me improve my code. This block of code is used to read my files and remove punctuation and stop words: def wordList(doc): """ 1: Remove Punctuation 2: Remove Stop Words 3: return List of Words """ file =
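The body of wordList is cut off in the excerpt above, so the following is only a hypothetical sketch of such a preprocessing helper and of building the corpus vocabulary from it, assuming NLTK's English stop-word list has been downloaded (nltk.download('stopwords')).

```python
# Hypothetical reconstruction, not the asker's code; assumes the NLTK stopwords corpus.
import string
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def word_list(doc_path):
    """1: remove punctuation  2: remove stop words  3: return a list of words"""
    with open(doc_path, encoding="utf-8") as f:
        text = f.read().lower()
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    return [w for w in text.split() if w not in STOP_WORDS]

def build_vocabulary(doc_paths):
    """The corpus vocabulary: the set of distinct words over all documents."""
    vocab = set()
    for path in doc_paths:
        vocab.update(word_list(path))
    return vocab
```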

IR and QA - Beginner Project Scope

五迷三道 submitted on 2019-12-11 13:31:13
Question: I have been brainstorming for an undergraduate project in the question-answering domain, a project that has components of both IR and NLP. The first thing that popped up was, of course, factoid question answering, but that seemed to be an already-conquered problem (IBM Watson!). Non-factoid QA seems interesting, so I took it up. Now we are in the scoping-out phase of the project description, so from the ambitious goal of answering any question put up by the user, I need to scope our project down. So I

Offline synonyms dictionary for a search application

主宰稳场 submitted on 2019-12-11 10:26:24
Question: I'm trying to build a smart search-engine application that gets synonyms of the words in the question and queries my database with each of the generated synonyms. The problem is that I'm searching for a way to get all synonyms of the words in the question using a dictionary or something similar that could, in the end, offer: 1) direct synonyms, like film > movie, football > soccer; 2) optionally, a matching phrase, like population size > number of citizens; 3) something that is fast and reliable.
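One common offline resource (not mentioned in the question) is WordNet, which is bundled with NLTK in Python. A minimal sketch, assuming the WordNet corpus has been downloaded (nltk.download('wordnet')):

```python
# Illustrative sketch using WordNet via NLTK; covers direct synonyms only.
from nltk.corpus import wordnet as wn

def synonyms(word):
    """Collect the lemma names of every WordNet synset that contains `word`."""
    result = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            result.add(lemma.name().replace("_", " "))
    result.discard(word)
    return sorted(result)

print(synonyms("film"))      # e.g. includes 'movie', 'flick', 'cinema', ...
```

WordNet handles direct synonym pairs like film > movie well; paraphrases such as population size > number of citizens usually need a phrase resource or word embeddings rather than a dictionary lookup.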

TF-IDF of strings from a CSV file

我们两清 submitted on 2019-12-11 07:25:14
Question: My test.csv file is (without a header):

very good, very bad, you are great
very bad, good restaurent, nice place to visit

I want my corpus to be split on "," so that my final DocumentTermMatrix becomes:

docs   very good   very bad   you are great   good restaurent   nice place to visit
doc1   tf-idf      tf-idf     tf-idf          0                 0
doc2   0           tf-idf     0               tf-idf            tf-idf

I am able to produce the above DTM correctly if I don't load the documents from the CSV file, like below: library(tm) docs <- c(D1 = "very good, very
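The question itself uses R's tm package; purely to illustrate the underlying idea, here is an equivalent sketch in Python with scikit-learn, where the comma-separated phrases become the matrix's terms by supplying a custom tokenizer that splits on commas instead of whitespace.

```python
# Same idea in Python/scikit-learn (not the asker's R code): split documents on ","
# so each comma-separated phrase is treated as one term of the TF-IDF matrix.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "very good, very bad, you are great",
    "very bad, good restaurent, nice place to visit",
]

vectorizer = TfidfVectorizer(
    tokenizer=lambda text: [t.strip() for t in text.split(",")],
    token_pattern=None,      # the default word pattern is ignored when a tokenizer is given
    lowercase=False,
)
dtm = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # the five phrases, in alphabetical order
print(dtm.toarray())                        # 2 x 5 matrix of tf-idf weights (0 where absent)
```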

How does cosine similarity differ from Okapi BM25?

半腔热情 submitted on 2019-12-11 03:07:34
Question: I'm conducting research using Elasticsearch. I was planning to use cosine similarity, but I noticed that it is unavailable and that BM25 is the default scoring function instead. Is there a reason for that? Is cosine similarity unsuitable for querying documents? Why was BM25 chosen as the default? Thanks. Answer 1: For a long time Elasticsearch used a TF/IDF algorithm to score query similarity, but a number of versions ago it changed to BM25, which is more effective. You can read about it in the documentation. And
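For intuition about the difference, here is the per-term BM25 contribution in its common Lucene-style form, illustrative only and not Elasticsearch's actual code: unlike a plain TF-IDF/cosine score, it saturates as the term frequency grows (controlled by k1) and normalises by document length (controlled by b).

```python
# Illustrative BM25 per-term score with the usual defaults k1=1.2, b=0.75.
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))      # Lucene-style IDF
    length_norm = 1 - b + b * doc_len / avg_doc_len           # penalise long documents
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)      # tf contribution saturates

# The score flattens out with increasing term frequency, unlike raw tf-idf:
for tf in (1, 2, 5, 20, 100):
    print(tf, round(bm25_term_score(tf, df=10, n_docs=1000, doc_len=100, avg_doc_len=120), 3))
```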

Storing an inverted index

大兔子大兔子 submitted on 2019-12-10 17:49:57
Question: I am working on an information-retrieval project. I have built a full inverted index using Hadoop/Python; Hadoop outputs the index as (word, document list) pairs, which are written to a file. For quick access, I have created a dictionary (hash table) from that file. My question is: how do I store such an index on disk so that it still has quick access time? At present I am storing the dictionary with Python's pickle module and loading from it, but that brings the whole index into memory at once (or
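One possible approach (not the asker's code) is the standard-library shelve module: it stores the pickled posting lists in a dbm file keyed by word, so each lookup unpickles only the entry it needs instead of loading the whole index into memory.

```python
# Sketch: persist the inverted index with shelve and look words up without loading it all.
import shelve

def write_index(index, path="inverted_index.db"):
    # index: {word: [doc_id, ...]} as produced from the Hadoop output
    with shelve.open(path, flag="n") as db:     # "n": create a new, empty database
        for word, postings in index.items():
            db[word] = postings

def lookup(word, path="inverted_index.db"):
    with shelve.open(path, flag="r") as db:     # read-only; only this entry is unpickled
        return db.get(word, [])

write_index({"hadoop": ["doc1", "doc7"], "python": ["doc2"]})
print(lookup("hadoop"))                         # -> ['doc1', 'doc7']
```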

How does Facebook Graph Search work? [closed]

落花浮王杯 submitted on 2019-12-10 17:39:37
Question: I'm guessing I'm on thin ice with regard to asking a question that can be answered rather than just discussed, since from my research Facebook's Graph Search seems to be in stealth mode, with nothing much officially shared

Dynamic regex for date time formats

[亡魂溺海] submitted on 2019-12-10 16:34:34
Question: Is there an existing solution for creating regular expressions dynamically out of a given date-time format pattern? Which date-time format flavour is supported does not matter (Joda DateTimeFormat, java.text.SimpleDateFormat, or others). I.e., for a given date-time format (for example "dd/MM/yyyy hh:mm"), it would generate the corresponding regular expression to match date-times in the specified format. Answer 1: I guess you have a limited alphabet that your time formats can be constructed from. That means,
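A minimal sketch of the token-substitution idea the answer hints at, assuming a small fixed alphabet of pattern tokens: replace each known token with a digit group and escape everything else as a literal.

```python
# Sketch: turn a SimpleDateFormat-style pattern into a regular expression.
import re

TOKENS = {              # pattern token -> regex fragment (extend as needed)
    "yyyy": r"\d{4}",
    "dd":   r"\d{2}",
    "MM":   r"\d{2}",
    "HH":   r"\d{2}",
    "hh":   r"\d{2}",
    "mm":   r"\d{2}",
    "ss":   r"\d{2}",
}

def pattern_to_regex(pattern):
    # longest tokens first so "yyyy" is not consumed as two shorter pieces
    alternation = "|".join(sorted(TOKENS, key=len, reverse=True))
    parts = re.split(f"({alternation})", pattern)            # keep the tokens in the result
    return "".join(TOKENS.get(p, re.escape(p)) for p in parts)

regex = pattern_to_regex("dd/MM/yyyy hh:mm")
print(bool(re.fullmatch(regex, "11/12/2019 18:58")))         # -> True
print(bool(re.fullmatch(regex, "2019-12-11 18:58")))         # -> False
```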

Retrieve information from a URL

倾然丶 夕夏残阳落幕 submitted on 2019-12-10 11:08:50
Question: I want to make a program that will retrieve some information from a URL. For example, given the URL below, from LibraryThing, how can I retrieve all the words below the "TAGS" tab, like Black Library, fantasy, Thanquol & Boneripper, Thanquol and Bone Ripper, Warhammer? I am thinking of using Java and designing a data-mining wrapper, but I am not sure how to start. Can anyone give me some advice? EDIT: You gave me excellent help, but I want to ask something else. For every tag we can see how many times
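The asker mentions Java, where jsoup would be the usual choice; the same idea is sketched below in Python with requests and BeautifulSoup. The CSS selector is a placeholder: the real LibraryThing page would have to be inspected to find the element that wraps the tag list.

```python
# Sketch only; "div.tags a" is a hypothetical selector, not LibraryThing's real markup.
import requests
from bs4 import BeautifulSoup

def fetch_tags(url, selector="div.tags a"):
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.select(selector)]

# tags = fetch_tags("https://www.librarything.com/work/...")   # URL elided in the question
# print(tags)                                                   # e.g. ['Black Library', 'fantasy', ...]
```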