inverted-index

Calculating Word Proximity in an Inverted Index

Submitted by 一曲冷凌霜 on 2019-12-11 19:43:33
Question: As part of a search engine I have developed an inverted index, so I have a list containing elements of the following type:

    public struct ForwardBarrelRecord
    {
        public string DocId;
        public int hits { get; set; }
        public List<int> hitLocation;
    }

This record is kept per word. hitLocation contains the locations where a particular word has been found in a document. What I want now is to calculate the closeness of the elements in one List<int> hitLocation to another List<int> hitLocation and…
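One common closeness measure for two such position lists is the minimum pairwise distance, which can be computed in a single linear pass when both lists are sorted. A minimal sketch of that idea (in Python rather than the question's C#, and assuming both hit lists are sorted ascending):

```python
def min_distance(positions_a, positions_b):
    """Smallest gap between any position in positions_a and any in positions_b.
    Both lists are assumed to be sorted ascending."""
    i = j = 0
    best = float("inf")
    while i < len(positions_a) and j < len(positions_b):
        best = min(best, abs(positions_a[i] - positions_b[j]))
        # Advance whichever pointer lags behind to try to close the gap.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best

# Word A at positions [3, 17, 40], word B at [15, 60]: the closest pair is 17 and 15.
print(min_distance([3, 17, 40], [15, 60]))  # 2
```

The smaller the returned gap, the more tightly the two words co-occur; a proximity score can then be derived from it, e.g. 1 / (1 + gap).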

Storing an inverted index

Submitted by 大兔子大兔子 on 2019-12-10 17:49:57
Question: I am working on an Information Retrieval project. I have built a full inverted index using Hadoop/Python. Hadoop outputs the index as (word, document list) pairs, which are written to a file. For quick access, I have created a dictionary (hash table) from that file. My question is: how do I store such an index on disk so that it still has a quick access time? At present I am storing the dictionary using Python's pickle module and loading from it, but that brings the whole index into memory at once (or…
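One common way to keep the postings on disk while still getting dictionary-style lookups is a disk-backed key-value store such as Python's built-in shelve module, which only loads the entries you actually ask for. A minimal sketch, assuming the Hadoop output can be re-read as tab-separated "word<TAB>doc1,doc2,..." lines (the file names and format here are illustrative):

```python
import shelve

# Build the on-disk index once from the Hadoop output file.
with shelve.open("inverted_index.db") as index, open("hadoop_output.txt") as src:
    for line in src:
        word, doclist = line.rstrip("\n").split("\t", 1)
        index[word] = doclist.split(",")

# Later lookups reopen the shelf and pull only the requested posting list into memory.
with shelve.open("inverted_index.db", flag="r") as index:
    print(index.get("hadoop", []))
```

For larger indexes the same pattern works with SQLite or another embedded key-value store; the point is to fetch postings per word instead of unpickling the whole dictionary.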

Some questions related to SphinxSE and RT indexes

Submitted by ↘锁芯ラ on 2019-12-08 08:32:31
Question: I am considering using Sphinx search in one of my projects, so I have a few questions related to it. When using SphinxSE and an RT index, every UPDATE or INSERT in the SphinxSE table will update the index, right? No need to call indexer or anything? Can I search on both tags (user-entered keywords for a document) and the content, and give more relevance to the tag matches? And if that is possible, how do I implement the tag search (right now I have the tags in separate tables, like an inverted index)? For the filter…
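On the first point: RT indexes are updated at write time through SphinxQL (which speaks the MySQL wire protocol), so no indexer run is involved. A hedged sketch, assuming searchd listens on the default SphinxQL port 9306 and an RT index named rt_docs with full-text fields title, content and tags has been configured (all names here are illustrative):

```python
import pymysql  # SphinxQL is spoken over the MySQL protocol

conn = pymysql.connect(host="127.0.0.1", port=9306, user="")
with conn.cursor() as cur:
    # REPLACE (or INSERT) into an RT index makes the document searchable immediately.
    cur.execute(
        "REPLACE INTO rt_docs (id, title, content, tags) VALUES (%s, %s, %s, %s)",
        (1, "Inverted indexes", "Posting lists explained ...", "search indexing"),
    )
    # Match against tags and content, weighting tag hits much higher.
    cur.execute(
        "SELECT id FROM rt_docs WHERE MATCH('@(tags,content) inverted') "
        "OPTION field_weights=(tags=10, content=1)"
    )
    print(cur.fetchall())
```

Boosting tag matches this way requires the tags to be a full-text field of the same index rather than a separate table.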

Using cPickle to serialize a large dictionary causes MemoryError

Submitted by 那年仲夏 on 2019-12-05 23:03:35
Question: I'm writing an inverted index for a search engine on a collection of documents. Right now, I'm storing the index as a dictionary of dictionaries. That is, each keyword maps to a dictionary of docIDs -> positions of occurrence. The data model looks something like:

    {word : { doc_name : [location_list] } }

Building the index in memory works fine, but when I try to serialize it to disk, I hit a MemoryError. Here's my code:

    # Write the index out to disk
    serializedIndex = open(sys.argv[3], 'wb')
    cPickle…
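The MemoryError typically comes from pickling the entire nested dictionary as a single object, especially with the memory-hungry default protocol 0. Two usual workarounds are to dump straight to the file object with the highest binary protocol, or to pickle the index one keyword at a time so that no single huge buffer is ever built. A sketch of the second approach (function and file names are illustrative):

```python
try:
    import cPickle as pickle  # Python 2, as in the question
except ImportError:
    import pickle             # Python 3

def dump_index(index, path):
    """Write {word: {doc_name: [positions]}} as one small pickle per keyword."""
    with open(path, "wb") as out:
        for word, postings in index.items():
            pickle.dump((word, postings), out, protocol=pickle.HIGHEST_PROTOCOL)

def load_index(path):
    """Rebuild the dictionary by reading the per-keyword pickles back in order."""
    index = {}
    with open(path, "rb") as src:
        while True:
            try:
                word, postings = pickle.load(src)
            except EOFError:
                break
            index[word] = postings
    return index
```

Loading still materialises the whole index; if that is also too large, a disk-backed store such as shelve or SQLite avoids holding everything in memory at once.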

What is the best way to build an inverted index?

Submitted by ℡╲_俬逩灬. on 2019-12-04 22:43:30
I'm building a small web search engine for searching about 1 million web pages, and I want to know what the best way to build the inverted index is. Using a DBMS, or what…? Considering many different angles, like storage cost, performance, and the speed of indexing and querying? And I don't want to use any open-source project for this; I want to build my own!

Answer: Perhaps you might want to elaborate on why you do not wish to use F/OSS tools like Lucene or Sphinx. Most of the current closed-source database managers have some sort of full-text indexing capability. Given its popularity, I'd guess most also have pre…
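For scale context, the core data structure is simply a mapping from each term to a posting list of (document id, positions); at a million pages the hard parts are tokenisation, disk layout and merging rather than the structure itself. A toy in-memory builder in Python (the tokenisation here is deliberately naive):

```python
import re
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns {term: {doc_id: [positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(re.findall(r"\w+", text.lower())):
            index[token][doc_id].append(pos)
    return index

docs = {1: "the quick brown fox", 2: "the lazy dog and the quick dog"}
index = build_index(docs)
print(dict(index["quick"]))  # {1: [1], 2: [5]}
```

A real engine at this scale would build partial indexes in memory, spill them to disk, and merge the sorted runs, which is essentially what the Hadoop-based approaches in the other questions on this page do.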

Hadoop inverted index without recurrence of file names

Submitted by 青春壹個敷衍的年華 on 2019-12-04 18:12:48
What I have in the output is:

    word     file
    -----    ------
    wordx    Doc2, Doc1, Doc1, Doc1, Doc1, Doc1, Doc1, Doc1

What I want is:

    word     file
    -----    ------
    wordx    Doc2, Doc1

    public static class LineIndexMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {
        private final static Text word = new Text();
        private final static Text location = new Text();

        public void map(LongWritable key, Text val,
                OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
            String fileName = fileSplit.getPath().getName();…
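The usual fix is to deduplicate the file names per word before emitting, either in a combiner or in the reducer, for example by collecting them into a set. The idea, sketched as a Hadoop Streaming style reducer in Python (a tab-separated word<TAB>filename input, sorted by word, is assumed):

```python
import sys

def emit(word, files):
    if word is not None:
        print("%s\t%s" % (word, ", ".join(sorted(files))))

current_word, files = None, set()
for line in sys.stdin:
    word, filename = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        emit(current_word, files)          # flush the previous word's unique files
        current_word, files = word, set()
    files.add(filename)                    # the set silently drops duplicates
emit(current_word, files)                  # flush the last word
```

In the Java reducer from the question, the equivalent change is to accumulate the incoming file names in a HashSet<String> and join the set when writing the output value.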

How to search phrase queries in an inverted index structure?

Submitted by 我只是一个虾纸丫 on 2019-12-03 15:14:48
If we want to search a phrase query like "t1 t2 t3" (t1, t2, t3 must appear in that order) in an inverted index structure, which of these ways should we use?

1. First search for the term "t1" and find all documents that contain "t1", then do the same for "t2" and then for "t3". Then find the documents in which the positions of "t1", "t2" and "t3" are next to each other.
2. First search for the term "t1" and find all documents that contain "t1"; then, within the documents we found, search for "t2"; and within the result of that, find the documents that contain "t3".

I have a full inverted index. I want to know which…
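Option 2 is essentially option 1 done incrementally: intersect the posting lists term by term and only keep a document while each successive term occurs at the position immediately after a still-surviving start position. A sketch over a positional index shaped like {term: {doc_id: [positions]}} (the example data is made up):

```python
def phrase_search(index, terms):
    """Return the doc_ids that contain the terms as a consecutive phrase."""
    if not terms or any(t not in index for t in terms):
        return set()
    # candidates: doc_id -> positions where the phrase could start.
    candidates = {doc: set(pos) for doc, pos in index[terms[0]].items()}
    for offset, term in enumerate(terms[1:], start=1):
        postings = index[term]
        survivors = {}
        for doc, starts in candidates.items():
            if doc not in postings:
                continue
            positions = set(postings[doc])
            kept = {p for p in starts if p + offset in positions}
            if kept:
                survivors[doc] = kept
        candidates = survivors
    return set(candidates)

index = {
    "t1": {1: [0, 10], 2: [3]},
    "t2": {1: [1], 2: [7]},
    "t3": {1: [2], 2: [8]},
}
print(phrase_search(index, ["t1", "t2", "t3"]))  # {1}
```

Because each round shrinks the candidate set, the terms are usually processed from rarest to most frequent in practice.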

What is the difference between a secondary index and an inverted index in Cassandra?

Submitted by 久未见 on 2019-12-03 13:27:23
Question: When I read about these two, I thought both of them were describing the same approach; I googled but found nothing. Is the difference in the implementation, i.e. Cassandra maintains the secondary index itself, but an inverted index has to be implemented by me? Which is faster for searching, by the way?

Answer 1: The main difference is that secondary indexes in Cassandra are not distributed in the same way a manual inverted index would be. With the inbuilt secondary indexes, each node indexes the data it stores locally (using the LocalPartitioner). With manual indexing, the indexes are distributed independently of the…
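To make the contrast concrete, a built-in secondary index is a single CREATE INDEX statement that Cassandra maintains per node, while a "manual inverted index" is just an extra table partitioned by the indexed value that you keep in sync yourself. A hedged sketch with the DataStax Python driver, assuming a demo_ks keyspace with a users table that has a user_id uuid primary key and an email text column (all names are illustrative):

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo_ks")

# Built-in secondary index: Cassandra maintains it, but each node only indexes
# the rows it owns, so a lookup may fan out to many nodes.
session.execute("CREATE INDEX IF NOT EXISTS ON users (email)")
rows = session.execute("SELECT user_id FROM users WHERE email = %s", ["a@example.com"])

# Manual inverted index: a separate table keyed by the indexed value, so the
# lookup is a single-partition read, but every write must update both tables.
session.execute(
    "CREATE TABLE IF NOT EXISTS users_by_email ("
    "  email text, user_id uuid, PRIMARY KEY (email, user_id))"
)
session.execute(
    "INSERT INTO users_by_email (email, user_id) VALUES (%s, uuid())",
    ["a@example.com"],
)
rows = session.execute(
    "SELECT user_id FROM users_by_email WHERE email = %s", ["a@example.com"]
)
print(list(rows))
```

Which is faster depends on the query: the manual table is a targeted single-partition read, while the built-in index can require contacting many nodes for low-selectivity values.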

How do search engines merge results from an inverted index?

Submitted by 限于喜欢 on 2019-12-02 21:49:37
How do search engines merge results from an inverted index? For example, if I searched the inverted indexes of the words "dog" and "bat", there would be two huge lists of every document that contains one of the two words. I doubt that a search engine walks through these lists, one document at a time, and tries to find matches between the lists. What is done algorithmically to make this merging process blazing fast?

Answer (jkff): Actually, search engines do merge these document lists. They gain good performance by using other techniques, the most important of which is pruning: for…
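The baseline really is a linear merge of the two sorted posting lists, the same pass as the merge step of merge sort; skip pointers, galloping search and pruning are then layered on top to jump over stretches that cannot match. The baseline intersection as a sketch:

```python
def intersect_postings(a, b):
    """Intersect two sorted doc-id lists in a single linear pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1   # a real engine may use skip pointers to jump many entries here
        else:
            j += 1
    return out

print(intersect_postings([1, 4, 7, 20, 99], [4, 5, 20, 101]))  # [4, 20]
```

Because both lists are kept sorted by document id, the merge is O(len(a) + len(b)) even for huge lists, and it can stream from disk without holding everything in memory.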