How do you Index Files for Fast Searches?

太阳男子 2021-01-31 12:51

Nowadays, Microsoft and Google will index the files on your hard drive so that you can search their contents quickly.

What I want to know is how do they do this? Can yo

3 answers

[愿得一人] (original poster)

2021-01-31 13:13
The simple case is an inverted index.

The most basic algorithm is simply:
- scan the file for words, creating a list of unique words
- normalize and filter the words
- place an entry tying that word to the file in your index
The details are where things get tricky, but the fundamentals are the same.

By "normalize and filter", I mean things like converting everything to lowercase, removing common "stop words" (the, if, in, a, etc.), and possibly "stemming" (stripping common suffixes such as verb endings and plurals).

After that, you've got a unique list of words for the file and you can build your index off of that.
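To make the steps above concrete, here is a minimal sketch in Python. It is not how Microsoft or Google actually do it. The stop-word list, the crude `naive_stem` suffix stripper, and the function names are all assumptions made up for illustration; a real indexer would use a proper stemmer and a far larger stop list.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "if", "in", "a", "an", "and", "of", "to", "is"}

def naive_stem(word):
    """Crude stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Scan text for words, lowercase them, drop stop words, stem the rest."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {naive_stem(w) for w in words if w not in STOP_WORDS}

def build_index(documents):
    """Map each normalized word to the set of file names containing it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in normalize(text):
            index[word].add(name)
    return index

def search(index, query):
    """Return the files that contain every normalized query word."""
    terms = normalize(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

documents = {
    "a.txt": "The cat is sleeping in the garden",
    "b.txt": "Cats and dogs played in a park",
}
index = build_index(documents)
print(search(index, "sleeping cat"))
```

The query goes through the same `normalize` step as the files, so "cats" and "cat" end up on the same index entry. A search is then a set lookup per word rather than a scan of every file.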

Beyond that, there are optimizations for reducing storage, and techniques for checking the locality of words (is "this" near "that" in the document, for example).
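One common way to support that kind of locality check is to store word positions in the index as well, often called a positional index. The sketch below is only an illustration under that assumption. The function names and the `max_distance` parameter are hypothetical, and stop-word filtering is skipped so that words like "this" and "that" stay searchable.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into words, keeping their order."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(documents):
    """Map word -> file name -> list of positions where the word occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            index[word][name].append(position)
    return index

def near(index, first, second, max_distance):
    """Return files where the two words occur within max_distance words."""
    docs_first = index.get(first, {})
    docs_second = index.get(second, {})
    matches = set()
    # Only files containing both words can match.
    for name in docs_first.keys() & docs_second.keys():
        if any(abs(p - q) <= max_distance
               for p in docs_first[name]
               for q in docs_second[name]):
            matches.add(name)
    return matches

documents = {
    "a.txt": "this is right next to that",
    "b.txt": "this sentence goes on for quite a long while before that",
}
index = build_positional_index(documents)
print(near(index, "this", "that", 5))
```

Storing positions makes the index noticeably larger, which is part of why the storage optimizations matter.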

But that's the fundamental way it's done.