What searching algorithm/concept is used in Google?

渐次进展 2021-02-02 03:37

8 Answers
  • 2021-02-02 04:17

    Inverted indexes and MapReduce are the basics of most search engines (I believe). You build an index over the content and run queries against that index to find and rank relevant documents. Google, however, does much more than keep a simple index of where each word occurs: it also records how many times the word appears, where it appears, where it appears in relation to other words, the ordering, and so on. Another simple concept in use is "stop words" such as "and" and "the" (basically "simple" words that occur often and are generally not the focus of a query). In addition, they employ things like PageRank (mentioned by TStamper) to order pages by relevance and importance.
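
    As a rough illustration of an index that records more than mere presence, here is a minimal sketch of a positional index with stop-word filtering. The stop-word list, the function name `build_positional_index`, and the sample documents are all made up for the example; this is not Google's actual data structure.

```python
from collections import defaultdict

# Hypothetical stop-word list; real engines use their own, much larger lists.
STOP_WORDS = {"and", "the", "a", "of", "is"}

def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} so counts and proximity can be derived."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            if word not in STOP_WORDS:
                index[word][doc_id].append(position)
    return index

# Made-up documents for the example.
docs = {
    "d1": "the quick fox and the lazy dog",
    "d2": "the fox is the fox",
}
index = build_positional_index(docs)

print(dict(index["fox"]))        # positions of "fox" in each document
print(len(index["fox"]["d2"]))   # how many times "fox" occurs in d2
```

    Because positions are stored, the same structure answers "how often" (list length) and "how close to another word" (difference between positions).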

    MapReduce basically takes one job, divides it into smaller jobs, and lets those smaller jobs run on many systems (partly for scalability and partly for speed). If I recall correctly, Google was able to distribute jobs to "average" computers instead of server-grade machines. Since the processing capability of a single computer is reaching a peak, many technologies are heading towards cloud computing, where a job is done by many physical machines.
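
    The split-then-combine idea can be sketched on a single machine with a word count. In this toy version, threads stand in for separate computers, and the function names `map_phase`, `shuffle`, and `reduce_phase` are illustrative only; it shows the shape of the model, not Google's implementation.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word, 1) for word in chunk.lower().split()]

def shuffle(mapped_chunks):
    """Shuffle: group the emitted values by key across all mappers' output."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine the values collected for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}

# Made-up input, already divided into smaller jobs.
chunks = ["google indexes the web", "the web is large", "google ranks the web"]

# Threads stand in for separate machines in this sketch.
with ThreadPoolExecutor(max_workers=3) as pool:
    mapped = list(pool.map(map_phase, chunks))

counts = reduce_phase(shuffle(mapped))
print(counts["web"], counts["google"])
```

    Each chunk is processed independently, which is what makes it possible to hand the chunks to many ordinary machines.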

    I'm not sure how much searching Google actually does; it's more accurately crawling. The difference is that the crawlers start at specific points, follow links to anything reachable, and repeat until they hit some sort of dead end.
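
    The "start at specific points and follow everything reachable" idea is essentially a graph traversal. Here is a minimal breadth-first sketch over an in-memory link graph; the `LINKS` table and the `*.example` page names are invented for the example, and a real crawler would fetch pages over the network instead.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> pages it links to.
LINKS = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],                 # dead end: no outgoing links
    "e.example": ["a.example"],      # nothing links here, so it is never found
}

def crawl(seeds, links):
    """Breadth-first crawl: visit every page reachable from the seed pages once."""
    visited = []
    seen = set(seeds)
    frontier = deque(seeds)
    while frontier:
        page = frontier.popleft()
        visited.append(page)
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return visited

order = crawl(["a.example"], LINKS)
print(order)
```

    The `seen` set is what stops the crawl from looping forever on pages that link back to each other.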

  • 2021-02-02 04:21

    The Anatomy of a Large-Scale Hypertextual Web Search Engine

  • 2021-02-02 04:25

    While I am interested in the PageRank algorithm and similar techniques, I was disturbed to discover that the introduction of personal search at the turn of the year (not widely commented on) seems to change quite a lot. See "Failure of the Google Gold Standard" and "Google’s Personalized Results".

  • 2021-02-02 04:29

    Google's patented PigeonRank™

    Wow, they initially posted this 7 years ago from Wednesday ...

  • 2021-02-02 04:33

    Indexing

    If you want to get down to basics:

    Google uses an inverted index of the Internet. What this means is that Google has an index of all the pages it has crawled, keyed by the terms in each page. For instance, the term Google maps to this page, the Google home page, and the Wikipedia article for Google, amongst others.

    Thus, when you go to Google and type "Google" into the search box, Google checks its index of all terms available on the Internet, finds the entry for the term "Google", and with it the list of all pages that contain that term.
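
    The lookup described above can be sketched in a few lines: build a term-to-pages map once, then answer a query by reading the entry for each term. The page names and text are invented, and `build_index` and `lookup` are illustrative names, not part of any real search API.

```python
from collections import defaultdict

def build_index(pages):
    """Inverted index: map each term to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index

def lookup(index, query):
    """Return the pages containing every query term (intersection of the entries)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Made-up pages for the example.
pages = {
    "google.com": "google search",
    "wikipedia.org/wiki/Google": "google is a technology company",
    "example.org": "an unrelated page",
}
index = build_index(pages)
print(sorted(lookup(index, "Google")))
```

    The query never scans the pages themselves; it only reads the precomputed entry for each term, which is why the lookup stays fast as the collection grows.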

    For veteran users:

    Google's index goes beyond your simple inverted index, however. This is why Google is the best. Google's crawlers (spiders) are smart. Very smart. Beyond just keeping track of the terms that are on any given web page, they also keep track of words that are on related pages and link those to the given document.

    In other words, if a page has the term Google in it and the page has a link to or is linked from another web page, the other page may be referenced in the index under the term Google as well. All this and more go into why a given page is returned for a given query.
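
    One way this could work is to credit the words in a link's text to the page the link points at. The sketch below assumes that approach; the crawl output in `links` and `page_text` is invented, and how Google actually associates terms across pages is not something this answer can confirm.

```python
from collections import defaultdict

# Hypothetical crawl output: (source page, link text, target page).
links = [
    ("blog.example/post", "google maps", "maps.example"),
    ("news.example/story", "street maps", "maps.example"),
]

# Text found on each page itself; note the target page never says "google".
page_text = {
    "blog.example/post": "a post about travel",
    "news.example/story": "a story about cities",
    "maps.example": "interactive directions",
}

index = defaultdict(set)

# Index each page under its own words.
for url, text in page_text.items():
    for term in text.split():
        index[term].add(url)

# Also index each link target under the words used in the link text.
for _source, anchor, target in links:
    for term in anchor.split():
        index[term].add(target)

print(sorted(index["google"]))
```

    Here `maps.example` becomes findable under "google" even though that word appears only on a page that links to it.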

    If you want to go into why pages are ordered the way they are in your search results, that gets into even more interesting stuff.

    Ranking

    To get down to basics:

    Perhaps one of the most basic schemes a search engine can use to sort your results is term frequency-inverse document frequency (tf-idf). Simply put, your results are ordered by the relative importance of your search terms in each document: a term counts for more the more often it occurs in a document relative to that document's length, and the rarer it is across the whole collection. In other words, a document that has 10 pages and mentions the word Google once is not nearly as relevant as a document that has 1 page and mentions the word Google ten times.
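
    A minimal sketch of one common tf-idf variant follows: term frequency is the term's share of the document's words, and inverse document frequency is the natural log of (number of documents / documents containing the term). Other weightings exist; the function name `tf_idf` and the sample documents are made up for the example.

```python
import math

def tf_idf(term, doc, docs):
    """Score one term in one document: term frequency times inverse document frequency."""
    words = doc.lower().split()
    if not words:
        return 0.0
    # Term frequency: share of the document's words that are the term.
    tf = words.count(term) / len(words)
    # Document frequency: how many documents contain the term at all.
    containing = sum(1 for d in docs if term in d.lower().split())
    if containing == 0:
        return 0.0
    idf = math.log(len(docs) / containing)
    return tf * idf

# Made-up collection: a short page dense with the term, a long page
# mentioning it once, and a page that does not mention it.
docs = [
    "google google google search",
    "a long page about many things that mentions google once",
    "a page about cats",
]
scores = [tf_idf("google", d, docs) for d in docs]
print(scores)
```

    The short, dense document scores highest, the long document with a single mention scores lower, and the document without the term scores zero.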

    For veteran users:

    Again, Google does quite a bit more than your basic search engine when it comes to ranking results. Google has implemented the aforementioned, patented PageRank algorithm. In short, PageRank complements tf-idf-style relevance scoring by taking into account the popularity/importance of a given page. Popularity/importance may be judged by any number of factors that Google just won't tell us. However, at the most basic level, Google can tell that one page is more important than another because loads and loads of other pages link to it.
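
    The "many pages link to it" idea can be sketched as a simplified PageRank computed by repeated iteration. This is the textbook formulation with a damping factor of 0.85, which is an assumption here, as are the tiny link graph and the function name `pagerank`; it says nothing about the undisclosed factors mentioned above.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                # A page splits its current rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank

# Made-up graph: three pages link to "hub"; nothing links to "c".
links = {
    "hub": ["a", "b"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))
```

    The page with the most incoming links ends up with the highest score, and the page nobody links to keeps only the baseline share.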

  • 2021-02-02 04:35

    I think "The Anatomy of a Large-Scale Hypertextual Web Search Engine" is a little outdated. Here is a more recent talk about scalability: "Challenges in Building Large-Scale Information Retrieval Systems".
