How do search engines find relevant content?

无人共我 2021-01-29 20:03

How does Google find relevant content when it's parsing the web?

Let's say, for instance, Google uses the PHP native DOM library to parse content. What methods would they use to find the relevant content?

13 Answers
  • 2021-01-29 20:26

    I'm facing the same problem right now, and after some tries I found something that works for creating a webpage snippet (must be fine-tuned):

    • take all the html
    • remove script and style tags inside the body WITH THEIR CONTENT (important)
    • remove unnecessary spaces, tabs, newlines.
    • now navigate through the DOM to catch div, p, article, td (others?) and, for each one:
      • take the HTML of the current element
      • take a "text only" version of the element's content
      • assign this element the score: text length × text length / HTML length
    • now sort all the scores, take the greatest.

    This is a quick (and dirty) way to identify the longest runs of text with relatively little markup, which is what normal content looks like. In my tests this seems to work really well. Just add water ;)

    In addition to this you can search for "og:" meta tags, title and description, h1 and a lot of other minor techniques.
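
    If it helps, here's a minimal PHP sketch of that scoring pass using the native DOM library (the tag list and the score formula come from the steps above; the function name and everything else is just illustrative):

    <?php
    // Minimal sketch of the scoring pass described above.
    // $html holds the raw page; names are illustrative.
    function findMainContent(string $html): ?DOMElement
    {
        $dom = new DOMDocument();
        @$dom->loadHTML($html); // suppress warnings on tag-soup HTML

        // Remove script and style tags WITH their content.
        foreach (['script', 'style'] as $tag) {
            // The node list is live, so collect nodes first, then remove.
            foreach (iterator_to_array($dom->getElementsByTagName($tag)) as $node) {
                $node->parentNode->removeChild($node);
            }
        }

        $best = null;
        $bestScore = 0.0;
        foreach (['div', 'p', 'article', 'td'] as $tag) {
            foreach ($dom->getElementsByTagName($tag) as $el) {
                $htmlLen = strlen($dom->saveHTML($el));
                if ($htmlLen === 0) {
                    continue;
                }
                // Text-only version, with whitespace collapsed.
                $text = trim(preg_replace('/\s+/', ' ', $el->textContent));
                // Score: text length * text length / html length.
                $score = (strlen($text) ** 2) / $htmlLen;
                if ($score > $bestScore) {
                    $bestScore = $score;
                    $best = $el;
                }
            }
        }
        return $best; // the element with the greatest score
    }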

  • 2021-01-29 20:26

    Google also uses a system called PageRank, which examines how many links to a site there are. Let's say you're looking for a C++ tutorial, and you search Google for one. You find one as the top result, and it's a great tutorial. Google knows this because it searched through its cache of the web and saw that everyone was linking to this tutorial while raving about how good it was. Google decides that it's a good tutorial and puts it as the top result.

    It actually does this as it caches everything, giving each page a PageRank based on the links to it, as said before.
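
    As a toy illustration of the idea (definitely not Google's actual implementation; the 0.85 damping factor is the value from the original PageRank paper, and the link graph here is made up):

    <?php
    // Toy PageRank power iteration. $links maps each page to the
    // pages it links out to; all the data here is made up.
    function pageRank(array $links, float $damping = 0.85, int $iterations = 50): array
    {
        $pages = array_keys($links);
        $n = count($pages);
        $rank = array_fill_keys($pages, 1.0 / $n);

        for ($i = 0; $i < $iterations; $i++) {
            $next = array_fill_keys($pages, (1.0 - $damping) / $n);
            foreach ($links as $page => $outLinks) {
                if (count($outLinks) === 0) {
                    continue; // dangling page, ignored in this toy version
                }
                $share = $rank[$page] / count($outLinks);
                foreach ($outLinks as $target) {
                    $next[$target] += $damping * $share;
                }
            }
            $rank = $next;
        }
        return $rank;
    }

    // The tutorial everyone links to ends up with the highest rank.
    print_r(pageRank([
        'tutorial' => ['blogA'],
        'blogA'    => ['tutorial'],
        'blogB'    => ['tutorial'],
        'forum'    => ['tutorial', 'blogA'],
    ]));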

    Hope this helps!

  • 2021-01-29 20:27

    There are lots of highly sophisticated algorithms for extracting the relevant content from a tag soup. If you're looking to build something usable yourself, you could take a look at the source code for readability and port it over to PHP. I did something similar recently (can't share the code, unfortunately).

    The basic logic of readability is to find all block-level tags and count the length of the text in them, not counting children. Then each parent node is awarded a fraction (half) of the weight of each of its children. This is used to find the largest block-level tag that has the largest amount of plain text. From there, the content is further cleaned up.

    It's not bulletproof by any means, but it works well in the majority of cases.
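
    Here's a rough PHP sketch of just that weighting step, based on my reading of the description above (the block-tag list and the function name are my own guesses):

    <?php
    // Rough sketch of the readability-style weighting described above:
    // score each block element by its own text (children excluded),
    // then award each parent half the base weight of each scored child.
    function scoreBlocks(DOMDocument $dom): array
    {
        $elements = [];
        $scores = [];
        foreach (['p', 'div', 'td', 'pre', 'blockquote'] as $tag) {
            foreach ($dom->getElementsByTagName($tag) as $el) {
                $ownText = 0;
                foreach ($el->childNodes as $child) {
                    if ($child instanceof DOMText) {
                        $ownText += strlen(trim($child->nodeValue));
                    }
                }
                $id = spl_object_id($el);
                $elements[$id] = $el;
                $scores[$id] = $ownText;
            }
        }

        // Bubble half of each child's base weight up to its parent.
        $base = $scores;
        foreach ($elements as $id => $el) {
            $parentId = spl_object_id($el->parentNode);
            if (isset($scores[$parentId])) {
                $scores[$parentId] += $base[$id] / 2;
            }
        }
        return $scores; // the highest-scoring block is the content candidate
    }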

  • 2021-01-29 20:28

    I don't work at Google, but around a year ago I read that they had over 200 factors for ranking their search results. Of course, the top factor would be relevance, so your question is quite interesting in that sense.

    What is relevance and how do you calculate it? There are several algorithms, and I bet Google has its own, but the ones I'm aware of are Pearson correlation and Euclidean distance.
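
    For illustration, here's what those two measures look like over simple numeric vectors (just the textbook formulas, nothing Google-specific; the sample data is made up):

    <?php
    // Two similarity measures over equal-length numeric vectors
    // (e.g. term frequencies). Sample data is made up.
    function euclideanDistance(array $a, array $b): float
    {
        $sum = 0.0;
        foreach ($a as $i => $v) {
            $sum += ($v - $b[$i]) ** 2;
        }
        return sqrt($sum); // smaller = more similar
    }

    function pearsonCorrelation(array $a, array $b): float
    {
        $n = count($a);
        $meanA = array_sum($a) / $n;
        $meanB = array_sum($b) / $n;
        $cov = $varA = $varB = 0.0;
        foreach ($a as $i => $v) {
            $cov  += ($v - $meanA) * ($b[$i] - $meanB);
            $varA += ($v - $meanA) ** 2;
            $varB += ($b[$i] - $meanB) ** 2;
        }
        return $cov / sqrt($varA * $varB); // assumes non-constant vectors
    }

    echo euclideanDistance([1, 0, 2], [1, 1, 2]), "\n";  // 1
    echo pearsonCorrelation([1, 2, 3], [2, 4, 6]), "\n"; // 1 (perfectly correlated)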

    A good book I'd suggest on this topic (not necessarily about search engines) is Programming Collective Intelligence by Toby Segaran (O'Reilly). A few samples from the book show how to fetch data from third-party websites via APIs or screen scraping, and how to find similar entries, which is quite nice.

    Anyways, back to Google. Other relevance techniques include full-text searching, of course; you may want to get a good book on MySQL or Sphinx for that. @Chaoley also suggested TSEP, which is quite interesting.

    But really, I know people from a Russian search engine called Yandex, and everything they do is under NDA, so I guess you can get close, but you cannot get it perfect unless you work at Google ;)

    Cheers.

  • 2021-01-29 20:28

    Most search engines look for the title and meta description in the head of the document, then the first heading (h1) and the text content in the body. Image alt attributes and link titles are also considered. Last I read, Yahoo was using the meta keywords tag, but most engines don't.
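
    For what it's worth, here's a quick sketch that pulls those fields out of a page with PHP's native DOM library (the URL and variable names are placeholders):

    <?php
    // Quick sketch: extract the head/body signals mentioned above.
    $html = file_get_contents('https://example.com'); // placeholder URL
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // suppress warnings on messy real-world HTML

    $signals = ['title' => '', 'description' => '', 'h1' => ''];

    $titles = $dom->getElementsByTagName('title');
    if ($titles->length) {
        $signals['title'] = trim($titles->item(0)->textContent);
    }
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $signals['description'] = $meta->getAttribute('content');
        }
    }
    $h1s = $dom->getElementsByTagName('h1');
    if ($h1s->length) {
        $signals['h1'] = trim($h1s->item(0)->textContent);
    }

    // Image alt text and link titles count as signals too.
    foreach ($dom->getElementsByTagName('img') as $img) {
        $signals['alts'][] = $img->getAttribute('alt');
    }
    foreach ($dom->getElementsByTagName('a') as $a) {
        $signals['linkTitles'][] = $a->getAttribute('title');
    }
    print_r($signals);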

    You might want to download the open source files from The Search Engine Project (TSEP) on SourceForge https://sourceforge.net/projects/tsep/ and have a look at how they do it.

  • 2021-01-29 20:28

    There are some good answers on here, but it sounds like they don't answer your question. Perhaps this one will.

    What you're looking for is called Information Retrieval.

    It usually uses the Bag of Words model.

    Say you have two documents:

    DOCUMENT A  
    Seize the time, Meribor. Live now; make now always the most precious time. Now will never come again
    

    and this one

    DOCUMENT B  
    Worf, it was what it was glorious and wonderful and all that, but it doesn't mean anything
    

    and you have a query, or something you want to find other relevant documents for

    QUERY aka DOCUMENT C
    precious wonderful life
    

    Anyways, how do you calculate the most "relevant" of the two documents? Here's how:

    1. tokenize each document (break it into words, removing all non-letters)
    2. lowercase everything
    3. remove stopwords (and, the, etc.)
    4. consider stemming (removing suffixes; see the Porter or Snowball stemming algorithms)
    5. consider using n-grams

    You can count the word frequency to get the "keywords".

    Then, you make one column for each word and calculate the word's importance to the document with respect to its importance in all the documents. This is called the TF-IDF metric.
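
    A minimal PHP sketch of steps 1-3 plus the TF-IDF weighting (the stopword list is tiny and illustrative; stemming and n-grams are left out):

    <?php
    // Tokenize: lowercase, split on non-letters, drop stopwords.
    function tokenize(string $doc): array
    {
        $stopwords = ['and', 'the', 'it', 'was', 'to', 'of']; // tiny illustrative list
        $words = preg_split('/[^a-z]+/', strtolower($doc), -1, PREG_SPLIT_NO_EMPTY);
        return array_values(array_diff($words, $stopwords));
    }

    // TF-IDF for one term in one document, given all documents' token lists.
    function tfIdf(string $term, array $tokens, array $allDocs): float
    {
        // Term frequency: how often the term occurs in this document.
        $tf = array_count_values($tokens)[$term] ?? 0;

        // Document frequency: how many documents contain the term at all.
        $docsWithTerm = 0;
        foreach ($allDocs as $docTokens) {
            if (in_array($term, $docTokens, true)) {
                $docsWithTerm++;
            }
        }
        if ($tf === 0 || $docsWithTerm === 0) {
            return 0.0;
        }
        // Classic formulation: tf * log(N / df).
        return $tf * log(count($allDocs) / $docsWithTerm);
    }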

    Now you have this:

    Doc  precious  worf  life ...
    A    0.5       0.0   0.2
    B    0.0       0.9   0.0
    C    0.7       0.0   0.9
    

    Then you calculate the similarity between the documents using the cosine similarity measure. The document with the highest similarity to DOCUMENT C is the most relevant.
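
    Here's the cosine step in PHP, using the made-up weights from the table above:

    <?php
    // Cosine similarity between two term-weight vectors (word => weight).
    // 1.0 means the same direction, 0.0 means nothing in common.
    function cosineSimilarity(array $a, array $b): float
    {
        $dot = 0.0;
        foreach ($a as $term => $w) {
            $dot += $w * ($b[$term] ?? 0.0);
        }
        $normA = sqrt(array_sum(array_map(fn($w) => $w * $w, $a)));
        $normB = sqrt(array_sum(array_map(fn($w) => $w * $w, $b)));
        return ($normA && $normB) ? $dot / ($normA * $normB) : 0.0;
    }

    // C is much closer to A than to B.
    $A = ['precious' => 0.5, 'worf' => 0.0, 'life' => 0.2];
    $B = ['precious' => 0.0, 'worf' => 0.9, 'life' => 0.0];
    $C = ['precious' => 0.7, 'worf' => 0.0, 'life' => 0.9];
    echo cosineSimilarity($C, $A), "\n"; // ~0.86
    echo cosineSimilarity($C, $B), "\n"; // 0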

    Now, you seem to want to find the most similar paragraphs, so just treat each paragraph as a document, or consider using sliding windows over the document instead.

    You can see my video here. It uses a graphical Java tool, but explains the concepts:

    http://vancouverdata.blogspot.com/2010/11/text-analytics-with-rapidminer-part-4.html

    Here is a decent IR book:

    http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
