How does Google use HTML tags to enhance the search engine?


I know that Google's search algorithm is mainly based on PageRank. However, it also analyzes and uses the structure of the document: H1, H2,


                      


      
      
        
14 answers

        
                         				            
            
           
            
                              
                
              
              
                
                  被撕碎了的回忆        
                
              
                            
                2021-02-20 10:20
              
            
            
                                                                       
I would also suggest looking at microformats and RDF. Both are used to enhance search. They are mostly search-engine agnostic, though some details are engine-specific. For Google-specific guidelines on HTML content, read this link.
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  一向        
                
              
                            
                2021-02-20 10:21
              
            
            
                                                                       
I suggest trying Google Scholar as one of your avenues when looking for academic articles.

semantic search
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  孤城傲影        
                
              
                            
                2021-02-20 10:22
              
            
            
                                                                       
To keep it painfully simple: make your information architecture logical. If the most important elements for user comprehension are highlighted with headings and grouped logically, then the document is easier to interpret using information processing algorithms. Magically, it will also be easier for users to interpret. Remember that search engine algorithms were written by people trying to interpret language.

The basic process is:
Write well-structured HTML, using heading tags to indicate the most critical elements on the page. Use logical tags based on the structure of your information: lists for lists, headings for major topics.

Supply relevant alt attributes and names for any visual elements, and then use simple CSS to arrange these elements.
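
As a rough way to check a page against this advice, here is a minimal sketch in Python that uses only the standard library's html.parser. It pulls out the heading outline and lists images that have no alt text. The names StructureAudit and audit are made up for this example, and it says nothing about how any search engine actually reads a page. It also flags an empty alt, which can be perfectly valid for purely decorative images.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class StructureAudit(HTMLParser):
    """Collects the heading outline and images without alt text (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.outline = []             # list of (level, heading text)
        self.images_missing_alt = []  # src values of images lacking alt text
        self._current_level = None    # level of the heading being read, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._current_level = int(tag[1])
            self._buffer = []
        elif tag == "img":
            attr_map = dict(attrs)
            if not (attr_map.get("alt") or "").strip():
                self.images_missing_alt.append(attr_map.get("src") or "")

    def handle_data(self, data):
        # Only keep text that appears inside a heading element.
        if self._current_level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._current_level is not None:
            # Collapse runs of whitespace in the heading text.
            text = " ".join("".join(self._buffer).split())
            self.outline.append((self._current_level, text))
            self._current_level = None


def audit(html):
    """Return (outline, images_missing_alt) for an HTML string."""
    parser = StructureAudit()
    parser.feed(html)
    parser.close()
    return parser.outline, parser.images_missing_alt
```

Calling audit(page_source) on a page gives you the outline as (level, text) pairs, so you can see at a glance whether the headings alone tell the story of the page.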

If the site works well for users and contains relevant information, you don't risk becoming a blacklisted spammer, and search engine algorithms will favor your page.

I really enjoyed the book Transcending CSS for a clean explanation of properly structured HTML.
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  春和景丽        
                
              
                            
                2021-02-20 10:25
              
            
            
                                                                       
I believe what you are interested in is called structural fingerprinting, which is often used to determine the similarity of two structures. In Google's case, this would mean applying a weight to different tags and feeding them to a secret algorithm that (probably) uses the frequencies of the different elements in the fingerprint. This is deeply rooted in information theory; if you are looking for academic papers on information theory, I would start with "A Mathematical Theory of Communication" by Claude Shannon.
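
To make the idea concrete, here is a minimal sketch in Python of one very simple kind of fingerprint: count how often each tag occurs and compare two pages by the cosine similarity of those counts. This is only an illustration under my own assumptions (it ignores nesting and attributes, and the names fingerprint and cosine_similarity are made up here); it is not a description of what Google does.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags; a deliberately crude structural summary."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def fingerprint(html):
    """Return a Counter mapping tag name -> number of occurrences."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def cosine_similarity(a, b):
    """Cosine similarity of two tag-count fingerprints, in [0, 1]."""
    dot = sum(a[tag] * b[tag] for tag in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Two pages built from the same template score close to 1 regardless of their text, while a page of paragraphs and a page that is one big table score close to 0.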
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  佛祖请我去吃肉        
                
              
                            
                2021-02-20 10:29
              
            
            
                                                                       
I found it interesting that, with no meta keywords or description provided, in a scenario like this:

<p>Some introduction</p>
<h1>headline 1</h1>
<p>text for section one</p>


the "text for section one" is always what is shown on the search result page.
                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  刺人心        
                
              
                            
                2021-02-20 10:31
              
            
            
                                                                       
In short: very carefully. At length:

Quote from The Anatomy of a Large-Scale Hypertextual Web Search Engine:


  [...] This gives us some limited
  phrase searching as long as there are
  not that many anchors for a particular
  word. We expect to update the way that
  anchor hits are stored to allow for
  greater resolution in the position and
  docIDhash fields. We use font size
  relative to the rest of the document
  because when searching, you do not
  want to rank otherwise identical
  documents differently just because one
  of the documents is in a larger
  font. [...]


It goes on:


  [...] Another big difference between
  the web and traditional well controlled collections is that there
  is virtually no control over what
  people can put on the web. Couple
  this flexibility to publish anything
  with the enormous influence of search
  engines to route traffic and companies
  which deliberately manipulating search
  engines for profit become a serious
  problem. This problem that has not
  been addressed in traditional closed
  information retrieval systems. Also,
  it is interesting to note that
  metadata efforts have largely failed
  with web search engines, because any
  text on the page which is not directly
  represented to the user is abused to
  manipulate search engines. [...]


The paper Challenges in Web Search Engines addresses these issues in a more modern fashion:


  [...] Web pages in HTML fall into the middle of this continuum of structure in documents, being neither close to free text nor to well-structured data. Instead HTML markup provides limited structural information, typically used to control layout but providing clues about semantic information. Layout information in HTML may seem of limited utility, especially compared to information contained in languages like XML that can be used to tag content, but in fact it is a particularly valuable source of meta-data in unreliable corpora such as the web. The value in layout information stems from the fact that it is visible to the user [...]


And adds: 


  [...] HTML tags can be analyzed for what semantic information can be inferred. In addition to the header tags mentioned above, there are tags that control the font face (bold, italic), size, and color. These can be analyzed to determine which words in the document the author thinks are particularly important. One advantage of HTML, or any markup language that maps very closely to how the content is displayed, is that there is less opportunity for abuse: it is difficult to use HTML markup in a way that encourages search engines to think the marked text is important, while to users it appears unimportant. For instance, the fixed meaning of the <H1> tag means that any text in an H1 context will appear prominently on the rendered web page, so it is safe for search engines to weigh this text highly. However, the reliability of HTML markup is decreased by Cascading Style Sheets which separate the names of tags from their representation. There has been research in extracting information from what structure HTML does possess. For instance, [Chakrabarti et al., 2001; Chakrabarti, 2001] created a DOM tree of an HTML page and used this information to increase the accuracy of topic distillation, a link-based analysis technique.
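
To illustrate what "weighing this text highly" could look like, here is a minimal sketch in Python that scores each word by the most prominent tag it sits inside. The weights in TAG_WEIGHTS and the names WeightedTerms and score_terms are hypothetical, picked only for the example; no search engine publishes its real values. As the quote warns, CSS can make a tag render differently from what its name suggests, and this sketch does not account for that.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical weights, chosen for illustration only.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "h2": 3.0, "h3": 2.0,
               "b": 1.5, "strong": 1.5, "em": 1.2, "i": 1.2}
DEFAULT_WEIGHT = 1.0
SKIPPED_TAGS = {"script", "style"}                          # never visible as text
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}    # have no closing tag


class WeightedTerms(HTMLParser):
    """Accumulates a score per word based on the tags enclosing it."""

    def __init__(self):
        super().__init__()
        self.scores = Counter()
        self._open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self._open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self._open_tags:
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        if any(tag in SKIPPED_TAGS for tag in self._open_tags):
            return
        # A word takes the weight of its most prominent enclosing tag.
        weight = max((TAG_WEIGHTS.get(tag, DEFAULT_WEIGHT) for tag in self._open_tags),
                     default=DEFAULT_WEIGHT)
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.scores[word] += weight


def score_terms(html):
    """Return a Counter mapping word -> accumulated tag-weighted score."""
    parser = WeightedTerms()
    parser.feed(html)
    parser.close()
    return parser.scores
```

With these made-up weights, a word that appears once in the title and once in an H1 ends up far ahead of a word that appears only in body text, which is the effect the paper describes.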


There are a number of issues a modern search engine needs to combat, for example web spam and black-hat SEO schemes.


Combating Web Spam with TrustRank
Web Spam Taxonomy
Detecting Spam Web Pages through Content Analysis


But even in a perfect world, e.g. after eliminating the bad apples from the index, the web is still an utter mess because no two sites have identical structures. There are maps, games, video, photos (Flickr) and lots and lots of user-generated content. In other words, the web is still very unpredictable.

Resources


Hypertext and the web:


Extracting knowledge from the World Wide Web
Rich media and web 2.0
Thresher: automating the unwrapping of semantic content from the World Wide Web
Information retrieval



                                                                        
                                                        
            
            
              
                
                        
                    
                
          
          	          
   
          