How to set up an XPath query for HTML parsing?

长发绾君心 2021-01-25 05:47
Here is some HTML from http://chem.sis.nlm.nih.gov/chemidplus/rn/75-07-0, as shown in Google Chrome. I want to parse this page for a project.

      
      
        
1 answer

感情败类 (original poster) 2021-01-25 06:04
              

            
            
                        
It is important to inspect the string returned by page.text rather than
rely on the page source shown in your Chrome browser. Web sites can
return different content depending on the User-Agent. Moreover, GUI browsers
such as Chrome may change the content by executing JavaScript, whereas
requests.get does not.

If you write the contents to a file

import requests
page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/rn/75-07-0')
# page.content is bytes, which matches the binary mode 'wb'
with open('/tmp/test', 'wb') as f:
    f.write(page.content)


and use a text editor to search for "yui_3_18_1_3_1434380225687_700"
you'll find that there is no tag with that attribute value.
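
The same check can also be done in Python instead of a text editor. The following is a minimal sketch, not part of the original answer: `marker_present` is a hypothetical helper name, and `fetched_text` is a short stand-in string assumed for illustration in place of the real page.text.

```python
def marker_present(html_text: str, marker: str) -> bool:
    """Return True if marker occurs anywhere in the fetched HTML text."""
    return marker in html_text


# Stand-in for page.text (an assumption; the real response is much longer).
fetched_text = "<h3>Name of Substance</h3><ul><li><div>Acetaldehyde</div></li></ul>"

# The id copied from Chrome's inspector does not occur in the stand-in text,
# while the visible label does.
print(marker_present(fetched_text, "yui_3_18_1_3_1434380225687_700"))
print(marker_present(fetched_text, "Name of Substance"))
```

If a marker copied from the browser is missing from the fetched text, an XPath built on it cannot match, so it is safer to anchor the query on text that is present in both.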

If instead you search for "Name of Substance" you'll find


Search for this InChIKey on the Web
Names and Synonyms
Name of Substance
Acetaldehyde


Therefore, instead you could use:

In [219]: tree.xpath('//*[text()="Name of Substance"]/..//div')[0].text_content()
Out[219]: 'Acetaldehyde'
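
The expression above assumes a parsed `tree` already exists. Here is a minimal sketch of how one might be built with lxml.html.fromstring, with a guard for the case where nothing matches. `SAMPLE_HTML` is hypothetical markup standing in for page.text (the page's real markup is not reproduced here), and `first_div_text` is a name invented for this sketch.

```python
from typing import Optional

import lxml.html

# Hypothetical markup, assumed for illustration only.
SAMPLE_HTML = """
<html><body>
  <h2>Names and Synonyms</h2>
  <div class="ds">
    <h3>Name of Substance</h3>
    <ul><li><div>Acetaldehyde</div></li></ul>
  </div>
</body></html>
"""


def first_div_text(tree, label: str) -> Optional[str]:
    """Text of the first <div> under the parent of the element whose text is label."""
    # $label is an XPath variable bound by lxml from the keyword argument.
    divs = tree.xpath('//*[text()=$label]/..//div', label=label)
    return divs[0].text_content() if divs else None


# With a live response this would be lxml.html.fromstring(page.text).
tree = lxml.html.fromstring(SAMPLE_HTML)
print(first_div_text(tree, "Name of Substance"))
print(first_div_text(tree, "No such label"))
```

Returning None rather than indexing with [0] directly avoids an IndexError when the label is absent from the fetched page.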




How this XPath was found:

Starting from the tag that contains the text "Name of Substance":

In [215]: tree.xpath('//*[text()="Name of Substance"]')
Out[215]: [<Element ... at 0x...>]


The <div> tag that we want is not a child of the tag matched above; it is a descendant of that tag's parent. Therefore, go up to the parent:

In [216]: tree.xpath('//*[text()="Name of Substance"]/..')
Out[216]: [<Element ... at 0x...>]


and then use //div to search for all <div>s inside the parent:

In [217]: tree.xpath('//*[text()="Name of Substance"]/..//div')
Out[217]: 
[<Element div at 0x...>,
 <Element div at 0x...>,
 ...]


The first div is the one that we want:

In [218]: tree.xpath('//*[text()="Name of Substance"]/..//div')[0]
Out[218]: <Element div at 0x...>


and we can extract the text using the text_content method:

In [219]: tree.xpath('//*[text()="Name of Substance"]/..//div')[0].text_content()
Out[219]: 'Acetaldehyde'
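
The steps above can be replayed offline as a script. This is a sketch under stated assumptions: `SAMPLE_HTML` is hypothetical markup invented to mirror the structure the answer describes, so the tag names it yields (such as h3) are properties of the sample, not confirmed facts about the real page.

```python
import lxml.html

# Hypothetical markup, assumed for illustration only.
SAMPLE_HTML = """
<html><body>
  <h2>Names and Synonyms</h2>
  <div class="ds">
    <h3>Name of Substance</h3>
    <ul><li><div>Acetaldehyde</div></li></ul>
  </div>
</body></html>
"""

tree = lxml.html.fromstring(SAMPLE_HTML)

# Step 1: the element whose text is exactly "Name of Substance".
label = tree.xpath('//*[text()="Name of Substance"]')

# Step 2: go up to that element's parent.
parent = tree.xpath('//*[text()="Name of Substance"]/..')

# Step 3: collect every <div> below the parent.
divs = tree.xpath('//*[text()="Name of Substance"]/..//div')

# Step 4: take the first <div> and read its text.
name = divs[0].text_content()

print([el.tag for el in label])
print([el.tag for el in parent])
print([el.tag for el in divs])
print(name)
```

Printing the tag names at each step is a quick way to confirm that the query is moving through the tree as intended before relying on the final [0].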

    
             
                                                        
            
            
              
                