Getting attribute's value using BeautifulSoup

I'm writing a Python script that extracts the script locations from a webpage after parsing it. Let's say there are two scenarios:


                      
3 Answers

Answered by 鱼传尺愫 on 2020-12-30 11:20

This should work: find all the script tags, then check whether each one has a 'src' attribute. If it does, the URL of the JavaScript is the value of the src attribute; otherwise we assume the JavaScript is written inline inside the tag.

#!/usr/bin/env python3

from bs4 import BeautifulSoup

# Test HTML which has both cases
html = '<script type="text/javascript" src="http://example.com/something.js">'
html += '</script>  <script>some JS</script>'

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, 'html.parser')

# Find all script tags
for n in soup.find_all('script'):

    # Check if the src attribute exists, and if it does grab the source URL
    if 'src' in n.attrs:
        javascript = n['src']

    # Otherwise assume that the javascript is contained within the tags
    else:
        javascript = n.text

    print(javascript)


The output of this is:

http://example.com/something.js
some JS
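
The same two-case check can be wrapped in a small helper so the results are easier to reuse. This is only a sketch: the function name `split_scripts` and the sample HTML are made up for illustration, and it assumes Python 3 with the built-in `html.parser`.

```python
from bs4 import BeautifulSoup


def split_scripts(html):
    """Return (external_urls, inline_sources) for the <script> tags in html.

    Hypothetical helper, for illustration only.
    """
    soup = BeautifulSoup(html, "html.parser")
    external, inline = [], []
    for tag in soup.find_all("script"):
        if tag.has_attr("src"):
            # External script: its location is the value of src
            external.append(tag["src"])
        else:
            # Inline script: .string is the single text child, or None if empty
            inline.append(tag.string or "")
    return external, inline


sample = (
    '<script type="text/javascript" src="http://example.com/something.js"></script>'
    "<script>some JS</script>"
)
urls, bodies = split_scripts(sample)
print(urls)
print(bodies)
```

Keeping the two lists separate avoids having to guess later whether a given string is a URL or a piece of JavaScript.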

                                                                        
                                                        
            
            
              
                
Answered by 时光说笑 on 2020-12-30 11:41

Get 'src' from each script node with .get(), which returns None when the attribute is missing.

import requests
from bs4 import BeautifulSoup

r = requests.get("http://rediff.com/")
data = r.text
soup = BeautifulSoup(data, "html.parser")
for n in soup.find_all('script'):
    # .get() returns None for inline scripts that have no src attribute
    print("src:", n.get('src'))
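
To make the difference between `.get('src')` and indexing with `['src']` concrete, here is a minimal sketch on a made-up two-tag document (the HTML and the variable names are assumptions, not taken from the page above).

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one external script and one inline script
html = '<script src="/static/app.js"></script><script>var x = 1;</script>'
soup = BeautifulSoup(html, "html.parser")
external, inline = soup.find_all("script")

print(external.get("src"))          # the attribute value
print(inline.get("src"))            # None, because the attribute is absent
print(inline.get("src", "inline"))  # a fallback value can be supplied

try:
    inline["src"]
except KeyError:
    # Indexing behaves like a dict lookup and fails on a missing attribute
    print("indexing a missing attribute raises KeyError")
```

So `.get()` is the safer choice when some tags may lack the attribute, while indexing is fine once you have already filtered for it.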

                                                                        
                                                        
            
            
              
                
Answered by 长情又很酷 on 2020-12-30 11:43

This collects the src values only from <script> tags that have one; any <script> tag without a src attribute is skipped.

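from urllib.request import urlopen  # Python 3 replacement for urllib2
from bs4 import BeautifulSoup

url = "http://rediff.com/"
page = urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")
# {"src": True} matches only <script> tags that carry a src attribute
sources = soup.find_all('script', {"src": True})
for source in sources:
    print(source['src'])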


I get the following two src values as a result:

http://imworld.rediff.com/worldrediff/js_2_5/ws-global_hm_1.js
http://im.rediff.com/uim/common/realmedia_banner_1_5.js


I guess this is what you want. Hope this is useful.
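
For reference, the attribute filter can be spelled in a few ways that should select the same tags. The sketch below runs on a made-up HTML string instead of a live page, so the URLs are placeholders; `select()` relies on the CSS selector support that ships with recent Beautifulsoup 4 releases.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two external scripts around one inline script
html = (
    '<script src="http://example.com/a.js"></script>'
    "<script>inline code</script>"
    '<script src="http://example.com/b.js"></script>'
)
soup = BeautifulSoup(html, "html.parser")

# Attribute dictionary, as in the answer above
by_dict = [t["src"] for t in soup.find_all("script", {"src": True})]
# Keyword-argument form of the same filter
by_kwarg = [t["src"] for t in soup.find_all("script", src=True)]
# CSS attribute selector
by_css = [t["src"] for t in soup.select("script[src]")]

print(by_dict)
```

All three skip the inline script, so indexing with `["src"]` is safe inside the comprehensions.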
                                                                        
                                                        
            
            
              
                