Getting attribute's value using BeautifulSoup

予麋鹿 2020-12-30 10:57

I'm writing a Python script that extracts script locations after parsing a webpage. Let's say there are two scenarios:

3 Answers
  • 2020-12-30 11:20

    This should work. Find all the script tags, then check whether each one has a 'src' attribute. If it does, the URL of the JavaScript is in the src attribute; otherwise we assume the JavaScript is inside the tag itself.

    #!/usr/bin/env python3
    
    from bs4 import BeautifulSoup
    
    # Test HTML which has both cases
    html = '<script type="text/javascript" src="http://example.com/something.js">'
    html += '</script>  <script>some JS</script>'
    
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(html, 'html.parser')
    
    # Find all script tags
    for n in soup.find_all('script'):
    
        # Check if the src attribute exists, and if it does grab the source URL
        if 'src' in n.attrs:
            javascript = n['src']
    
        # Otherwise assume that the javascript is contained within the tags
        else:
            javascript = n.text
    
        print(javascript)
    

    The output of this is:

    http://example.com/something.js
    some JS
    
  • 2020-12-30 11:41

    Get 'src' from each script node with .get(), which returns None when the tag has no src attribute.

    import requests
    from bs4 import BeautifulSoup
    
    r = requests.get("http://rediff.com/")
    data = r.text
    soup = BeautifulSoup(data, "html.parser")
    for n in soup.find_all('script'):
        # .get() returns None instead of raising KeyError when src is missing
        print("src:", n.get('src'))
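
    The same idea can be tried without a network request. The following is a minimal sketch, assuming a hypothetical helper name (script_sources) and a made-up HTML sample that are not part of the original answer; it only illustrates that .get('src') yields None for inline scripts.

```python
from bs4 import BeautifulSoup


def script_sources(html):
    """Return the src value of every <script> tag, or None for inline scripts.

    script_sources is a hypothetical helper name used only for illustration.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Tag.get() behaves like dict.get(): a missing attribute gives None
    return [tag.get("src") for tag in soup.find_all("script")]


# Made-up sample covering both cases: an external script and an inline one
sample = (
    '<script src="http://example.com/something.js"></script>'
    "<script>some JS</script>"
)

print(script_sources(sample))
# → ['http://example.com/something.js', None]
```

    Filtering out the None entries afterwards leaves only the external script URLs.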
    
  • 2020-12-30 11:43

    This gets the src values only from tags that have one; any <script> tag without a src attribute is skipped.

    from bs4 import BeautifulSoup
    from urllib.request import urlopen
    
    url = "http://rediff.com/"
    page = urlopen(url)
    soup = BeautifulSoup(page.read(), "html.parser")
    # Match only <script> tags that have a src attribute
    sources = soup.find_all('script', {"src": True})
    for source in sources:
        print(source['src'])
    

    I get the following two src values as the result:

    http://imworld.rediff.com/worldrediff/js_2_5/ws-global_hm_1.js
    http://im.rediff.com/uim/common/realmedia_banner_1_5.js
    

    I guess this is what you want. Hope this is useful.
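
    For a version that can be checked offline, here is a small sketch of the same attribute filter run against a static string. The helper name (external_script_urls) and the sample HTML are assumptions made for illustration; src=True is the keyword-argument spelling of the {"src": True} filter used above.

```python
from bs4 import BeautifulSoup


def external_script_urls(html):
    """Return src values for <script> tags that carry a src attribute.

    external_script_urls is a hypothetical helper name used only for illustration.
    """
    soup = BeautifulSoup(html, "html.parser")
    # src=True matches any tag whose src attribute is present
    return [tag["src"] for tag in soup.find_all("script", src=True)]


# Made-up sample: two external scripts with an inline script between them
sample = (
    '<script src="http://example.com/a.js"></script>'
    "<script>inline JS</script>"
    '<script src="http://example.com/b.js"></script>'
)

print(external_script_urls(sample))
# → ['http://example.com/a.js', 'http://example.com/b.js']
```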
