Extract `src` attribute from `img` tag using BeautifulSoup

隐瞒了意图╮ 2020-11-29 09:13

I use bs4 an

4 answers
  • 2020-11-29 09:44

    You can use BeautifulSoup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tags themselves, but the same approach works for a URL when combined with urllib3.

    The solution in the top-rated answer no longer works with Python 3. Here is an implementation that does:

    For URLs

    from bs4 import BeautifulSoup as BSHTML
    import urllib3
    
    http = urllib3.PoolManager()
    url = 'your_url'  # placeholder: replace with the page to fetch
    
    response = http.request('GET', url)
    soup = BSHTML(response.data, "html.parser")
    images = soup.find_all('img')
    
    for image in images:
        # print the image source; .get() returns None instead of raising KeyError
        print(image.get('src'))
        # print the alternate text, if any
        print(image.get('alt'))
    

    For text containing img tags

    from bs4 import BeautifulSoup as BSHTML
    # each img tag must be closed, otherwise the parser may merge the two tags
    htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
    soup = BSHTML(htmlText, "html.parser")
    images = soup.find_all('img')
    for image in images:
        print(image['src'])
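
    The string-based variant above can be wrapped in a small helper. This is only a sketch: the function name image_sources and the sample markup are made up for illustration, and it uses .get() so that a missing src or alt yields None rather than an exception.

```python
from bs4 import BeautifulSoup


def image_sources(html_text):
    """Return (src, alt) pairs for every img tag; missing attributes become None."""
    soup = BeautifulSoup(html_text, "html.parser")
    return [(img.get("src"), img.get("alt")) for img in soup.find_all("img")]


# Hypothetical sample markup: the second tag has no alt attribute.
sample_html = '<img src="https://src1.com/" alt="first"> <img src="https://src2.com/">'
pairs = image_sources(sample_html)
print(pairs)
# [('https://src1.com/', 'first'), ('https://src2.com/', None)]
```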
    
  • 2020-11-29 09:44

    Here is a solution that will not raise a KeyError if an img tag has no src attribute:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    
    site = "[insert name of the site]"
    html = urlopen(site)
    bs = BeautifulSoup(html, 'html.parser')
    
    images = bs.find_all('img')
    for img in images:
        if img.has_attr('src'):
            print(img['src'])
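
    An alternative to the has_attr check is to let a CSS attribute selector do the filtering. A minimal sketch, assuming a bs4 version where select() is available (it relies on the soupsieve package that current bs4 installs as a dependency); the sample markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the middle img tag has no src attribute.
sample_html = '<p><img src="a.png"><img alt="no source"><img src="b.png"></p>'
soup = BeautifulSoup(sample_html, "html.parser")

# "img[src]" matches only img tags that carry a src attribute,
# so indexing with ["src"] cannot raise KeyError here.
sources = [img["src"] for img in soup.select("img[src]")]
print(sources)
# ['a.png', 'b.png']
```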
    
  • 2020-11-29 09:49

    A link (an a tag) has no src attribute; you have to target the img tag inside it.

    import bs4
    
    html = """<div class="someClass">
        <a href="href">
            <img alt="some" src="some"/>
        </a>
    </div>"""
    
    soup = bs4.BeautifulSoup(html, "html.parser")
    
    # this returns the src attribute of the img tag inside the first 'a' tag
    soup.a.img['src']  # 'some'
    
    # if you have more than one 'a' tag
    for a in soup.find_all('a'):
        if a.img:
            print(a.img['src'])
    # prints: some
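
    When some links contain no image, or an image without src, the loop can guard against both cases. The following is a sketch under assumptions: the sample markup and the name linked_images are hypothetical, and pairing each src with the link's href is just one way to present the result.

```python
import bs4

# Hypothetical sample markup: only the first link wraps an img with a src.
sample_html = """<div>
    <a href="/first"><img alt="one" src="one.png"/></a>
    <a href="/second">text only</a>
    <a href="/third"><img alt="no source"/></a>
</div>"""

soup = bs4.BeautifulSoup(sample_html, "html.parser")

linked_images = []
for a in soup.find_all("a"):
    img = a.find("img")  # None when the link contains no img tag
    if img is not None and img.has_attr("src"):
        linked_images.append((a.get("href"), img["src"]))

print(linked_images)
# [('/first', 'one.png')]
```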
    
  • 2020-11-29 09:56

    You can use BeautifulSoup to extract the src attribute of an HTML img tag. In my example, htmlText contains the img tags themselves, but the same approach works for a URL when combined with urllib2. Note that the code below targets Python 2 and BeautifulSoup 3.

    For URLs

    from BeautifulSoup import BeautifulSoup as BSHTML
    import urllib2
    page = urllib2.urlopen('http://www.youtube.com/')
    soup = BSHTML(page)
    images = soup.findAll('img')
    for image in images:
        #print image source
        print image['src']
        #print alternate text
        print image['alt']
    

    For text containing img tags

    from BeautifulSoup import BeautifulSoup as BSHTML
    htmlText = """<img src="https://src1.com/" /> <img src="https://src2.com/" /> """
    soup = BSHTML(htmlText)
    images = soup.findAll('img')
    for image in images:
        print image['src']
    