Python BeautifulSoup img tag parsing

旧巷少年郎 2021-01-06 02:48

I am using BeautifulSoup to parse all the img tags present on 'www.youtube.com'.

The code is:

import urllib2
from BeautifulSoup import BeautifulSoup


        
6 Answers
  • 2021-01-06 03:14

    Explicitly using soup.findAll(name='img') worked for me, and I don't appear to be missing anything from the page.

  • 2021-01-06 03:20

    It seems to work when I try it here:

    import urllib2
    from BeautifulSoup import BeautifulSoup
    page = urllib2.urlopen('http://www.youtube.com/')
    soup = BeautifulSoup(page)
    tags=soup.findAll('img')
    print "\n".join(set(tag['src'] for tag in tags))
    

    This produces the following, which looks OK to me:

    http://i1.ytimg.com/vi/D9Zg67r9q9g/market_thumb.jpg?v=723c8e
    http://s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif
    //s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif
    /gen_204?a=fvhr&v=mha7pAOfqt4&nocache=1337083207.97
    http://i3.ytimg.com/vi/fNs8mf2OdkU/market_thumb.jpg?v=4f85544b
    http://i4.ytimg.com/vi/CkQFjyZCq4M/market_thumb.jpg?v=4f95762c
    http://i3.ytimg.com/vi/fzD5gAecqdM/market_thumb.jpg?v=b0cabf
    http://i3.ytimg.com/vi/2M3pb2_R2Ng/market_thumb.jpg?v=4f0d95fa
    //i2.ytimg.com/vi/mha7pAOfqt4/hqdefault.jpg
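
The output above mixes absolute URLs, protocol-relative URLs (starting with `//`), and root-relative paths (starting with `/`). As a minimal sketch, assuming Python 3 and the standard-library `urllib.parse.urljoin` (neither is used in the answer itself), the raw `src` values could be resolved against the page URL like this. The helper name `absolutize` and the constant `BASE_URL` are hypothetical, introduced only for illustration.

```python
from urllib.parse import urljoin

# Assumption: the URL of the page the src values were scraped from.
BASE_URL = "http://www.youtube.com/"


def absolutize(srcs, base_url=BASE_URL):
    """Return a sorted list of unique absolute URLs built from raw src values."""
    return sorted({urljoin(base_url, src) for src in srcs})


# Three values taken from the output above:
# absolute, protocol-relative, and root-relative.
raw = [
    "http://s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif",
    "//s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif",
    "/gen_204?a=fvhr&v=mha7pAOfqt4&nocache=1337083207.97",
]

for url in absolutize(raw):
    print(url)
```

Resolving before deduplicating also collapses the two `pixel-vfl3z5WfW.gif` entries, which differ only in whether the scheme is written out, into a single URL.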
    
  • 2021-01-06 03:21

    Try this.

    from simplified_scrapy import SimplifiedDoc, req
    url = 'https://www.youtube.com'
    html = req.get(url)
    doc = SimplifiedDoc(html)
    imgs = doc.listImg(url = url)
    print([img.url for img in imgs])
    
    imgs = doc.selects('img')
    for img in imgs:
        print(img)
        print(doc.absoluteUrl(url, img.src))
    
  • 2021-01-06 03:31

    In my case, some images had no src attribute, so I did this to avoid a KeyError exception:

    art_imgs = set(img['src'] for img in article.find_all('img') if img.has_attr('src')) 
    
  • 2021-01-06 03:34

    I had a similar problem: I couldn't find all the images. Here is a piece of code that will give you any attribute value of an image tag.

    from BeautifulSoup import BeautifulSoup as BSHTML
    import urllib2
    page = urllib2.urlopen('http://www.youtube.com/')
    soup = BSHTML(page)
    images = soup.findAll('img')
    for image in images:
        #print image source
        print image['src']
        #print alternate text
        print image['alt']
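
Indexing a tag with `image['src']` or `image['alt']` raises a `KeyError` when the attribute is absent, which is common for `alt`. The following is a minimal sketch of a more tolerant variant, assuming Python 3 and the `beautifulsoup4` package (`bs4`) rather than the older `BeautifulSoup` module used above. `SAMPLE_HTML` and `image_attributes` are hypothetical names, and the markup is made up to stand in for a fetched page.

```python
from bs4 import BeautifulSoup  # assumption: beautifulsoup4 on Python 3

# Hypothetical markup standing in for a fetched page.
SAMPLE_HTML = """
<div>
  <img src="/a.png" alt="first">
  <img src="/b.png">
  <img alt="no source">
</div>
"""


def image_attributes(html):
    """Return (src, alt) pairs for every img tag; missing attributes become None."""
    soup = BeautifulSoup(html, "html.parser")
    # Tag.get() returns None instead of raising KeyError for a missing attribute.
    return [(img.get("src"), img.get("alt")) for img in soup.find_all("img")]


for src, alt in image_attributes(SAMPLE_HTML):
    print(src, alt)
```

Returning `None` for a missing attribute lets the caller decide whether to skip the tag or keep it, instead of the loop stopping at the first incomplete tag.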
    
  • 2021-01-06 03:39

    def grabimagetags():
        import urllib2
        from BeautifulSoup import BeautifulSoup
        page = urllib2.urlopen('http://www.youtube.com/')
        soup = BeautifulSoup(page)
        tags = soup.findAll('img')
        # Collect the unique src values in a local list and return it
        srcs = []
        srcs.extend(set(tag['src'] for tag in tags))
        return srcs

    grabimagetags()

    I would only make this change, so that you can pass the list of img tags around.
