I am using beautifulsoup to parse all img tags which is present in \'www.youtube.com\'
The code is
import urllib2
from BeautifulSoup import Beautiful
Explicitly using soup.findAll(name='img')
worked for me, and I don't appear to be missing anything from the page.
Seems to work when I try it here
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen('http://www.youtube.com/')
soup = BeautifulSoup(page)
tags=soup.findAll('img')
print "\n".join(set(tag['src'] for tag in tags))
Produces this which looks OK to me
http://i1.ytimg.com/vi/D9Zg67r9q9g/market_thumb.jpg?v=723c8e
http://s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif
//s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif
/gen_204?a=fvhr&v=mha7pAOfqt4&nocache=1337083207.97
http://i3.ytimg.com/vi/fNs8mf2OdkU/market_thumb.jpg?v=4f85544b
http://i4.ytimg.com/vi/CkQFjyZCq4M/market_thumb.jpg?v=4f95762c
http://i3.ytimg.com/vi/fzD5gAecqdM/market_thumb.jpg?v=b0cabf
http://i3.ytimg.com/vi/2M3pb2_R2Ng/market_thumb.jpg?v=4f0d95fa
//i2.ytimg.com/vi/mha7pAOfqt4/hqdefault.jpg
Try this.
from simplified_scrapy import SimplifiedDoc, req
url = 'https://www.youtube.com'
html = req.get(url)
doc = SimplifiedDoc(html)
imgs = doc.listImg(url = url)
print([img.url for img in imgs])
imgs = doc.selects('img')
for img in imgs:
print (img)
print (doc.absoluteUrl(url,img.src))
in my case some images didn't contain src
.
so i did this to avoid keyError
exception:
art_imgs = set(img['src'] for img in article.find_all('img') if img.has_attr('src'))
I had the similar problem. I couldn't find all images. So here is the piece of code that will give you any attribute value of an image tag.
from BeautifulSoup import BeautifulSoup as BSHTML
import urllib2
page = urllib2.urlopen('http://www.youtube.com/')
soup = BSHTML(page)
images = soup.findAll('img')
for image in images:
#print image source
print image['src']
#print alternate text
print image['alt']
def grabimagetags():
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen('http://www.youtube.com/')
soup = BeautifulSoup(page)
tags = soup.findAll('img')
list.extend(set(tag['src'] for tag in tags))
return list
grabimagetags()
i would only make this change so that you can pass the list of img tags