Question
I need some feedback on how to count HTML images with Python 3.01 after extracting them; maybe my regular expressions are not being used properly.
Here is my code:
import re, os
import urllib.request

def get_image(url):
    url = 'http://www.google.com'
    total = 0
    try:
        f = urllib.request.urlopen(url)
        for line in f.readline():
            line = re.compile('<img.*?src="(.*?)">')
            if total > 0:
                x = line.count(total)
                total += x
        print('Images total:', total)
    except:
        pass
Answer 1:
A couple of points about your code:
- It's much easier to use a dedicated HTML parsing library to parse your pages (that's the Python way). I personally prefer Beautiful Soup.
- You're overwriting your line variable in the loop.
- total will always be 0 with your current logic.
- No need to compile your RE, as it will be cached by the interpreter.
- You're discarding your exception, so there are no clues about what's going on in the code!
- There could be other attributes on the <img> tags, so your regex is a little basic. Also, use the re.findall() method to catch multiple instances on the same line.
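To make the last point concrete, here is a minimal sketch of counting tags with re.findall() on a string that holds two images on one line, each with extra attributes. The pattern, the function name count_img_tags and the sample markup are assumptions made up for this illustration, not part of the original code.

```python
import re

# Hypothetical pattern for this sketch: "<img", a word boundary, then
# everything up to the next ">". An attribute value containing ">" would
# end the match early, which is one reason a real parser is safer.
IMG_PATTERN = r"<img\b[^>]*>"

def count_img_tags(html):
    """Count <img> tags in a string of HTML, ignoring letter case."""
    return len(re.findall(IMG_PATTERN, html, flags=re.IGNORECASE))

# Made-up sample: two tags on the same line, with attributes around src.
sample = '<p><img src="a.png" alt="A"><IMG class="x" src="b.png"></p>'
print(count_img_tags(sample))  # prints 2
```

Because findall() returns every non-overlapping match, the count no longer depends on how many tags happen to share a line.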
Changing your code around a little, I get:
import re
from urllib.request import urlopen

def get_image(url):
    total = 0
    # readlines() returns the page as a list of bytes objects, one per line
    page = urlopen(url).readlines()
    for line in page:
        # findall() returns every <img ...> tag on the line, not just the first
        hit = re.findall('<img.*?>', str(line))
        total += len(hit)
    print('{0} Images total: {1}'.format(url, total))

get_image("http://google.com")
get_image("http://flickr.com")
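As a possible variation on the code above, the sketch below reports a failed download instead of silently passing, and decodes the whole page once rather than calling str() on each bytes line. The name count_images, the 10-second timeout and the UTF-8 assumption are choices made for this illustration only.

```python
import re
import sys
from urllib.error import URLError
from urllib.request import urlopen

def count_images(url):
    """Return the number of <img> tags at url, or None if the fetch fails."""
    try:
        with urlopen(url, timeout=10) as response:
            # Assumption: the page is UTF-8; undecodable bytes are replaced.
            html = response.read().decode("utf-8", errors="replace")
    except URLError as exc:
        # HTTPError is a subclass of URLError, so HTTP failures land here too.
        print("Could not fetch {0}: {1}".format(url, exc), file=sys.stderr)
        return None
    # Searching the whole document also catches tags split across lines.
    return len(re.findall(r"<img\b[^>]*>", html, flags=re.IGNORECASE))
```

Returning None on failure keeps "the page has no images" (0) distinct from "the page could not be fetched".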
Answer 2:
Using beautifulsoup4 (an HTML parser) rather than a regex:
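import urllib.request
import bs4  # beautifulsoup4

html = urllib.request.urlopen('http://www.imgur.com/').read()
# Name the parser explicitly so bs4 does not have to pick one itself
soup = bs4.BeautifulSoup(html, 'html.parser')
# find_all is the current bs4 spelling of the older findAll
images = soup.find_all('img')
print(len(images))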
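If installing a third-party package is not an option, roughly the same count can be sketched with the standard library's html.parser module. This is a swapped-in technique, not Beautiful Soup; the class name ImgCollector, the helper image_sources and the sample markup are hypothetical names used only for this illustration.

```python
from html.parser import HTMLParser

class ImgCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs. Self-closing tags such as <img/> are also
        # routed here by the default handle_startendtag().
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))

def image_sources(html):
    """Return the src values (None when missing) of all <img> tags in html."""
    parser = ImgCollector()
    parser.feed(html)
    parser.close()
    return parser.sources

# Made-up sample markup for the sketch.
sample = '<p><img src="a.png" alt="A"><IMG class="x" src="b.png"/></p><img>'
print(len(image_sources(sample)))
```

The text passed to image_sources() must be a str, so bytes returned by urlopen().read() would need decoding first.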
Source: https://stackoverflow.com/questions/17395359/counting-html-images-with-python