Counting HTML images with Python

Submitted by 戏子无情 on 2020-01-16 02:54:07

Question


I need some feedback on how to count HTML images with Python 3.01 after extracting them. Maybe my regular expressions are not being used properly.

Here is my code:

import re, os
import urllib.request
def get_image(url):
  url = 'http://www.google.com'
  total = 0
  try:
    f = urllib.request.urlopen(url)
    for line in f.readline():
      line = re.compile('<img.*?src="(.*?)">')
      if total > 0:
        x = line.count(total)
        total += x
        print('Images total:', total)

  except:
    pass

Answer 1:


A couple of points about your code:

  1. It's much easier to use a dedicated HTML parsing library to parse your pages (that's the Python way). I personally prefer Beautiful Soup.
  2. You're overwriting your line variable inside the loop, so the text you just read is lost.
  3. total will always be 0 with your current logic, because it is only updated when it is already greater than 0.
  4. There's no need to compile your regex yourself, as the re module caches recently used patterns.
  5. You're discarding your exception, so you get no clues about what's going wrong in the code!
  6. There could be other attributes in the <img> tags, so your regex is a little basic. Also, use the re.findall() method to catch multiple instances on the same line.

Changing your code around a little, I get:

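import re
from urllib.request import urlopen

def get_image(url):

    total  = 0
    # urlopen() yields bytes; readlines() gives one bytes object per line
    page   = urlopen(url).readlines()

    for line in page:

        # decode each line to text, then collect every <img ...> tag on it
        text  = line.decode('utf-8', errors='replace')
        hit   = re.findall('<img.*?>', text)
        total += len(hit)

    print('{0} Images total: {1}'.format(url, total))

get_image("http://google.com")
get_image("http://flickr.com")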
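To see point 6 without touching the network, here is a minimal sketch that runs the question's pattern and a looser one against a made-up HTML string. The sample markup and the names SAMPLE_HTML, STRICT and LOOSE are assumptions for illustration only, not part of the original post.

```python
import re

# Hypothetical sample markup: three <img> tags, two of them on one line,
# and only the first one ends right after its src attribute.
SAMPLE_HTML = (
    '<p><img src="a.png"><img src="b.png" alt="b"></p>\n'
    '<img class="logo" src="c.png" />\n'
)

# Pattern from the question: expects '">' to close the tag after src.
STRICT = r'<img.*?src="(.*?)">'
# Looser pattern: any <img ...> tag, whatever attributes it carries.
LOOSE = r'<img\b[^>]*>'

strict_hits = re.findall(STRICT, SAMPLE_HTML)
# HTML tag names are case-insensitive, hence re.IGNORECASE.
loose_hits = re.findall(LOOSE, SAMPLE_HTML, flags=re.IGNORECASE)

print(len(strict_hits), len(loose_hits))
print(strict_hits)
```

On this sample the strict pattern finds only two of the three tags, and its second capture swallows the alt attribute along with the src value, while the looser pattern finds all three.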



Answer 2:


Using beautifulsoup4 (an HTML parser) rather than a regex:

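import urllib.request

import bs4  # beautifulsoup4

html = urllib.request.urlopen('http://www.imgur.com/').read()
# name the parser explicitly to avoid bs4's "no parser specified" warning
soup = bs4.BeautifulSoup(html, 'html.parser')
# find_all() is the current name for the older findAll() alias
images = soup.find_all('img')
print(len(images))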
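If installing a third-party package is not an option, the same parser-based idea can be sketched with the standard library's html.parser module instead of Beautiful Soup. This is a swapped-in technique, not what the answer above uses; ImageCounter, count_images and the sample string are hypothetical names made up for this example.

```python
from html.parser import HTMLParser


class ImageCounter(HTMLParser):
    """Counts <img> start tags seen while parsing (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so 'IMG' arrives as 'img'.
        # Self-closing tags such as <img ... /> also end up here.
        if tag == 'img':
            self.count += 1


def count_images(html):
    """Return the number of <img> tags in an HTML string."""
    counter = ImageCounter()
    counter.feed(html)
    counter.close()
    return counter.count


# Made-up sample: two real images plus one inside an HTML comment.
sample = (
    '<p><img src="a.png"><IMG class="logo" src="c.png" /></p>'
    '<!-- <img src="hidden.png"> -->'
)
print(count_images(sample))
```

Unlike a plain regex, the parser skips the tag inside the comment, so only the two real images are counted here.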


Source: https://stackoverflow.com/questions/17395359/counting-html-images-with-python
