Question
I need to do the following:
- take an HTML document
- find every occurrence of an 'img' tag
- take their 'src' attribute
- pass the found URL to processing
- change the 'src' attribute to the new one
- do all of this in Python 2.7
P.S. I've heard about lxml and BeautifulSoup. Which would you recommend for this problem? Or would it be better to use regexes, or something else entirely?
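The steps above can be sketched with nothing but the standard library (a hedged sketch, not a full solution: `process` is a hypothetical stand-in for the real processing step, and under Python 2.7 the parser lives in the `HTMLParser` module rather than `html.parser`):

```python
from html.parser import HTMLParser  # Python 2.7: from HTMLParser import HTMLParser

def process(url):
    # Hypothetical processing step -- replace with the real logic.
    return url + '?processed'

class ImgSrcCollector(HTMLParser):
    """Records each <img> src together with its processed replacement."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.replacements = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    self.replacements.append((value, process(value)))

html = '<p><img src="a.png"> and <img src="b.png"></p>'
collector = ImgSrcCollector()
collector.feed(html)
# Apply the collected replacements to the original markup.
for old, new in collector.replacements:
    html = html.replace('src="%s"' % old, 'src="%s"' % new)
print(html)
```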
Answer 1:
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, for Python 2.7

soup = BeautifulSoup(html_string)
for img in soup.findAll('img'):  # img tags carry the src attribute
    img['src'] = 'New src'
html_string = str(soup)
I don't particularly like BeautifulSoup, but it does the job for you. Try not to over-engineer your solution if you don't have to; this is one of the simpler ways to handle a general problem.
That said, building for the future is equally important, but all six of your requirements boil down to one: "I want to change the 'src' of all images to X."
Answer 2:
Using lxml,
import lxml.html as LH

root = LH.fromstring(html_string)
for el in root.iter('img'):
    el.attrib['src'] = 'new src'
print(LH.tostring(root, pretty_print=True))
Parsing HTML with regex is generally a bad idea. Using an HTML parser like BeautifulSoup or lxml.html is a better idea.
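A quick illustration of the pitfall (a minimal sketch; the filenames are made up). A naive src pattern rewrites every src attribute it sees, including ones in comments and on script tags, not just those on live img tags:

```python
import re

html = ('<img src="photo.png">\n'
        '<!-- <img src="commented-out.png"> -->\n'
        '<script src="app.js"></script>')

# The naive pattern matches the src inside the comment and on the
# script tag as well -- all three occurrences get rewritten.
result = re.sub(r'src="[^"]*"', 'src="new"', html)
print(result)
```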
One attraction of using BeautifulSoup is that it has a familiar Python interface. There are lots of functions for navigation: find_all, find_next, find_previous, find_parent, find_next_siblings, etc.
Another point in favor of BeautifulSoup is that it can sometimes parse broken HTML (by inserting missing closing tags, for example) when lxml cannot; lxml is stricter and simply raises an exception if the HTML is malformed.
In contrast to the multitude of functions provided by the BeautifulSoup API, lxml mainly uses the XPath mini-language for navigation. Navigation using XPath tends to be more succinct than the BeautifulSoup equivalent, but you have to learn XPath. lxml is also much faster than BeautifulSoup.
So if you are just starting out, BeautifulSoup may be easier to pick up, but in the end I believe lxml is more pleasant to work with.
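To get a feel for XPath-style navigation without installing anything, here is a sketch using the standard library's ElementTree, which understands only a small XPath subset and, being an XML parser, requires well-formed markup; with lxml the equivalent query would be root.xpath('//img'):

```python
import xml.etree.ElementTree as ET

# Well-formed markup only: ElementTree is an XML parser, not an HTML one.
root = ET.fromstring(
    '<html><body><p><img src="a.png"/></p><img src="b.png"/></body></html>')

# './/img' is the ElementTree spelling of the XPath query //img:
# it finds img elements at any depth below the current node.
for img in root.findall('.//img'):
    img.set('src', 'new-src')

result = ET.tostring(root, encoding='unicode')
print(result)
```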
Answer 3:
Just throwing this answer in there, if you wanted to use regexes:
html = """
<!doctype html>
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title></title>
</head>
<body>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.10.1/jquery.min.js"></script>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/2.0.1/jquery.min.js"></script>
</body>
</html>
"""
import re
find = re.compile(r'src="[^"]*"')
print find.sub('src="changed"', html)
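Note that the pattern above rewrites every src attribute, including the script tags in this very example. A hedged refinement (still fragile: it assumes double quotes and an inline src) captures the surrounding img markup so only images are touched:

```python
import re

html = '<img src="old.png"> <script src="app.js"></script>'

# Capture everything up to and including src=" plus the closing quote,
# so only <img ...> tags are rewritten; the script tag is left alone.
img_src = re.compile(r'(<img\b[^>]*\bsrc=")[^"]*(")')
result = img_src.sub(r'\1changed\2', html)
print(result)
```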
Answer 4:
Here is an lxml approach:
import lxml.html

filename = 'your_html_filename.html'
document = lxml.html.parse(filename)
tag = 'your_tag_name'
elements = document.xpath('//{}'.format(tag))
for e in elements:
    e.attrib['src'] = 'new value'
result = lxml.html.tostring(document)  # str(document) would not serialize the tree
For your particular problem there is no decisive advantage to either BeautifulSoup or lxml; the choice only matters in the larger context of your project.
Source: https://stackoverflow.com/questions/19357506/python-find-html-tags-and-replace-their-attributes