python: find html tags and replace their attributes [duplicate]


Question


I need to do the following:

  1. take an HTML document
  2. find every occurrence of the 'img' tag
  3. take its 'src' attribute
  4. pass the found URL on for processing
  5. change the 'src' attribute to the new one
  6. do all of this with Python 2.7

P.S. I've heard about lxml and BeautifulSoup. Which would you recommend for solving this, or would it be better to use regexes, or something else entirely?


Answer 1:


from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_string)
for img in soup.findAll('img'):
    img['src'] = 'New src'
html_string = str(soup)

I don't particularly like BeautifulSoup, but it does the job for you. Try not to over-engineer your solution if you don't have to; this is one of the simpler things you can do to solve a fairly general problem.

That said, building for the future is equally important, but all six of your requirements really boil down to one: "I want to change the 'src' of every image to X".
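
Putting the original six steps together, a minimal sketch of that idea could look like the following; process_url is a hypothetical placeholder for whatever your step 4 actually does:

from BeautifulSoup import BeautifulSoup

def process_url(old_src):
    # hypothetical processing step -- swap in your own logic here
    return old_src.replace('http://', 'https://')

soup = BeautifulSoup(html_string)
for img in soup.findAll('img', src=True):   # only <img> tags that actually have a src
    img['src'] = process_url(img['src'])
html_string = str(soup)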




Answer 2:


Using lxml,

import lxml.html as LH
root = LH.fromstring(html_string)     # parse the document
for el in root.iter('img'):           # every <img> element
    el.attrib['src'] = 'new src'
print(LH.tostring(root, pretty_print=True))

Parsing HTML with regex is generally a bad idea. Using an HTML parser like BeautifulSoup or lxml.html is a better idea.

One attraction of using BeautifulSoup is that it has a familiar Python interface. There are lots of functions for navigation: find_all, find_next, find_previous, find_parent, find_next_siblings, etc.
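
For example, a few of those navigation calls in a small sketch (using the newer bs4 package, which is where the snake_case names come from):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><p>intro</p><img src="a.png"/><img src="b.png"/></div>', 'html.parser')
first_img = soup.find('img')                 # first <img> in document order
all_imgs = soup.find_all('img')              # list of every <img>
intro = first_img.find_previous('p')         # nearest preceding <p>
container = first_img.find_parent('div')     # enclosing <div>
print(first_img['src'])                      # a.png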

Another point in favor of BeautifulSoup is that it can sometimes parse broken HTML (by inserting missing closing tags, for example) when lxml cannot; lxml is stricter and may simply raise an exception when the HTML is badly malformed.
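
As a quick illustration of that recovery behaviour (a sketch using the bs4 package):

from bs4 import BeautifulSoup

broken = '<html><body><p>hello <b>world'   # every closing tag is missing
soup = BeautifulSoup(broken, 'html.parser')
print(str(soup))   # the missing </b>, </p>, </body>, </html> are added on output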

In contrast to the multitude of functions in the BeautifulSoup API, lxml mainly relies on the XPath mini-language for navigation. Navigation with XPath tends to be more succinct than the BeautifulSoup equivalent; the catch is that you have to learn XPath. lxml is also much faster than BeautifulSoup.
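
For example, the same kind of lookups expressed as XPath (a brief sketch):

import lxml.html as LH

root = LH.fromstring('<div><p>intro</p><img src="a.png"/><img src="b.png"/></div>')
srcs = root.xpath('//img/@src')                                # every src attribute value
first = root.xpath('(//img)[1]')[0]                            # the first <img> element
starting_with_a = root.xpath('//img[starts-with(@src, "a")]')  # filter on the attribute
print(srcs)                                                    # ['a.png', 'b.png']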

So if you are just starting out, BeautifulSoup may be easier to use instantly, but in the end I believe lxml is more pleasant to work with.




Answer 3:


Just throwing this answer in there in case you want to use regexes (note that the pattern below rewrites the src attribute of every tag, not just img):

html = """
<!doctype html>
<html lang="en-US">
<head>
    <meta charset="UTF-8">
    <title></title>
</head>
<body>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/1.10.1/jquery.min.js"></script>
<script src="http://ajax.googleapis.com/ajax/libs/jquery/2.0.1/jquery.min.js"></script>
</body>
</html>
"""

import re

find = re.compile(r'src="[^"]*"')

print find.sub('src="changed"', html)
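
If you do go the regex route but only want <img> tags touched, a slightly narrower pattern is possible. This is only a sketch and it still inherits every caveat about parsing HTML with regexes (single quotes, attributes split across lines, markup inside comments, and so on); on the sample document above it would leave the script tags alone:

import re

# only match src="..." when it appears inside an <img ...> tag
img_src = re.compile(r'(<img\b[^>]*?\bsrc=")[^"]*(")', re.IGNORECASE)

print img_src.sub(r'\1changed\2', html)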



Answer 4:


Here is an lxml approach:

import lxml.html

filename = 'your_html_filename.html'
document = lxml.html.parse(filename)
tag = 'your_tag_name'
elements = document.xpath('//{}'.format(tag))

for e in elements:
    e.attrib['src'] = 'new value'

result = lxml.html.tostring(document)   # str(document) would only give the object repr

For your particular problem there is no decisive advantage to BeautifulSoup over lxml or the other way around; which one fits better only matters in the wider context of your project.



Source: https://stackoverflow.com/questions/19357506/python-find-html-tags-and-replace-their-attributes
