Remove contents of <style>…</style> tags using html5lib or bleach

天涯浪子 提交于 2019-12-10 10:37:15

问题


I've been using the excellent bleach library for removing bad HTML.

I've got a load of HTML documents which have been pasted in from Microsoft Word, and contain things like:

<STYLE> st1:*{behavior:url(#ieooui) } </STYLE>

Using bleach (with the style tag implicitly disallowed), leaves me with:

st1:*{behavior:url(#ieooui) }

Which isn't helpful. Bleach seems only to have options to:

  • Escape tags;
  • Remove the tags (but not their contents).

I'm looking for a third option - remove the tags and their contents.

Is there any way to use bleach or html5lib to completely remove the style tag and its contents? The documentation for html5lib isn't really a great deal of help.


回答1:


It turned out lxml was a better tool for this task:

from lxml.html.clean import Cleaner

def clean_word_text(text):
    # The only thing I need Cleaner for is to clear out the contents of
    # <style>...</style> tags
    cleaner = Cleaner(style=True)
    return cleaner.clean_html(text)


来源:https://stackoverflow.com/questions/7538600/remove-contents-of-style-style-tags-using-html5lib-or-bleach

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!