Repairing invalid HTML with Nokogiri (removing invalid tags)

Submitted by 半腔热情 on 2019-12-24 06:44:52

Question


I'm trying to tidy some retrieved HTML using the tidy-ext gem. However, it fails when the HTML is quite broken, so I'm trying to repair the HTML using Nokogiri first:

repaired_html = Nokogiri::HTML.parse(a.raw_html).to_html

It seems to do a nice job, but I recently encountered a sample where people had inserted FBML markup such as <fb:like> into the HTML document. Nokogiri somehow preserves it even though it is invalid HTML. Tidy then reports Error: <fb:like> is not recognized! which is understandable.

I'm wondering whether there is an option, like strict or something similar, that forces Nokogiri to include only valid HTML tags and omit everything else.


Answer 1:


You can parse the HTML with Nokogiri's XML parser, which is stricter than the HTML parser. That only helps a little, though, because by default it still runs in recovery mode and applies fixups so that the HTML/XML is at least marginally correct. By adjusting the options you pass to the parser (for example the strict option, which turns recovery off) you can make Nokogiri more rigid, so that it refuses to return an invalid document. Keep in mind that Nokogiri is not a sanitizer or a whitelist for desired tags. Check out Loofah and Sanitize for that functionality.

If your HTML content is in a variable called html, and you do:

doc = Nokogiri::XML.parse(html)

then check doc.errors afterwards to see whether the parser ran into problems. Nokogiri will attempt to fix them, but anything that generated an error is recorded there.

For instance:

Nokogiri::XML('<fb:like></fb:like>').errors
=> [#<Nokogiri::XML::SyntaxError: Namespace prefix fb on like is not defined>]

Nokogiri will attempt to fix up the HTML:

Nokogiri::XML('<fb:like></fb:like>').to_xml
=> "<?xml version=\"1.0\"?>\n<like/>\n"

but it only corrects it to the point of removing the unknown namespace on the tag.

If you want to strip those nodes:

doc = Nokogiri::XML('<fb:like></fb:like>')
doc.search('like').each{ |n| n.remove }
doc.to_xml
=> "<?xml version=\"1.0\"?>\n"


Source: https://stackoverflow.com/questions/11557510/repairing-invalid-html-with-nokogiri-removing-invalid-tags
