Question
I'm trying to tidy some retrieved HTML using the tidy-ext gem. However, it fails when the HTML is quite broken, so I'm trying to repair the HTML using Nokogiri first:
repaired_html = Nokogiri::HTML.parse(a.raw_html).to_html
It seems to do a nice job, but lately I encountered a sample where people had inserted FBML markup into the HTML document, such as <fb:like>, which is somehow preserved by Nokogiri despite being invalid. Tidy then says Error: <fb:like> is not recognized! which is understandable.
I'm wondering if there are any more options, like strict or something, which force Nokogiri to include only valid HTML tags and omit everything else?
Answer 1:
You can parse HTML using Nokogiri's XML parser, which is stricter than the HTML parser, but that only helps a little, because it will still do fixups so the HTML/XML is marginally correct. By adjusting the flags you pass to the parser (for example, turning on strict mode) you can make Nokogiri even more rigid, so it will refuse to return an invalid document. Nokogiri is not a sanitizer or a whitelist for desired tags. Check out Loofah and Sanitize for that functionality.
If your HTML content is in a variable called html, and you do:
doc = Nokogiri::XML.parse(html)
then check doc.errors afterwards to see if you had errors. Nokogiri will attempt to fix them, but anything that generated an error will be flagged there.
For instance:
Nokogiri::XML('<fb:like></fb:like>').errors
=> [#<Nokogiri::XML::SyntaxError: Namespace prefix fb on like is not defined>]
Nokogiri will attempt to fix up the HTML:
Nokogiri::XML('<fb:like></fb:like>').to_xml
=> "<?xml version=\"1.0\"?>\n<like/>\n"
but it only corrects it to the point of removing the unknown namespace on the tag.
If you want to strip those nodes:
doc = Nokogiri::XML('<fb:like></fb:like>')
doc.search('like').each{ |n| n.remove }
doc.to_xml
=> "<?xml version=\"1.0\"?>\n"
Source: https://stackoverflow.com/questions/11557510/repairing-invalid-html-with-nokogiri-removing-invalid-tags