Question
The tidy gem is no longer maintained and has multiple memory leak issues.
Some people suggested using Nokogiri.
I'm currently cleaning the HTML using:
Nokogiri::HTML::DocumentFragment.parse(html).to_html
I've got two issues though:
Nokogiri removes the DOCTYPE.
Is there an easy way to force the cleaned HTML to have html and body tags?
Answer 1:
If you are processing a full document, you want:
Nokogiri::HTML(html).to_html
That will force html and body tags, and introduce or preserve the DOCTYPE:
puts Nokogiri::HTML('<p>Hi!</p>').to_html
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
#=> "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body><p>Hi!</p></body></html>
puts Nokogiri::HTML('<!DOCTYPE html><p>Hi!</p>').to_html
#=> <!DOCTYPE html>
#=> <html><body><p>Hi!</p></body></html>
Note that the output is not guaranteed to be syntactically valid. For example, if I provide a broken document that lies and claims that it is HTML 4.01 Strict, Nokogiri will output a document with that DOCTYPE but without the required <head><title>...</title></head> section:
dtd = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">'
puts Nokogiri::HTML("#{dtd}<p>Hi!</p>").to_html
#=> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
#=> "http://www.w3.org/TR/html4/strict.dtd">
#=> <html><body><p>Hi!</p></body></html>
Answer 2:
The Tidy gem might not be supported, but the underlying tidy app is maintained, and that is what you really need. It's flexible and has quite a list of options.
You can pass HTML to it in many different ways, and define its configuration in a .tidyrc file or pass options on the command line. You could use Ruby's %x{} to pass it a file, or use IO.popen or IO.pipe to treat it as a pipe.
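As a rough sketch of the IO.popen approach, the helper below writes a string to a command's standard input and reads back its standard output. The name pipe_through and the TIDY_COMMAND constant are hypothetical, and the flags shown are just one plausible choice; check them against the tidy version you have installed.

```ruby
# Hypothetical helper: send `input` to an external command's stdin and
# return whatever the command writes to stdout. `command` is an argv-style
# array, so no shell is involved and no quoting is needed.
def pipe_through(command, input)
  IO.popen(command, 'r+') do |io|
    io.write(input)
    io.close_write # signal end of input so the child process can finish
    io.read
  end
end

# Assumed invocation: -q asks tidy to be quiet, and the two config options
# turn off warnings and the generator meta tag. Adjust to taste.
TIDY_COMMAND = ['tidy', '-q', '--show-warnings', 'no', '--tidy-mark', 'no'].freeze

# Example usage (requires the tidy executable on your PATH):
#   cleaned = pipe_through(TIDY_COMMAND, '<p>Hi!')
```

This writes all of the input before reading any output, which is fine for small documents but can block on very large ones once the pipe buffers fill up. Open3.capture2 from the standard library handles that case if you need it.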
Source: https://stackoverflow.com/questions/5584893/cleaning-html-with-nokogiri-instead-of-tidy