I'm currently using the RubyTidy Ruby bindings for HTML tidy to make sure HTML I receive is well-formed. Currently this library is the only thing holding me back from getting a Rails application on Ruby 1.9. Are there any alternative libraries out there that will tidy up chunks of HTML on Ruby 1.9?
tidy_ffi (http://github.com/libc/tidy_ffi/blob/master/README.rdoc) works with Ruby 1.9 (latest version).
If you are working on Windows, you need to set library_path, e.g.:
require 'tidy_ffi'
TidyFFI.library_path = 'lib\\tidy\\bin\\tidy.dll'
tidy = TidyFFI::Tidy.new('test')
puts tidy.clean
(It uses the same DLL as tidy.) The link above gives more usage examples.
I am using Nokogiri to fix invalid HTML:
Nokogiri::HTML::DocumentFragment.parse(html).to_html
Here is a nice example of how to make your html look better using tidy:
require 'tidy'
Tidy.path = '/opt/local/lib/libtidy.dylib' # or wherever your tidylib resides
nice_html = ""
Tidy.open(:show_warnings=>true) do |tidy|
  tidy.options.output_xhtml = true
  tidy.options.wrap = 0
  tidy.options.indent = 'auto'
  tidy.options.indent_attributes = false
  tidy.options.indent_spaces = 4
  tidy.options.vertical_space = false
  tidy.options.char_encoding = 'utf8'
  nice_html = tidy.clean(my_nasty_html_string)
end
# remove excess newlines
nice_html = nice_html.strip.gsub(/\n+/, "\n")
puts nice_html
For more tidy options, check out the man page.
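One caveat about the "remove excess newlines" step above: gsub(/\n+/, "\n") only merges newlines that sit directly next to each other, so a line containing nothing but spaces or tabs survives. A minimal sketch of a stricter variant follows; collapse_blank_lines is a hypothetical helper name, not something tidy provides, and the sample string is made up for illustration.

```ruby
# Hypothetical helper (not part of tidy): collapses runs of empty or
# whitespace-only lines into a single newline. Indentation of the next
# non-blank line is left alone.
def collapse_blank_lines(html)
  html.strip.gsub(/\n(?:[ \t]*\n)+/, "\n")
end

sample = "<div>\n\n  \n<p>x</p>\n\n</div>\n"

# The original one-liner keeps the whitespace-only line:
p sample.strip.gsub(/\n+/, "\n")   # => "<div>\n  \n<p>x</p>\n</div>"

# The variant drops it as well:
p collapse_blank_lines(sample)     # => "<div>\n<p>x</p>\n</div>"
```

Either version also collapses blank lines inside <pre> blocks, so skip this step if that whitespace matters.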
"Currently this library is the only thing holding me back from getting a Rails application on Ruby 1.9."
Watch out: the Ruby Tidy bindings have some nasty memory leaks. They are currently unusable in long-running processes (for the record, I'm using http://github.com/ak47/tidy).
I just had to remove it from a production Rails 2.3 application because it was leaking about 1MB/min.
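If you are stuck with a leaky binding, one generic way to contain the damage is to do the cleaning in a short-lived child process, so whatever it leaks is returned to the OS when the child exits. This is only a sketch of that idea, not something I ran in that application: clean_in_child is a hypothetical helper, the block below is a stand-in for the real tidy call, and it relies on fork, which is not available on Windows or JRuby.

```ruby
# Hypothetical helper: runs the given block in a forked child process and
# returns the String it produces. Memory leaked inside the block is freed
# when the child exits.
def clean_in_child(html)
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    writer.write(yield(html))
    writer.close
    exit!(0) # skip at_exit handlers inherited from the parent
  end
  writer.close
  result = reader.read # read until the child closes its end of the pipe
  reader.close
  _, status = Process.wait2(pid)
  raise "cleaner child failed" unless status.success?
  result
end

# Stand-in block; in real use it would wrap the tidy cleaning call.
puts clean_in_child("<p>hi") { |html| html + "</p>" }   # prints <p>hi</p>
```

Forking on every call has a cost, so this fits batch jobs or low-volume cleaning better than a hot request path.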
Source: https://stackoverflow.com/questions/1308713/html-tidy-cleaning-in-ruby-1-9