I have a script, VBS or Ruby, that saves a Word document as \'Filtered HTML\', but the encoding parameter is ignored. The HTML file is always encoded in Windows-1252. I\'m u
My solution was to open the HTML file using the same character set, as Word used to save it. I also added a whitelist filter (Sanitize), to clean up the HTML. Further cleaning is done using Nokogiri, which Sanitize also rely on.
require 'sanitize'
# ... add some code converting a Word file to HTML.
# Post export cleanup.
html_file = File.open(html_file_name, "r:windows-1252:utf-8")
html = '' + html_file.read()
html_document = Nokogiri::HTML::Document.parse(html)
Sanitize.new(Sanitize::Config::RESTRICTED).clean_node!(html_document)
html_document.css('html').first['lang'] = 'en-US'
html_document.css('meta[name="Generator"]').first.remove()
# ... add more cleaning up of Words HTML noise.
sanitized_html = html_document.to_html({:encoding => 'utf-8', :indent => 0})
# writing output to (new) file
sanitized_html_file_name = word_file_name.sub(/(.*)\..*$/, '\1.html')
File.open(sanitized_html_file_name, 'w:UTF-8') do |f|
f.write sanitized_html
end
HTML Sanitizer: https://github.com/rgrove/sanitize/
HTML parser and modifier: http://nokogiri.org/
In Word 2010 there is a new method, SaveAs2: http://msdn.microsoft.com/en-us/library/ff836084(v=office.14).aspx
I haven't tested SaveAs2, since I don't have Word 2010.