Word Document.SaveAs ignores encoding, when calling through OLE, from Ruby or VBS

挽巷 2021-01-14 14:25

I have a script, VBS or Ruby, that saves a Word document as 'Filtered HTML', but the encoding parameter is ignored. The HTML file is always encoded in Windows-1252. I'm u

3 answers
  •  终归单人心
    2021-01-14 14:48

    My solution was to open the HTML file using the same character set that Word used to save it. I also added a whitelist filter (Sanitize) to clean up the HTML. Further cleaning is done using Nokogiri, which Sanitize also relies on.

    require 'nokogiri'
    require 'sanitize'
    
    # ... add some code converting a Word file to HTML.
    
    # Post-export cleanup.
    # Read the file as Windows-1252 (what Word wrote) and transcode it to UTF-8.
    html = File.open(html_file_name, "r:windows-1252:utf-8") { |f| f.read }
    html_document = Nokogiri::HTML::Document.parse(html)
    # Whitelist filter. clean_node! is the Sanitize 2.x name; later releases
    # appear to have renamed it to node!.
    Sanitize.new(Sanitize::Config::RESTRICTED).clean_node!(html_document)
    html_document.css('html').first['lang'] = 'en-US'
    # Remove Word's Generator meta tag, if it is present.
    generator = html_document.css('meta[name="Generator"]').first
    generator.remove if generator
    
    # ... add more cleaning up of Word's HTML noise.
    
    sanitized_html = html_document.to_html({:encoding => 'utf-8', :indent => 0})
    # Write the output to a (new) file.
    sanitized_html_file_name = word_file_name.sub(/(.*)\..*$/, '\1.html')
    File.open(sanitized_html_file_name, 'w:UTF-8') do |f|
        f.write sanitized_html
    end
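
The transcoding step above can be checked on its own, without Word or Sanitize. The following is only a minimal sketch: the helper name `read_cp1252_as_utf8` and the sample bytes are made up for illustration and are not part of the original script. It writes a few Windows-1252 bytes to a temporary file and reads them back with the same `"r:windows-1252:utf-8"` mode string, where the first encoding is how the file is stored and the second is what Ruby hands back.

```ruby
require 'tempfile'

# Hypothetical helper (not from the original answer): read a file that was
# saved as Windows-1252 and return its contents transcoded to UTF-8.
def read_cp1252_as_utf8(path)
  # Mode string format: "r:<external encoding>:<internal encoding>"
  File.open(path, 'r:windows-1252:utf-8') { |f| f.read }
end

# Simulated Word output as raw Windows-1252 bytes:
# 0xE9 = e with acute accent, 0x96 = en dash, 0x80 = euro sign.
sample_bytes = "caf\xE9 \x96 \x80".b

Tempfile.create(['word_export', '.html']) do |tmp|
  tmp.binmode              # write the bytes untouched
  tmp.write(sample_bytes)
  tmp.flush                # make sure the bytes are on disk before reading

  text = read_cp1252_as_utf8(tmp.path)
  puts text                # the transcoded text
  puts text.encoding       # the encoding Ruby reports for the string
end
```

If the file were instead opened as plain UTF-8, those three bytes would not form valid UTF-8 sequences, which is why the external encoding has to match what Word actually wrote.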
    

    HTML Sanitizer: https://github.com/rgrove/sanitize/

    HTML parser and modifier: http://nokogiri.org/

    In Word 2010 there is a new method, SaveAs2: http://msdn.microsoft.com/en-us/library/ff836084(v=office.14).aspx

    I haven't tested SaveAs2, since I don't have Word 2010.
