Word Document.SaveAs ignores encoding, when calling through OLE, from Ruby or VBS

挽巷 2021-01-14 14:25

I have a script, VBS or Ruby, that saves a Word document as 'Filtered HTML', but the encoding parameter is ignored. The HTML file is always encoded in Windows-1252. I'm u

3 answers
  • 2021-01-14 14:48

    My solution was to open the HTML file using the same character set that Word used to save it, and to transcode it to UTF-8 while reading. I also added a whitelist filter (Sanitize) to clean up the HTML. Further cleaning is done with Nokogiri, which Sanitize also relies on.

    require 'nokogiri'
    require 'sanitize'
    
    # ... add some code converting a Word file to HTML.
    
    # Post export cleanup.
    # Read as Windows-1252 (what Word wrote), transcode to UTF-8, and close the file when done.
    html = File.open(html_file_name, "r:windows-1252:utf-8") { |f| '<!DOCTYPE html>' + f.read }
    html_document = Nokogiri::HTML::Document.parse(html)
    Sanitize.new(Sanitize::Config::RESTRICTED).clean_node!(html_document)
    html_document.css('html').first['lang'] = 'en-US'
    # NodeSet#remove is a no-op when no Generator meta tag is present.
    html_document.css('meta[name="Generator"]').remove
    
    # ... add more cleaning up of Word's HTML noise.
    
    sanitized_html = html_document.to_html({:encoding => 'utf-8', :indent => 0})
    # writing output to (new) file
    sanitized_html_file_name = word_file_name.sub(/(.*)\..*$/, '\1.html')
    File.open(sanitized_html_file_name, 'w:UTF-8') do |f|
        f.write sanitized_html
    end
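
As a small illustration of what the "r:windows-1252:utf-8" mode string above does: the first name is the external encoding (how the bytes on disk are interpreted) and the second is the internal encoding Ruby transcodes to while reading. This is only a sketch; the helper name `read_cp1252_as_utf8` and the sample file contents are made up for the example.

```ruby
require 'tempfile'

# Hypothetical helper: read a Windows-1252 file and return a UTF-8 string.
def read_cp1252_as_utf8(path)
  # "r:<external>:<internal>" - bytes are interpreted as Windows-1252
  # and transcoded to UTF-8 as they are read.
  File.open(path, 'r:windows-1252:utf-8') { |f| f.read }
end

# Made-up sample: "café" as Windows-1252 bytes (é is the single byte 0xE9).
Tempfile.create(['word_export', '.html']) do |tmp|
  tmp.binmode
  tmp.write("caf\xE9".b)
  tmp.close

  text = read_cp1252_as_utf8(tmp.path)
  puts text.encoding        # UTF-8
  puts text.valid_encoding? # true
  puts text.bytes.length    # 5, because é takes two bytes in UTF-8
end
```

Without the explicit external encoding, Ruby would label the bytes with its default external encoding, which is not necessarily Windows-1252.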
    

    HTML Sanitizer: https://github.com/rgrove/sanitize/

    HTML parser and modifier: http://nokogiri.org/

    In Word 2010 there is a new method, SaveAs2: http://msdn.microsoft.com/en-us/library/ff836084(v=office.14).aspx

    I haven't tested SaveAs2, since I don't have Word 2010.

  • 2021-01-14 14:48

    Hi Bo Frederiksen and kardeiz,

    I also encountered the problem of "Word Document.SaveAs ignores encoding" today in my "Word 2003 (11.8411.8202) SP3" version.

    Luckily, I managed to make msoEncodingUTF8 (that is, 65001) work in VBA code. However, I had to change the Word document's settings first. The steps are:

    1) From Word's 'Tools' menu, choose 'Options'.

    2) Then click 'General'.

    3) Press the 'Web Options' button.

    4) In the 'Web Options' dialog that opens, click 'Encoding'.

    5) Use the combo box there to change the encoding, for example from 'GB2312' to 'Unicode (UTF-8)'.

    6) Save the changes and try to rerun the VBA code.

    I hope my answer can help you. Below is my code.

    Public Sub convert2html()
        With ActiveDocument.WebOptions
            .Encoding = msoEncodingUTF8
        End With
    
        ActiveDocument.SaveAs FileName:=ActiveDocument.Path & "\" & "file_name.html", FileFormat:=wdFormatFilteredHTML, Encoding:=msoEncodingUTF8
    
    End Sub
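
Since the question is about driving Word through OLE, here is a rough Ruby translation of the macro above. It is a sketch and has not been verified against Word: it assumes that setting WebOptions.Encoding through OLE behaves the same as in VBA, the helper name `save_as_utf8_filtered_html` is hypothetical, and the two numeric constants are the values of msoEncodingUTF8 (65001, as stated above) and wdFormatFilteredHTML (10). Unlike the macro, it does not also pass the Encoding argument to SaveAs.

```ruby
MSO_ENCODING_UTF8       = 65001 # msoEncodingUTF8
WD_FORMAT_FILTERED_HTML = 10    # wdFormatFilteredHTML

# Hypothetical helper: `document` is expected to be a Word Document OLE
# object, for example the return value of word.Documents.Open(path).
def save_as_utf8_filtered_html(document, html_path)
  # Same idea as the macro: set the web encoding first, then save.
  document.WebOptions.Encoding = MSO_ENCODING_UTF8
  document.SaveAs(html_path, WD_FORMAT_FILTERED_HTML)
end

# Possible usage on Windows with Word installed (paths are placeholders):
#   require 'win32ole'
#   word = WIN32OLE.new('Word.Application')
#   doc  = word.Documents.Open('C:\path\to\file.doc')
#   save_as_utf8_filtered_html(doc, 'C:\path\to\file_name.html')
#   doc.Close
#   word.Quit
```

If the resulting file is still Windows-1252, converting it after the export remains the fallback.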
    
  • 2021-01-14 14:58

    Word can't do this as far as I know.

    However, you could add the following lines to the end of your Ruby script

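    # Name the source encoding explicitly; otherwise Ruby assumes its default external encoding.
    text_as_utf8 = File.read('C:\whatever.html', :encoding => 'Windows-1252').encode('UTF-8')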
    File.open('C:\whatever.html','wb') {|f| f.print text_as_utf8}
    

    If you are on Ruby 1.8, which has no String#encode, you will need to use Iconv instead. If 'C:\whatever.html' contains bytes that cannot be converted, look into the :invalid, :undef and :replace options of String#encode, which control whether such bytes raise an error or get replaced.

    You'll also probably want to update the charset in the HTML meta tag:

    text_as_utf8.gsub!('charset=windows-1252', 'charset=UTF-8')
    

    before you write to the file.
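
Putting these pieces together, a minimal sketch might look like the following. It assumes the file really is Windows-1252, replaces bytes that cannot be converted with '?' through the :invalid, :undef and :replace options of String#encode, and updates the charset in the meta tag before writing. The helper name `convert_to_utf8!` is hypothetical.

```ruby
# Hypothetical helper: rewrite a Word-exported HTML file in place as UTF-8.
def convert_to_utf8!(path)
  # Read the raw bytes, then label them as Windows-1252 (no conversion yet).
  raw = File.binread(path).force_encoding('Windows-1252')

  # Transcode; bytes that cannot be converted become '?' instead of raising.
  text = raw.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')

  # Keep the meta tag in step with the new encoding.
  text = text.gsub(/charset=windows-1252/i, 'charset=UTF-8')

  # Write the UTF-8 bytes back unchanged.
  File.open(path, 'wb') { |f| f.print text }
  text
end
```

It would be called as `convert_to_utf8!('C:\whatever.html')` at the end of the script, after Word has finished writing the file.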
