Word Document.SaveAs ignores encoding, when calling through OLE, from Ruby or VBS

六眼飞鱼酱① submitted on 2019-12-01 08:46:23

Word can't do this as far as I know.

However, you could add the following lines to the end of your Ruby script:

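# Word wrote the file as Windows-1252, so say so explicitly when reading it.
text_as_utf8 = File.read('C:\whatever.html', encoding: 'Windows-1252').encode('UTF-8')
# Write in binary mode so the UTF-8 bytes go out as-is.
File.open('C:\whatever.html', 'wb') { |f| f.print text_as_utf8 }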

If you have an older version of Ruby (1.8, which has no String#encode), you may need to use Iconv instead. If 'C:\whatever.html' contains characters that cannot be converted, look into the :invalid, :undef and :replace options of String#encode.
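As a minimal sketch of those replacement options (the sample byte strings are made up for illustration, not taken from a real Word export): byte 0x81 is unassigned in Windows-1252, so a plain encode raises Encoding::UndefinedConversionError, while the replacement options substitute a placeholder instead.

```ruby
# Illustrative bytes as Word might write them in Windows-1252:
# 0xE9 is "e" with an acute accent, 0x93 and 0x94 are curly double quotes.
raw = "caf\xE9 \x93hi\x94".b

# Tag the bytes with their real encoding, then transcode to UTF-8.
utf8 = raw.force_encoding('Windows-1252').encode('UTF-8')

# 0x81 has no mapping in Windows-1252, so ask encode to replace it
# rather than raise.
dirty = "abc\x81def".b.force_encoding('Windows-1252')
clean = dirty.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')

puts utf8
puts clean
```

Whether '?' is an acceptable placeholder depends on the document; an empty string drops the offending characters instead.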

You'll also probably want to update the charset in the HTML meta tag:

text_as_utf8.gsub!('charset=windows-1252', 'charset=UTF-8')

before you write to the file.
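The exact spelling of the charset declaration may differ between Word versions, so a slightly more tolerant variant could match it case-insensitively. This is only a sketch: `declare_utf8` is a hypothetical helper name, and the sample meta tag is an assumption about what Word emits.

```ruby
# Hypothetical helper: rewrite the first windows-1252 charset declaration,
# ignoring case and an optional quote after the equals sign.
def declare_utf8(html)
  html.sub(/(charset\s*=\s*["']?)windows-1252/i, '\1UTF-8')
end

# Assumed shape of the meta tag in Word's HTML export.
sample = '<meta http-equiv=Content-Type content="text/html; charset=Windows-1252">'
puts declare_utf8(sample)
```

Text without a matching declaration is returned unchanged.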

My solution was to open the HTML file using the same character set that Word used to save it. I also added a whitelist filter (Sanitize) to clean up the HTML. Further cleaning is done using Nokogiri, which Sanitize also relies on.

require 'nokogiri'
require 'sanitize'

# ... add some code converting a Word file to HTML.

# Post export cleanup.
# Read as Windows-1252 (what Word wrote) and transcode to UTF-8 on the fly.
html = '<!DOCTYPE html>' + File.read(html_file_name, mode: 'r:windows-1252:utf-8')
html_document = Nokogiri::HTML::Document.parse(html)
# clean_node! is the Sanitize 2.x API; Sanitize 3 renamed it to node!.
Sanitize.new(Sanitize::Config::RESTRICTED).clean_node!(html_document)
html_document.css('html').first['lang'] = 'en-US'
# Drop Word's Generator meta tag if it is present.
html_document.css('meta[name="Generator"]').remove

# ... add more cleaning up of Word's HTML noise.

sanitized_html = html_document.to_html(:encoding => 'utf-8', :indent => 0)
# Write the output to a (new) file next to the Word file.
sanitized_html_file_name = word_file_name.sub(/(.*)\..*$/, '\1.html')
File.open(sanitized_html_file_name, 'w:UTF-8') do |f|
  f.write sanitized_html
end
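To see what the "r:windows-1252:utf-8" mode string does in isolation, here is a small self-contained sketch. It writes a few Windows-1252 bytes to a temporary file (sample content invented for illustration) and reads them back, letting Ruby transcode from the external to the internal encoding.

```ruby
require 'tempfile'

html = nil
Tempfile.create(['word_export', '.html']) do |tmp|
  tmp.binmode
  # In Windows-1252, 0xEF is "i" with a diaeresis and 0x96 is an en dash.
  tmp.write("<p>na\xEFve \x96 test</p>".b)
  tmp.flush

  # External encoding: windows-1252 (on disk); internal encoding: utf-8 (in Ruby).
  html = File.read(tmp.path, mode: 'r:windows-1252:utf-8')
end

puts html.encoding
puts html
```

The string arrives already tagged and encoded as UTF-8, so no separate encode call is needed before handing it to Nokogiri.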

HTML Sanitizer: https://github.com/rgrove/sanitize/

HTML parser and modifier: http://nokogiri.org/

In Word 2010 there is a new method, SaveAs2: http://msdn.microsoft.com/en-us/library/ff836084(v=office.14).aspx

I haven't tested SaveAs2, since I don't have Word 2010.

Hi Bo Frederiksen and kardeiz,

I also encountered the problem of "Word Document.SaveAs ignores encoding" today in my "Word 2003 (11.8411.8202) SP3" version.

Luckily, I managed to make msoEncodingUTF8 (namely, 65001) work in VBA code. However, I had to change the Word document's settings first. The steps are:

1) From Word's 'Tools' menu, choose 'Options'.

2) Then click 'General'.

3) Press the 'Web Options' button.

4) In the 'Web Options' dialog that pops up, click 'Encoding'.

5) There is a combo box where you can change the encoding, for example from 'GB2312' to 'Unicode (UTF-8)'.

6) Save the changes and try to rerun the VBA code.

I hope my answer can help you. Below is my code.

Public Sub convert2html()
    With ActiveDocument.WebOptions
        .Encoding = msoEncodingUTF8
    End With

    ActiveDocument.SaveAs FileName:=ActiveDocument.Path & "\" & "file_name.html", FileFormat:=wdFormatFilteredHTML, Encoding:=msoEncodingUTF8

End Sub
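
Since the question is about calling Word through OLE from Ruby, the same idea might be sketched as below. This is an untested assumption, not a confirmed fix: it assumes that setting WebOptions.Encoding through OLE has the same effect as in VBA. `save_as_utf8_html` is a hypothetical helper, and the numeric values stand in for the msoEncodingUTF8 and wdFormatFilteredHTML constants, which are not predefined in Ruby.

```ruby
# Numeric values of the Office constants used in the VBA code.
MSO_ENCODING_UTF8       = 65001 # msoEncodingUTF8
WD_FORMAT_FILTERED_HTML = 10    # wdFormatFilteredHTML

# Hypothetical helper: `doc` is expected to be a Word Document OLE object.
def save_as_utf8_html(doc, html_path)
  # Set the document's web encoding first, as the VBA code does.
  doc.WebOptions.Encoding = MSO_ENCODING_UTF8
  # Positional arguments: FileName, FileFormat.
  doc.SaveAs(html_path, WD_FORMAT_FILTERED_HTML)
  html_path
end

# Usage on Windows with Word installed (not run here):
#   require 'win32ole'
#   word = WIN32OLE.new('Word.Application')
#   doc  = word.Documents.Open('C:\whatever.doc')
#   save_as_utf8_html(doc, 'C:\whatever.html')
#   doc.Close
#   word.Quit
```

If the saved file still comes out as Windows-1252, re-encoding it in Ruby afterwards remains a fallback.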