Problems extracting the XML from a Word document in French with Python: illegal characters generated


Over the past few days I have been attempting to create a script which would 1) extract the XML from a Word document, 2) modify that XML, and 3) use the new XML to create an


1 answer

青春惊慌失措

2021-01-15 21:23
The problem is that you are accidentally changing the encoding of word/document.xml when you write it into template2.docx. In template.docx, word/document.xml is encoded as UTF-8 (the default encoding for XML documents), which is why decoding it as UTF-8 works when you read it:
```
xmlString = zip.read("word/document.xml").decode("utf-8")
```
However, when you write it back out for template2.docx, you change the encoding to CP-1252. According to the documentation for open(file, "w"):

In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.

You indicated that calling locale.getpreferredencoding(False) gives you cp1252, which is therefore the encoding word/document.xml is being written in.
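To check this on your own machine, here is a minimal sketch that prints the locale default and compares the encoding of a file opened with and without the encoding argument. The scratch file name probe.xml is made up for illustration, and the printed values depend on your platform.

```python
import locale
import os
import tempfile

# Encoding that open() falls back to in text mode when none is given.
default_encoding = locale.getpreferredencoding(False)
print("locale default:", default_encoding)

# Hypothetical scratch file, used only to inspect the file object's encoding.
path = os.path.join(tempfile.mkdtemp(), "probe.xml")

with open(path, "w") as f:
    implicit_encoding = f.encoding  # platform dependent
with open(path, "w", encoding="utf-8") as f:
    explicit_encoding = f.encoding  # always what was requested

print("implicit:", implicit_encoding)
print("explicit:", explicit_encoding)
```

If the implicit value is anything other than UTF-8, every text-mode write without an encoding argument is affected in the same way.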

Since you did not explicitly add <?xml version="1.0" encoding="cp1252"?> to the beginning of word/document.xml, Word (or any other XML reader) reads it as UTF-8 instead of CP-1252. Accented characters such as é are single bytes in CP-1252, and those bytes do not form valid UTF-8 sequences, which is what gives you the illegal XML character error.
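A small sketch of why French text in particular triggers this: the sample string "Café" below is my own, not taken from your document. It encodes the same text both ways and checks whether each result is valid UTF-8.

```python
# "é" is one byte (0xE9) in CP-1252 but two bytes (0xC3 0xA9) in UTF-8.
text = "Café"  # sample string, chosen only for illustration

cp1252_bytes = text.encode("cp1252")
utf8_bytes = text.encode("utf-8")


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is a valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(cp1252_bytes, decodes_as_utf8(cp1252_bytes))
print(utf8_bytes, decodes_as_utf8(utf8_bytes))
```

Plain ASCII text is byte-for-byte identical in both encodings, which is why the same script can appear to work on an English document.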

So you want to specify the encoding as UTF-8 when writing by using the encoding argument to open():
```
# Write the modified XML back as UTF-8 so it matches what Word expects.
with open(os.path.join(tmpDir, "word/document.xml"), "w", encoding="UTF-8") as f:
    f.write(xmlString)
```
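As an alternative, you could skip the text-mode file entirely and write encoded bytes straight into the new archive, so the locale never comes into play. This is only a sketch of that idea, not your original script: rewrite_document_xml and the transform callback are names I made up, and the demo uses a tiny in-memory zip as a stand-in for a real .docx.

```python
import io
import zipfile


def rewrite_document_xml(src: bytes, transform) -> bytes:
    """Copy a .docx-like zip, passing word/document.xml through transform."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(src)) as zin, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")
                # Encode explicitly; no locale-dependent default is involved.
                data = transform(xml).encode("utf-8")
            zout.writestr(item, data)
    return out.getvalue()


# Stand-in archive with a single member (a real .docx has many more parts).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("word/document.xml", "<w:t>Résumé</w:t>".encode("utf-8"))

demo = rewrite_document_xml(
    buf.getvalue(), lambda xml: xml.replace("Résumé", "Élève")
)

with zipfile.ZipFile(io.BytesIO(demo)) as z:
    result = z.read("word/document.xml").decode("utf-8")
print(result)
```

All other members of the archive are copied through unchanged, so only word/document.xml is re-encoded.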