Problems extracting the XML from a Word document in French with Python: illegal characters generated

日久生厌 2021-01-15 20:55

Over the past few days I have been attempting to create a script which would 1) extract the XML from a Word document, 2) modify that XML, and 3) use the new XML to create an

1 answer
  • 2021-01-15 21:23

    The problem is that you are accidentally changing the encoding of word/document.xml in template2.docx. The original word/document.xml (from template.docx) is encoded as UTF-8, which is the default encoding for XML documents, so decoding it as UTF-8 works:

    xmlString = zip.read("word/document.xml").decode("utf-8")
    

    However, when you write it back out for template2.docx, you change the encoding to CP-1252. According to the documentation for open(file, "w"):

    In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.

    You indicated that calling locale.getpreferredencoding(False) gives you cp1252, so that is the encoding in which word/document.xml is being written.
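
    To see which encoding open() will fall back to on a given machine, a quick check along these lines may help. This is only a sketch; the variable name default_encoding is an illustrative choice, and the printed value depends on the platform and locale settings.

```python
import locale

# Encoding that open() uses in text mode when no `encoding` argument is given.
# On the asker's machine this reportedly returns "cp1252".
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```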

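    Since you did not explicitly add <?xml version="1.0" encoding="cp1252"?> to the beginning of word/document.xml, Word (or any other XML reader) will read it as UTF-8 instead of CP-1252. French accented characters such as é are stored as single bytes in CP-1252, and those bytes do not form valid UTF-8 sequences, which is what gives you the illegal XML character error.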
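
    The mismatch can be reproduced without Word at all. The following sketch uses a single French character as an assumed sample (it is not taken from the original document) and shows that its CP-1252 bytes cannot be read back as UTF-8:

```python
text = "é"  # assumed sample character; any accented letter behaves similarly

cp1252_bytes = text.encode("cp1252")  # one byte: b'\xe9'
utf8_bytes = text.encode("utf-8")     # two bytes: b'\xc3\xa9'

# A reader that expects UTF-8 cannot decode the CP-1252 byte.
try:
    cp1252_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

print(cp1252_bytes, utf8_bytes, decoded_ok)
```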

    So you want to specify the encoding as UTF-8 when writing by using the encoding argument to open():

    with open(os.path.join(tmpDir, "word/document.xml"), "w", encoding="UTF-8") as f:
        f.write(xmlString)
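
    As an alternative, one could skip the temporary text file and write the encoded bytes straight into the new archive, so that no platform default encoding is involved. This is a rough sketch and not the asker's original workflow: the function name replace_document_xml and its parameters are hypothetical, and it assumes the only part that needs changing is word/document.xml.

```python
import zipfile


def replace_document_xml(src_path, dst_path, new_xml):
    """Copy a .docx archive, swapping in new_xml for word/document.xml.

    Hypothetical helper: new_xml is a str and is always encoded as UTF-8.
    """
    with zipfile.ZipFile(src_path) as src, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/document.xml":
                # Encode explicitly; bytes are written as-is, so the
                # locale's preferred encoding never comes into play.
                data = new_xml.encode("utf-8")
            dst.writestr(item, data)
```

    It could then be called as replace_document_xml("template.docx", "template2.docx", xmlString), using the file names from the question.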
    