How can I save an edited Word document with Python?

Question

I am attempting to create a script which can extract the XML from a Word document, modify it, and finally save the new Word document, all using Python. Here's the code I used, which was effectively stolen from here:

import zipfile
import os
import tempfile
import shutil


def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString = str(zip.read("word/document.xml"))
    return xmlString

def createNewDocx(originalDocx,xmlContent,newFilename):
    tmpDir = tempfile.mkdtemp()
    zip = zipfile.ZipFile(open(originalDocx,"rb"))
    zip.extractall(tmpDir)
    with open(os.path.join(tmpDir,"word/document.xml"),"w") as f:
        f.write(xmlContent)
    filenames = zip.namelist()
    zipCopyFilename = newFilename
    with zipfile.ZipFile(zipCopyFilename,"w") as docx:
        for filename in filenames:
            docx.write(os.path.join(tmpDir,filename),filename)
    shutil.rmtree(tmpDir)

One important difference between my code and Virantha's is that he expressed createNewDocx as a class. Unfortunately I don't know what classes are or how they work, so I figured it would be easier to write a function instead.

getXml extracts the XML from a Word document. I tried it out on a test document (named test.docx) and it worked well. In theory, createNewDocx is supposed to take the original docx file (in this case, test.docx) and the modified XML as a string, and create a new Word document named newFilename.

As a test, I ran createNewDocx with the original XML to see if I would get a copy of test.docx. That is, I ran

originalXml = getXml("test.docx")
createNewDocx("test.docx",originalXml,"test2.docx")

This did indeed create a Word document entitled "test2.docx", but when I tried to open the file it just wouldn't open; Word would just crash.

Does anyone know how I can modify my code to make it work?

EDIT: I decided to include originalXml in case there's some problem with how it's formatted.

b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"><w:body><w:p w:rsidR="00000000" w:rsidRDefault="00971B91"><w:r><w:t>You owe me ${debt}. Pay back soon.</w:t></w:r></w:p><w:p w:rsidR="00971B91" w:rsidRPr="00971B91" w:rsidRDefault="00971B91"><w:pPr><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr></w:pPr><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t xml:space="preserve">You owe me </w:t></w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:b/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t>${debt}</w:t></w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t xml:space="preserve">. 
Pay back </w:t></w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:i/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t>soon.</w:t></w:r></w:p><w:sectPr w:rsidR="00971B91" w:rsidRPr="00971B91"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>'

EDIT2: I looked more closely at the XML above and realized that there was an unusual "b'" at the beginning and a stray closing quote at the end. I removed these anomalies and ran the code again. Now Word is giving me a more sensible error, namely that there's a problem with "line 1, column 56." That corresponds to the literal "\r\n" in the XML above.

So obviously my code isn't extracting the XML properly. Anyone know how to fix this?

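Answer 1:

By calling str() on zip.read("word/document.xml"), you convert a bytes object to its printed representation, so the leading b' and the closing quote are kept as characters, and the line ending becomes the literal text "\r\n".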

def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString = str(zip.read("word/document.xml"))
    return xmlString

That is why Word rejects the file: xmlString holds the printed form of the bytes rather than the XML itself. Remove the str() call and decode the bytes before returning:

def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString = zip.read("word/document.xml")
    return xmlString.decode('utf-8')
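
To see the difference without a Word file at hand, here is a small sketch comparing the two conversions on a made-up bytes value (the sample XML and the variable names are only for illustration, not taken from the question's document):

```python
# A made-up stand-in for what zip.read("word/document.xml") returns.
raw = b'<?xml version="1.0"?>\r\n<w:t>hi</w:t>'

# str() on bytes gives the printed representation: it starts with b'
# and the line ending becomes the four characters \ r \ n.
as_repr = str(raw)

# decode() gives the actual text, with a real carriage return and newline.
as_text = raw.decode("utf-8")

print(as_repr)
print(as_text)
```

The first print shows the whole value on one line with the b' prefix; the second shows the XML as real text split over two lines.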

Hope this is helpful to others!
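
One more thing that may be worth checking, though this is an assumption on my part and not something I tested against the question's file: createNewDocx writes the XML with open(..., "w"), which uses the platform's default encoding and newline translation. On Windows that can turn "\r\n" into "\r\r\n" and may mangle non-ASCII characters. A minimal sketch that avoids the temporary directory and writes UTF-8 bytes directly is below; readDocumentXml and writeDocxCopy are hypothetical names, not part of the original code.

```python
import zipfile


def readDocumentXml(docxFilename):
    # Read the main document part and decode it as UTF-8 text.
    with zipfile.ZipFile(docxFilename) as archive:
        return archive.read("word/document.xml").decode("utf-8")


def writeDocxCopy(originalDocx, xmlContent, newFilename):
    # Copy every member of the original archive into a new one,
    # replacing only word/document.xml with the edited XML.
    with zipfile.ZipFile(originalDocx) as src, \
            zipfile.ZipFile(newFilename, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            if name == "word/document.xml":
                # Encode explicitly so the bytes do not depend on the platform.
                dst.writestr(name, xmlContent.encode("utf-8"))
            else:
                dst.writestr(name, src.read(name))
```

Usage would then look like writeDocxCopy("test.docx", readDocumentXml("test.docx"), "test2.docx"), with any string edits applied to the XML in between.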

Source: https://stackoverflow.com/questions/27492790/how-can-i-save-an-edited-word-document-with-python

Tags

python

xml

docx

zipfile