Question
I am attempting to create a script which can extract the XML from a Word document, modify it, and finally save the new Word document, all using Python. Here's the code I used, which was effectively stolen from here:
import zipfile
import os
import tempfile
import shutil

def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString = str(zip.read("word/document.xml"))
    return xmlString

def createNewDocx(originalDocx,xmlContent,newFilename):
    tmpDir = tempfile.mkdtemp()
    zip = zipfile.ZipFile(open(originalDocx,"rb"))
    zip.extractall(tmpDir)
    with open(os.path.join(tmpDir,"word/document.xml"),"w") as f:
        f.write(xmlContent)
    filenames = zip.namelist()
    zipCopyFilename = newFilename
    with zipfile.ZipFile(zipCopyFilename,"w") as docx:
        for filename in filenames:
            docx.write(os.path.join(tmpDir,filename),filename)
    shutil.rmtree(tmpDir)
One important difference between my code and Virantha's is that he expressed createNewDocx as a class. Unfortunately I don't know what classes are or how they work, so I figured it would be easier to write a function instead.
getXml extracts the XML from a Word document. I tried it out on a test document (named test.docx) and it worked well. In theory, createNewDocx is supposed to take the original docx file (in this case, test.docx) and the modified XML as a string, and create a new Word document whose name is given by newFilename.
As a test, I ran createNewDocx with the original XML to see if I would get a copied version of test.docx. That is, I ran
originalXml = getXml("test.docx")
createNewDocx("test.docx",originalXml,"test2.docx")
This did indeed create a Word document entitled "test2.docx", but when I tried to open the file it just wouldn't open; Word would just crash.
Does anyone know how I can modify my code to make it work?
EDIT: I decided to include originalXml in case there's some problem with how it's formatted.
b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"><w:body><w:p w:rsidR="00000000" w:rsidRDefault="00971B91"><w:r><w:t>You owe me ${debt}. Pay back soon.</w:t></w:r></w:p><w:p w:rsidR="00971B91" w:rsidRPr="00971B91" w:rsidRDefault="00971B91"><w:pPr><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr></w:pPr><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t xml:space="preserve">You owe me </w:t></w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:b/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t>${debt}</w:t></w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t xml:space="preserve">. 
Pay back </w:t></w:r><w:r><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:i/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr><w:t>soon.</w:t></w:r></w:p><w:sectPr w:rsidR="00971B91" w:rsidRPr="00971B91"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>'
EDIT2: I looked more closely at the XML code above and realized that there was an unusual "b'" at the beginning and a stray closing quote at the end. I removed these anomalies and ran the code again. Now Word is giving me a more sensible error, namely that there's a problem with "line 1, column 56." That corresponds to the "\r\n" in the XML code above.
So obviously my code isn't extracting the XML properly. Anyone know how to fix this?
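Answer 1:
By calling str() on zip.read("word/document.xml"), you convert a bytes object to its string representation, so the b'...' wrapper and escapes such as \r\n end up as literal characters in the string: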
def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString = str(zip.read("word/document.xml"))
    return xmlString
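To see the effect in isolation, here is a minimal sketch. The bytes literal is a made-up stand-in for what zip.read() returns, not the asker's actual document:

```python
# Hypothetical stand-in for the bytes returned by ZipFile.read().
raw = b'<?xml version="1.0"?>\r\n<doc/>'

as_repr = str(raw)             # string representation of the bytes object
as_text = raw.decode("utf-8")  # the characters the bytes actually encode

print(as_repr[:2])           # the first two characters are b and a quote
print("\\r\\n" in as_repr)   # the line ending became four literal characters
print("\r\n" in as_text)     # decoding keeps a real carriage return + line feed
```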
That is why "xmlString" is not valid XML: it is the printed form of a bytes object rather than the decoded text. You have to remove the str() call and decode the bytes before returning:
def getXml(docxFilename):
    zip = zipfile.ZipFile(open(docxFilename,"rb"))
    xmlString = zip.read("word/document.xml")
    return xmlString.decode('utf-8')
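The writing side may need the same care. The original createNewDocx opens document.xml with "w" in text mode, which uses the platform's default encoding and, on Windows, translates newlines. What follows is only a sketch of one way to round-trip the file while staying in bytes, not the method from the linked article. The names get_xml and create_new_docx are hypothetical renamings of the functions in the question, and the archive is copied member by member instead of through a temporary directory:

```python
import zipfile


def get_xml(docx_filename):
    """Return word/document.xml from a .docx file as decoded text."""
    with zipfile.ZipFile(docx_filename) as archive:
        return archive.read("word/document.xml").decode("utf-8")


def create_new_docx(original_docx, xml_content, new_filename):
    """Copy every archive member to new_filename, swapping in xml_content."""
    with zipfile.ZipFile(original_docx) as src, \
            zipfile.ZipFile(new_filename, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename == "word/document.xml":
                # Encode explicitly so no default encoding or newline
                # translation is applied to the XML.
                data = xml_content.encode("utf-8")
            else:
                data = src.read(item.filename)
            # Members are written in their original order.
            dst.writestr(item.filename, data)
```

With these two helpers, the test from the question would become create_new_docx("test.docx", get_xml("test.docx"), "test2.docx").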
Hope this is helpful for others!
Source: https://stackoverflow.com/questions/27492790/how-can-i-save-an-edited-word-document-with-python