How can I insert a checkbox form into a .docx file using python-docx?

I've been using python to implement a custom parser and use that parsed data to format a word document to be distributed internally. All of the formatting has been straightforward and easy so far but I'm completely stumped on how to insert a checkbox into individual table cells.

I've tried using the python object functions within python-docx (using get_or_add_tcPr(), etc.) which causes MS Word to throw the following error when I try to open the file, "The file xxxx cannot be opened because there are problems with the contents Details: The file is corrupt and cannot be opened".

After struggling with this for a while I moved to a second approach: manipulating the word/document.xml file of the output doc. I've retrieved what I believe to be the correct XML for a checkbox, saved as replacementXML, and have inserted filler text into the cells to act as a tag that can be searched for and replaced, searchXML. The following runs under Python in a Linux (Fedora 25) environment, but Word displays the same error when I try to open the document; this time, however, the document is recoverable and reverts to the filler text. I've been able to get this to work with a manually made document and an empty table cell, so I believe it should be possible. NOTE: I've included the whole XML element for the table cell in the searchXML variable, but I've also tried regular expressions and shorter strings rather than an exact match, as I know the element could differ from cell to cell.

import os
import re

searchXML = r'<w:tc><w:tcPr><w:tcW w:type="dxa" w:w="4320"/><w:gridSpan w:val="2"/></w:tcPr><w:p><w:pPr><w:jc w:val="right"/></w:pPr><w:r><w:rPr><w:sz w:val="16"/></w:rPr><w:t>IN_CHECKB</w:t></w:r></w:p></w:tc>'

def addCheckboxes(): 
    os.system("mkdir unzipped")
    os.system("unzip tempdoc.docx -d unzipped/")

    with open('unzipped/word/document.xml', encoding="ISO-8859-1") as file:
        filedata = file.read()

    rep_count = 0
    while re.search(searchXML, filedata):
        filedata = replaceXML(filedata, rep_count)
        rep_count += 1

    with open('unzipped/word/document.xml', 'w') as file:
        file.write(filedata)

    os.system("zip -r ../buildcfg/tempdoc.docx unzipped/*")
    os.system("rm -rf unzipped")

def replaceXML(filedata, rep_count):
    # rep_count is an int, so convert it before concatenating it into the XML.
    n = str(rep_count)
    # Adjacent string literals are joined, so no whitespace ends up between tags.
    replacementXML = (
        '<w:tc><w:tcPr><w:tcW w:w="4320" w:type="dxa"/><w:gridSpan w:val="2"/></w:tcPr>'
        '<w:p w:rsidR="00D2569D" w:rsidRDefault="00FD6FDF"><w:pPr><w:jc w:val="right"/></w:pPr>'
        '<w:r><w:rPr><w:sz w:val="16"/></w:rPr><w:fldChar w:fldCharType="begin"><w:ffData>'
        '<w:name w:val="Check1"/><w:enabled/><w:calcOnExit w:val="0"/>'
        '<w:checkBox><w:sizeAuto/><w:default w:val="0"/></w:checkBox></w:ffData></w:fldChar></w:r>'
        '<w:bookmarkStart w:id="' + n + '" w:name="Check' + n + '"/>'
        '<w:r><w:rPr><w:sz w:val="16"/></w:rPr><w:instrText xml:space="preserve"> FORMCHECKBOX </w:instrText></w:r>'
        '<w:r><w:rPr><w:sz w:val="16"/></w:rPr></w:r>'
        '<w:r><w:rPr><w:sz w:val="16"/></w:rPr><w:fldChar w:fldCharType="end"/></w:r>'
        '<w:bookmarkEnd w:id="' + n + '"/></w:p></w:tc>'
    )
    # Replace only the first remaining placeholder cell.
    filedata = re.sub(searchXML, replacementXML, filedata, count=1)

    return filedata
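
A note on the zip step above: `zip -r … unzipped/*` stores every entry under an `unzipped/` prefix, whereas a .docx package keeps `[Content_Types].xml` and `word/document.xml` at the archive root. The file is also read as ISO-8859-1 but written back with the platform default encoding. Either could contribute to the error described, though neither is confirmed here. The following is a minimal sketch of the same search-and-replace step using only the standard-library `zipfile` module. The function name `rewrite_docx_part` and its `transform` callback are hypothetical, and it assumes the part is UTF-8 encoded and that the source and destination paths differ.

```python
import zipfile


def rewrite_docx_part(src_path, dst_path, part_name, transform):
    """Copy a .docx, passing the text of one part through `transform`.

    Entry names are copied unchanged, so parts such as
    word/document.xml stay at the root of the new archive.
    """
    with zipfile.ZipFile(src_path) as src:
        with zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for item in src.infolist():
                data = src.read(item.filename)
                if item.filename == part_name:
                    # Assumption: the part is UTF-8. Decode and re-encode
                    # with the same codec so non-ASCII text survives.
                    data = transform(data.decode("utf-8")).encode("utf-8")
                dst.writestr(item, data)
```

It could be called as `rewrite_docx_part("tempdoc.docx", "out.docx", "word/document.xml", fn)`, where `fn` takes the XML text and returns the modified text.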

I have a strong feeling that there is a much simpler (and correct!) way of doing this through the python-docx library but for some reason I can't seem to get it right.

Is there a way to easily insert checkbox fields into a table cell in an MS Word doc? And if yes, how would I do that? If no, is there a better approach than manipulating the .xml file?

UPDATE: I've been able to inject XML into the document successfully using python-docx, but the checkbox and the added XML are not appearing.

I've added the following XML into a table cell:

<w:tc>
  <w:tcPr>
    <w:tcW w:type="dxa" w:w="4320"/>
    <w:gridSpan w:val="2"/>
  </w:tcPr>
  <w:p>
    <w:r>
      <w:bookmarkStart w:id="0" w:name="testName">
        <w:complexType w:name="CT_FFCheckBox">
          <w:sequence>
            <w:choice>
              <w:element w:name="size" w:type="CT_HpsMeasure"/>
              <w:element w:name="sizeAuto" w:type="CT_OnOff"/>
            </w:choice>
            <w:element w:name="default" w:type="CT_OnOff" w:minOccurs="0"/>
            <w:element w:name="checked" w:type="CT_OnOff" w:minOccurs="0"/>
          </w:sequence>
        </w:complexType>
      </w:bookmarkStart>
      <w:bookmarkEnd w:id="0" w:name="testName"/>
    </w:r>
  </w:p>
</w:tc>

by using the following python-docx code:

run = p.add_run()
tag = run._r
start = docx.oxml.shared.OxmlElement('w:bookmarkStart')
start.set(docx.oxml.ns.qn('w:id'), '0')
start.set(docx.oxml.ns.qn('w:name'), n)
tag.append(start)

ctype = docx.oxml.OxmlElement('w:complexType')
ctype.set(docx.oxml.ns.qn('w:name'), 'CT_FFCheckBox')
seq = docx.oxml.OxmlElement('w:sequence')
choice = docx.oxml.OxmlElement('w:choice')
el = docx.oxml.OxmlElement('w:element')
el.set(docx.oxml.ns.qn('w:name'), 'size')
el.set(docx.oxml.ns.qn('w:type'), 'CT_HpsMeasure')
el2 = docx.oxml.OxmlElement('w:element')
el2.set(docx.oxml.ns.qn('w:name'), 'sizeAuto')
el2.set(docx.oxml.ns.qn('w:type'), 'CT_OnOff')

choice.append(el)
choice.append(el2)

el3 = docx.oxml.OxmlElement('w:element')
el3.set(docx.oxml.ns.qn('w:name'), 'default')
el3.set(docx.oxml.ns.qn('w:type'), 'CT_OnOff')
el3.set(docx.oxml.ns.qn('w:minOccurs'), '0')
el4 = docx.oxml.OxmlElement('w:element')
el4.set(docx.oxml.ns.qn('w:name'), 'checked')
el4.set(docx.oxml.ns.qn('w:type'), 'CT_OnOff')
el4.set(docx.oxml.ns.qn('w:minOccurs'), '0')

seq.append(choice)
seq.append(el3)
seq.append(el4)

ctype.append(seq)
start.append(ctype)

end = docx.oxml.shared.OxmlElement('w:bookmarkEnd')
end.set(docx.oxml.ns.qn('w:id'), '0')
end.set(docx.oxml.ns.qn('w:name'), n)
tag.append(end)

I can't find a reason for the XML not being reflected in the output document, but I will update with whatever I find.

Crudough

I've finally been able to accomplish this after lots of digging and help from @scanny.

Checkboxes can be inserted into any paragraph in python-docx using the following function. I am inserting a checkbox into specific cells in a table.

import docx

def addCheckbox(para, box_id, name):
    """Append a FORMCHECKBOX form field to the paragraph `para`."""
    # Run 1: field "begin" marker carrying the binary field data.
    run = para.add_run()
    tag = run._r
    fld = docx.oxml.shared.OxmlElement('w:fldChar')
    fld.set(docx.oxml.ns.qn('w:fldCharType'), 'begin')
    fldData = docx.oxml.shared.OxmlElement('w:fldData')

    fldData.text = '/////2UAAAAUAAYAQwBoAGUAYwBrADEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'
    fldData.set(docx.oxml.ns.qn('xml:space'), 'preserve')
    fld.append(fldData)
    tag.append(fld)

    # Run 2: start of the bookmark that names the checkbox.
    run2 = para.add_run()
    tag2 = run2._r
    start = docx.oxml.shared.OxmlElement('w:bookmarkStart')
    start.set(docx.oxml.ns.qn('w:id'), str(box_id))
    start.set(docx.oxml.ns.qn('w:name'), name)
    tag2.append(start)

    # Run 3: the field instruction itself.
    run3 = para.add_run()
    tag3 = run3._r
    instr = docx.oxml.OxmlElement('w:instrText')
    instr.text = 'FORMCHECKBOX'
    tag3.append(instr)

    # Run 4: field "end" marker.
    run4 = para.add_run()
    tag4 = run4._r
    fld2 = docx.oxml.shared.OxmlElement('w:fldChar')
    fld2.set(docx.oxml.ns.qn('w:fldCharType'), 'end')
    tag4.append(fld2)

    # Run 5: end of the bookmark.
    run5 = para.add_run()
    tag5 = run5._r
    end = docx.oxml.shared.OxmlElement('w:bookmarkEnd')
    end.set(docx.oxml.ns.qn('w:id'), str(box_id))
    end.set(docx.oxml.ns.qn('w:name'), name)
    tag5.append(end)

The fldData.text value looks random, but it was taken from the XML generated by a Word document with an existing checkbox. The function fails if this text is not set. I have not confirmed it, but I have heard of one case where a developer changed the string arbitrarily, and once the document was saved it reverted to the original generated value.
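
The string is base64, so it can be decoded to see what it carries. The sketch below decodes only the first 32 characters of the value used above (a multiple of four, so the prefix decodes on its own). Reading bytes 10 and 11 as a length field is an assumption on my part; the byte layout is not documented in this thread.

```python
import base64

# First 32 characters of the fldData text from addCheckbox above.
FLD_DATA_PREFIX = "/////2UAAAAUAAYAQwBoAGUAYwBrADEA"

raw = base64.b64decode(FLD_DATA_PREFIX)

# Bytes 12..23 are readable as UTF-16LE text.
embedded_name = raw[12:24].decode("utf-16-le")

# Assumption: the two bytes before the text hold its length in characters.
name_length = int.from_bytes(raw[10:12], "little")

print(embedded_name)  # Check1
print(name_length)    # 6
```

So the value appears to embed the field name "Check1" rather than being arbitrary, which would be consistent with Word regenerating it on save.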

The key thing with these workaround functions is to have an example of XML that works, and to be able to compare the XML you generate. If you generate XML that matches the working example, it will work every time. opc-diag is handy for inspecting the XML in a Word document. Working with really small documents (like single paragraph or two-row table, for analysis purposes) makes it a lot easier to work out how Word is structuring the XML.
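
One way to do that comparison is sketched below using only the standard library: both fragments are re-indented one element per line and then diffed. The helper names `pretty` and `xml_diff` are hypothetical, and each fragment is assumed to declare the `w:` namespace on its root element, since the parser rejects unbound prefixes.

```python
import difflib
from xml.dom import minidom

W = 'xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"'


def pretty(xml_text):
    """Return the XML re-indented, one element per line."""
    return minidom.parseString(xml_text).toprettyxml(indent="  ")


def xml_diff(working_xml, generated_xml):
    """Unified diff of two XML fragments after re-indenting both."""
    return list(difflib.unified_diff(
        pretty(working_xml).splitlines(),
        pretty(generated_xml).splitlines(),
        "working", "generated", lineterm="",
    ))


working = f'<w:r {W}><w:fldChar w:fldCharType="begin"/></w:r>'
generated = f'<w:r {W}><w:fldChar w:fldCharType="end"/></w:r>'

for line in xml_diff(working, generated):
    print(line)
```

An empty diff means the two fragments are identical after re-indenting; any `-`/`+` pair points at the element to fix.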

An important thing to note is that the XML elements in a Word document are sequence sensitive, meaning the child elements within any other element generally have a set order in which they must appear. If you get this swapped around, you get the "repair" error you mentioned.
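
As a small illustration, the CT_FFCheckBox definition quoted in the question gives the child order for w:checkBox: size or sizeAuto first, then default, then checked. The sketch below checks only that ordering (not occurrence counts) with the standard-library ElementTree. The helper `children_in_order` and the rank table are hypothetical names for this example.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Rank of each allowed child of w:checkBox, read off the CT_FFCheckBox
# definition quoted in the question: (size | sizeAuto), default, checked.
CHECKBOX_ORDER = {"size": 0, "sizeAuto": 0, "default": 1, "checked": 2}


def children_in_order(element, order):
    """True if every child is known and ranks never decrease."""
    ranks = []
    for child in element:
        local_name = child.tag.split("}", 1)[-1]  # drop "{namespace}"
        if local_name not in order:
            return False
        ranks.append(order[local_name])
    return ranks == sorted(ranks)


good = ET.fromstring(
    f'<w:checkBox xmlns:w="{W_NS}"><w:sizeAuto/><w:default w:val="0"/></w:checkBox>'
)
bad = ET.fromstring(
    f'<w:checkBox xmlns:w="{W_NS}"><w:default w:val="0"/><w:sizeAuto/></w:checkBox>'
)

print(children_in_order(good, CHECKBOX_ORDER))  # True
print(children_in_order(bad, CHECKBOX_ORDER))   # False
```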

I find it much easier to manipulate the XML from within python-docx, as it takes care of all the unzipping and rezipping for you, along with a lot of the other details.

To get the sequencing right, you'll need to be familiar with the XML Schema specifications for the elements you're working with. There is an example here: http://python-docx.readthedocs.io/en/latest/dev/analysis/features/text/paragraph-format.html

The full schema is in the code tree under ref/xsd/. Most of the elements for text are in the wml.xsd file (wml stands for WordProcessing Markup Language).

You can find examples of other so-called "workaround functions" by searching for "python-docx workaround function". Pay particular attention to the parse_xml() and OxmlElement() functions, which allow you to create new XML subtrees and individual elements respectively. XML elements can be positioned using the regular lxml.etree._Element methods, since all XML elements in python-docx are based on lxml: http://lxml.de/api/lxml.etree._Element-class.html
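
For the subtree route, the XML usually starts life as a string template. The sketch below builds one for a checkbox paragraph, modeled on the field XML shown in the question (ffData inside the "begin" fldChar). The function name `checkbox_paragraph_xml` is hypothetical, reusing one name for both the form field and the bookmark is an assumption, and the output is only checked for well-formedness here, not opened in Word.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def checkbox_paragraph_xml(box_id, name):
    """Return the XML text of a <w:p> holding one FORMCHECKBOX field."""
    attr_name = quoteattr(name)       # escaped and wrapped in quotes
    attr_id = quoteattr(str(box_id))
    return (
        f'<w:p xmlns:w="{W_NS}">'
        '<w:r><w:fldChar w:fldCharType="begin">'
        f'<w:ffData><w:name w:val={attr_name}/><w:enabled/>'
        '<w:calcOnExit w:val="0"/>'
        '<w:checkBox><w:sizeAuto/><w:default w:val="0"/></w:checkBox>'
        '</w:ffData></w:fldChar></w:r>'
        f'<w:bookmarkStart w:id={attr_id} w:name={attr_name}/>'
        '<w:r><w:instrText xml:space="preserve"> FORMCHECKBOX </w:instrText></w:r>'
        '<w:r><w:fldChar w:fldCharType="end"/></w:r>'
        f'<w:bookmarkEnd w:id={attr_id}/>'
        '</w:p>'
    )


# Parsing the result confirms the template is well-formed XML.
root = ET.fromstring(checkbox_paragraph_xml(3, "Check3"))
print(len(root))  # 5
```

With python-docx available, a string like this could be passed to parse_xml() and the children of the resulting element appended to an existing paragraph's `_p` element.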

Source: https://stackoverflow.com/questions/46524872/how-can-i-insert-a-checkbox-form-into-a-docx-file-using-python-docx

Tags: python, xml, checkbox, python-docx