Page number python-docx

血红的双手。 提交于 2019-12-17 12:41:32

问题


I am trying to create a program in python that can find a specific word in a .docx file and return page number that it occurred on. So far, in looking through the python-docx documentation I have been unable to find how do access the page number or even the footer where the number would be located. Is there a way to do this using python-docx or even just python? Or if not, what would be the best way to do this?


回答1:


Short answer is no, because the page breaks are inserted by the rendering engine, not determined by the .docx file itself.

However, certain clients place a <w:lastRenderedPageBreak> element in the saved XML to indicate where they broke the page last time it was rendered.

I don't know which do this (although I expect Word itself does) and how reliable it is, but that's the direction I would recommend if you wanted to work in Python. You could potentially use python-docx to get a reference to the lxml element you want (like w:document/w:body) and then use XPath commands or something to iterate through to a specific page, but just thinking it through a bit it's going to be some detailed development there to get that working.

If you work in the native Windows MS Office API you might be able to get something better since it actually runs the Word application.

If you're generating the documents in python-docx, those elements won't be placed because it makes no attempt to render the document (nor is it ever likely to). We're also not likely to add support for w:lastRenderedPageBreak anytime soon; I'm not even quite sure what that would look like.

If you search on 'lastRenderedPageBreak' and/or 'python-docx page break' you'll see other questions/answers here that may give a little more.




回答2:


Using Python-docx: identify a page break in paragraph

from docx import Document
fn='1.doc'
document = Document(fn)
pn=1    
import re
for p in document.paragraphs:
    r=re.match('Chapter \d+',p.text)
    if r:
        print(r.group(),pn)
    for run in p.runs:
        if 'w:br' in run._element.xml and 'type="page"' in run._element.xml:
            pn+=1
            print('!!','='*50,pn)


来源:https://stackoverflow.com/questions/36193159/page-number-python-docx

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!