Python-docx: identify a page break in paragraph

Question

I iterate over a document paragraph by paragraph, then split each paragraph's text into sentences on '. ' (a dot followed by a space). Splitting into sentences makes the text search more effective than searching the whole paragraph text.

The code then looks for errors in each word of a sentence; the errors come from an error-correction database. Simplified code is shown below:

from docx.enum.text import WD_BREAK

for paragraph in document.paragraphs:
    sentences = paragraph.text.split('. ')
    for sentence in sentences:
        words = sentence.split(' ')
        for word in words:
            # error_dictionary is assumed to map each known error to its correction
            for error, correction in error_dictionary.items():
                if error in word:
                    # (A) make a simple replacement
                    word = word.replace(error, correction, 1)
                    # (B) alternative replacement based on runs
                    for run in paragraph.runs:
                        if error in run.text:
                            run.text = run.text.replace(error, correction, 1)
                        # here we might fetch a page-break attribute and, knowing the
                        # current page number, work out on which page the replacement happened
                        if run.page_break == WD_BREAK:  # the check I am asking about
                            current_page_number += 1
                    replace_counter += 1
                    # write to a report which sentence and which page;
                    # for that I need to detect page breaks
                    write_report(error, correction, sentence, current_page_number)

The problem is how to tell whether a run (or another paragraph element) contains a page break. Does run.page_break == WD_BREAK work? @scanny has shown how to add a page break, but how do I detect one?

Ideally, I would also be able to detect line breaks in a paragraph.

I could do this:

br_counter = 0
for run in paragraph.runs:
    # br_lst holds the w:br child elements of the run (empty list if none)
    for br in run._element.br_lst:
        br_counter += 1
        print(br.type)

Yet this code finds only hard breaks, that is, breaks inserted with Ctrl+Enter. Soft page breaks are not detected. (A soft page break occurs when the user keeps typing until the current page runs out and the text flows on to the next page.)

Any hints?

Answer 1:

There is no way to detect soft page breaks from a .docx file. The position of those is known only to the rendering engine and is not reflected in the .docx file itself. If you search here for '[python-docx] page break' or '[python-docx] TOC' you'll find a more elaborate explanation of this.

As to the first part of your question, this page from the technical analysis section of the python-docx documentation shows what breaks look like in the underlying XML:
https://python-docx.readthedocs.io/en/latest/dev/analysis/features/text/breaks.html#specimen-xml

There is no API support yet for explicitly finding breaks, although the run.text property indicates them with a \n line-feed character. The \n doesn't distinguish line breaks from page breaks however.

If you need to get more specific, you'll need to dig into the XML under each run and look for the specific break (w:br) elements you're interested in and their attributes:

>>> run._element.xml
<w:r>
  <w:t>Text before</w:t>
  <w:br/>
  <w:t>and after line break</w:t>
</w:r>

The run._element.br_lst approach you mention is a good one. From there you just need to examine the attributes of each w:br to see whether it has a w:type= attribute; w:type="page" marks a page break, while a w:br with no type is a plain line break.
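As a rough illustration of examining each w:br, here is a minimal sketch that works on the XML string of a single run, such as the one `run._element.xml` returns. It uses only the standard library, so it does not depend on any particular python-docx version. The helper name `classify_breaks` and the sample XML are made up for the example; the sample includes the `xmlns:w` declaration that the abbreviated listing above leaves out.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace that the "w:" prefix stands for
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def classify_breaks(run_xml):
    """Return the kinds of break found in one run's XML, in document order.

    Hypothetical helper, not part of python-docx.
    """
    kinds = []
    for el in ET.fromstring(run_xml).iter():
        if el.tag == f"{{{W_NS}}}br":
            # A w:br without w:type is a plain line break ("textWrapping")
            kinds.append(el.get(f"{{{W_NS}}}type", "textWrapping"))
        elif el.tag == f"{{{W_NS}}}lastRenderedPageBreak":
            kinds.append("lastRenderedPageBreak")
    return kinds


# Made-up sample run: one line break followed by one hard page break
SAMPLE_RUN = (
    f'<w:r xmlns:w="{W_NS}">'
    "<w:t>Text before</w:t>"
    "<w:br/>"
    "<w:t>and after line break</w:t>"
    '<w:br w:type="page"/>'
    "</w:r>"
)

print(classify_breaks(SAMPLE_RUN))  # ['textWrapping', 'page']
```

With a real document you would presumably call it as `classify_breaks(run._element.xml)` for each run in `paragraph.runs`.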

Answer 2:

For soft and hard page breaks I now use the following:

for run in paragraph.runs:
    if 'lastRenderedPageBreak' in run._element.xml:
        print('soft page break found at run:', run.text[:20])
    if 'w:br' in run._element.xml and 'type="page"' in run._element.xml:
        print('hard page break found at run:', run.text[:20])
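To get from these checks to the running page number the question asks for, something along the following lines might work. It is only a sketch under assumptions: the w:lastRenderedPageBreak markers are a cache written by Word the last time it laid out the document, so they can be stale or missing entirely in files produced by other tools. I also assume (worth verifying on your own files) that Word tends to write such a marker right after a hard page break as well, which is why hard breaks are not counted by default. The helper name `page_numbers` and the sample runs are hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
SOFT = f"{{{W_NS}}}lastRenderedPageBreak"
BR = f"{{{W_NS}}}br"
TYPE = f"{{{W_NS}}}type"


def page_numbers(run_xmls, count_hard_breaks=False, first_page=1):
    """Yield, for each run, the page count reached once that run has been read.

    Hypothetical helper: run_xmls is an iterable of XML strings, one per run,
    for example (run._element.xml for run in paragraph.runs).
    """
    page = first_page
    for run_xml in run_xmls:
        for el in ET.fromstring(run_xml).iter():
            if el.tag == SOFT:
                page += 1
            elif count_hard_breaks and el.tag == BR and el.get(TYPE) == "page":
                # Only enable this if your files lack rendered-break markers;
                # otherwise a hard break may be counted twice (assumption).
                page += 1
        yield page


# Made-up runs: plain text, a rendered (soft) break, then a hard break
RUNS = [
    f'<w:r xmlns:w="{W_NS}"><w:t>first page</w:t></w:r>',
    f'<w:r xmlns:w="{W_NS}"><w:lastRenderedPageBreak/><w:t>second page</w:t></w:r>',
    f'<w:r xmlns:w="{W_NS}"><w:t>end of page</w:t><w:br w:type="page"/></w:r>',
]

print(list(page_numbers(RUNS)))                          # [1, 2, 2]
print(list(page_numbers(RUNS, count_hard_breaks=True)))  # [1, 2, 3]
```

Parsing the elements rather than testing substrings also avoids a false positive when a run contains a w:br and some unrelated attribute that happens to read type="page".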

Source: https://stackoverflow.com/questions/53084249/python-docx-identify-a-page-break-in-paragraph

Tags

python

python-docx

page-break