Question
I have a Word document (.docx file) consisting of 10 pages, with one paragraph on each page, where each page/paragraph is separated by a page break. I want to read the text in the docx file and split it at the page breaks.
I am able to read the text with the python-docx library, but I am not sure how to split it at the page breaks. I can see a similar question, but its solution was proposed using the old python-docx library.
Here's the code for reading the text from the docx file:
from docx import Document
paratextlist = Document("ex.docx")
docText = '\n'.join([
    paragraph.text for paragraph in paratextlist.paragraphs
])
Answer 1:
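You can use a regex to search for the form feed character \f, I think. This only helps if the extracted string actually contains \f at each page break, which is worth checking first, since the text returned by python-docx may not include it.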
import re
# 'text' is the document string, e.g. docText from the question
pattern = re.compile(r"\f")
matches = pattern.finditer(text)
for match in matches:
    print(f"Page break occurs at character {match.start()}")
If 'text' is your document string, this prints the position of each page break in the string. You could then break the string up using those indices.
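Slicing between consecutive match positions is one way to turn those indices into per-page chunks. This is a minimal sketch, assuming the string really does contain \f at each page break; the helper name split_on_form_feed and the sample string are made up for illustration.

```python
import re


def split_on_form_feed(text):
    """Return the pieces of `text` that lie between form feed characters."""
    pages = []
    start = 0
    for match in re.finditer(r"\f", text):
        # Everything from the previous break up to this one is one page.
        pages.append(text[start:match.start()])
        start = match.end()
    pages.append(text[start:])  # text after the last break
    return pages


# Hypothetical sample: three "pages" separated by form feeds.
text = "Page one.\fPage two.\fPage three."
pages = split_on_form_feed(text)
print(pages)
```

For a plain single-character delimiter like this, text.split("\f") gives the same list; the loop is only useful if you also want the positions.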
This could probably be adapted to work with the Document object, but I'm not 100% sure how.
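An alternative that does not depend on \f appearing in the extracted string is to look for explicit page-break elements in the file itself. In a .docx, a manual page break is stored in word/document.xml as a w:br element with w:type="page". The sketch below is not the Document-object adaptation mentioned above: it swaps in the standard library (zipfile and xml.etree.ElementTree) in place of python-docx, and the function names are hypothetical. It only sees explicit breaks, not pages that Word creates when text overflows, and it does not handle the "page break before" paragraph setting.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def w(name):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{name}"


def split_docx_xml_by_page_breaks(document_xml):
    """Split the text of a document.xml payload at explicit page breaks."""
    root = ET.fromstring(document_xml)
    pages = [[]]  # each page is a list of paragraph strings
    for paragraph in root.iter(w("p")):
        parts = []
        for node in paragraph.iter():
            if node.tag == w("t"):
                parts.append(node.text or "")
            elif node.tag == w("br") and node.get(w("type")) == "page":
                # Close the current page with the text seen so far,
                # then start collecting text for the next page.
                pages[-1].append("".join(parts))
                pages.append([])
                parts = []
        pages[-1].append("".join(parts))
    # Join paragraphs with newlines and trim the blank lines left by
    # paragraphs that held nothing but the break itself.
    return ["\n".join(page).strip() for page in pages]


def split_docx_by_page_breaks(path):
    """Open a .docx file (a zip archive) and split its body text by page."""
    with zipfile.ZipFile(path) as archive:
        return split_docx_xml_by_page_breaks(archive.read("word/document.xml"))
```

With the file from the question, split_docx_by_page_breaks("ex.docx") would return one string per page, provided the pages were separated with manual page breaks.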
Source: https://stackoverflow.com/questions/49737926/how-to-split-text-read-from-a-docx-file-with-page-breaks-using-python3-docx