How to split text read from a docx file with Page breaks using python3 docx

Submitted by 依然范特西╮ on 2020-01-16 17:20:00

Question


I have a Word document (.docx file) consisting of 10 pages with 1 paragraph on each page, where each page/paragraph is separated by a page break. I want to read the text in the docx file and split it at the page breaks.

I am able to read the text with the python-docx library, but I am not sure how to split it at the page breaks. I can see a similar question, but its solution was proposed using the old python-docx library.

Here's the code for reading text from the docx file:

from docx import Document

paratextlist = Document("ex.docx")
docText = '\n'.join([
    paragraph.text for paragraph in paratextlist.paragraphs
])

Answer 1:


I think you can use a regex to search for the form feed character \f.

import re

# `text` is the document string, e.g. docText from the question.
pattern = re.compile(r"\f")
matches = pattern.finditer(text)
for match in matches:
    print(f"Page break occurs at character {match.start()}")

If 'text' is your document string, this prints the position of each page break in the string. You could then split the string at those indices.
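A minimal sketch of that splitting step, assuming the string really does contain a \f at every page break (the answer does not confirm that python-docx's paragraph.text emits that character). The function name split_on_form_feed and the sample string are made up for illustration.

```python
import re


def split_on_form_feed(text):
    """Split text into pages at each form feed character (\\f)."""
    pattern = re.compile(r"\f")
    pages = []
    start = 0
    for match in pattern.finditer(text):
        # Everything between the previous break and this one is one page.
        pages.append(text[start:match.start()])
        start = match.end()
    # Whatever follows the last break is the final page.
    pages.append(text[start:])
    return pages


# Hypothetical sample input with two form feeds.
sample = "first page\fsecond page\fthird page"
pages = split_on_form_feed(sample)
```

For a plain single-character separator like this, text.split("\f") gives the same result. The index-based loop is shown only because it follows the finditer approach above.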

This could probably be adapted using the Document object, but I'm not 100% on how.
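One possible alternative, sketched here with the standard library only rather than python-docx: a .docx file is a ZIP archive whose main body lives in word/document.xml, and an explicit page break is stored there as a w:br element with w:type="page". The function name split_docx_on_page_breaks is hypothetical. The sketch only looks at body-level paragraphs and explicit breaks, so it assumes the document does not rely on section breaks or "page break before" paragraph settings to start new pages.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def split_docx_on_page_breaks(path):
    """Return a list of page texts, split at explicit page breaks.

    `path` may be a file name or a binary file-like object.
    """
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    body = root.find(W + "body")
    pages = []   # finished pages
    lines = []   # finished paragraphs of the current page
    for paragraph in body.findall(W + "p"):
        buffer = []  # text fragments of the current paragraph
        for node in paragraph.iter():
            if node.tag == W + "t" and node.text:
                buffer.append(node.text)
            elif node.tag == W + "br" and node.get(W + "type") == "page":
                # Explicit page break: close the current page here.
                lines.append("".join(buffer))
                pages.append("\n".join(lines).strip())
                lines, buffer = [], []
        lines.append("".join(buffer))
    # Text after the last break forms the final page.
    pages.append("\n".join(lines).strip())
    return pages
```

Calling split_docx_on_page_breaks("ex.docx") on the file from the question would then return one string per page, provided the pages are separated by explicit page breaks as described.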



Source: https://stackoverflow.com/questions/49737926/how-to-split-text-read-from-a-docx-file-with-page-breaks-using-python3-docx
