How to extract text under specific headings from a PDF?

I want to extract the text under specific headings from a PDF using Python.

For example, I have a PDF with the headings Introduction, Summary, and Contents. I need to extract only the text under the heading 'Summary'.

How can I do this?

This is exactly the scenario I am working on at my current company: we need to extract the text that sits under a heading. I use a rule-based system. I read the entire document line by line and use a regex to identify all the numbered headings. Once I have the headings, I enter the name of the heading whose paragraph I want. That input is matched against the list of detected headings, using the Universal Sentence Encoder to find the nearest match. After that, I display all the content from that heading up to the next heading.
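
The steps described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used in that project: headings are assumed to look like "1 Introduction" or "2.1 Scope", the names HEADING_RE, find_headings and section_text are made up for the example, and the standard library's difflib string similarity stands in for the Universal Sentence Encoder so that the sketch runs without extra dependencies.

```python
import difflib
import re

# Assumed heading format: "1 Introduction", "2.1 Scope", "3. Summary", ...
HEADING_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+([A-Z].*\S)\s*$")


def find_headings(lines):
    """Return (line_index, title) for every line that looks like a numbered heading."""
    found = []
    for index, line in enumerate(lines):
        match = HEADING_RE.match(line)
        if match:
            found.append((index, match.group(2)))
    return found


def section_text(lines, query):
    """Return the lines between the heading closest to `query` and the next heading."""
    headings = find_headings(lines)
    lowered = [title.lower() for _, title in headings]
    # difflib is a stand-in here for the embedding-based nearest match.
    best = difflib.get_close_matches(query.lower(), lowered, n=1, cutoff=0.0)
    if not best:
        return ""
    position = lowered.index(best[0])
    start = headings[position][0] + 1
    # Stop at the next detected heading, or at the end of the document.
    if position + 1 < len(headings):
        end = headings[position + 1][0]
    else:
        end = len(lines)
    return "\n".join(lines[start:end]).strip()


document = """1 Introduction
This report covers the project.
2 Summary
Sales grew in every region.
Costs stayed flat.
3 Contents
Chapter list goes here.
"""

# A misspelled query still resolves to the "Summary" heading.
print(section_text(document.splitlines(), "sumary"))
```

Note that a pattern this simple will also treat a body line such as "2019 Results improved" as a heading, so real documents usually need a stricter rule.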

You can use the PyPDF2 Python library for that. Below is a sample snippet using PyPDF2:

# Import the PDF library. This uses the PdfReader API of recent PyPDF2
# releases; PdfFileReader, numPages, getPage and extractText were
# removed in PyPDF2 3.0.
import PyPDF2

# Open the PDF in binary mode; the with-block closes the file automatically
with open('example.pdf', 'rb') as pdf_file:
    # Create a PDF reader object
    reader = PyPDF2.PdfReader(pdf_file)

    # Print the number of pages in the PDF
    print(len(reader.pages))

    # Get the first page (pages are zero-indexed)
    page = reader.pages[0]

    # Extract and print the text of that page
    print(page.extract_text())
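
The snippet above only prints the raw text of one page. To get just the text under 'Summary', one option is to slice the extracted text between that heading and the next known heading. The following is a minimal sketch that assumes every heading appears on its own line in the extracted text, which PDF extraction does not always guarantee; extract_section and the sample string are hypothetical names and data made up for illustration.

```python
import re


def extract_section(text, heading, all_headings):
    """Return the text under `heading`, stopping at the next known heading."""

    def heading_pattern(name):
        # A heading must occupy a whole line, ignoring surrounding spaces.
        return r"^[ \t]*" + re.escape(name) + r"[ \t]*$"

    start = re.search(heading_pattern(heading), text, flags=re.MULTILINE)
    if start is None:
        return None  # the heading was not found in the text

    rest = text[start.end():]
    end = len(rest)
    for name in all_headings:
        if name == heading:
            continue
        stop = re.search(heading_pattern(name), rest, flags=re.MULTILINE)
        if stop is not None:
            # Keep the earliest heading that follows the requested one.
            end = min(end, stop.start())
    return rest[:end].strip()


# Stand-in for text extracted from a PDF.
sample = (
    "Introduction\n"
    "Some background.\n"
    "Summary\n"
    "Key findings here.\n"
    "More detail.\n"
    "Contents\n"
    "1. Chapter one\n"
)

print(extract_section(sample, "Summary", ["Introduction", "Summary", "Contents"]))
```

For a real document, the text argument would be the extracted text of all pages joined with newlines, since a section can continue across a page break.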

Source: https://stackoverflow.com/questions/48107611/how-to-extract-text-under-specific-headings-from-a-pdf

Tags

python-2.7

pdf

document

text-extraction

pdf-extraction
