How to extract text under specific headings from a pdf?

风格不统一 submitted on 2019-12-04 23:22:45

Question


I want to extract the text under specific headings from a PDF using Python.

For example, I have a PDF with the headings Introduction, Summary, and Contents. I need to extract only the text under the heading 'Summary'.

How can I do this?


Answer 1:


This is exactly the scenario I am working on at my current company: we need to extract the text that lies under a heading. I use a rule-based system: after reading the entire document line by line, a regex identifies all the numbered headings. Once I have the headings, I enter the name of the heading whose paragraph I want. This input is matched against the list of detected headings, and I use the Universal Sentence Encoder to find the nearest match. After that, I display all the content from that heading up to the next heading.
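The steps described in this answer could be sketched roughly as follows. This is only an illustration under assumptions, not the answerer's actual code: the heading regex (a number such as "2" or "2.1" followed by a capitalized title) is a guess at what "numbered headings" look like, the helper names split_by_headings and text_under are hypothetical, and the standard library's difflib.get_close_matches stands in for the Universal Sentence Encoder so that the example runs without extra dependencies.

```python
import re
from difflib import get_close_matches

# Assumed heading format: "1 Introduction", "2.1 Scope", "3. Summary".
HEADING_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+([A-Z].*)$")


def split_by_headings(lines):
    """Map each numbered heading title to the lines that follow it."""
    sections = {}
    current = None
    for line in lines:
        match = HEADING_RE.match(line)
        if match:
            # A new heading starts a new, initially empty section.
            current = match.group(2).strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def text_under(lines, query):
    """Return the text under the heading closest to `query`, or None."""
    sections = split_by_headings(lines)
    lowered = {title.lower(): title for title in sections}
    # String similarity is a stand-in for the sentence-encoder match.
    hits = get_close_matches(query.lower(), list(lowered), n=1, cutoff=0.6)
    if not hits:
        return None
    return "\n".join(sections[lowered[hits[0]]]).strip()


sample = [
    "1 Introduction",
    "This report covers the project.",
    "2 Summary",
    "Revenue grew in the last quarter.",
    "Costs stayed flat.",
    "3 Contents",
    "Chapter list goes here.",
]
print(text_under(sample, "sumary"))  # misspelled on purpose
```

A body line that happens to start with a number and a capitalized word would be mistaken for a heading, so the regex would need tuning for a real document.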




Answer 2:


You can use the PyPDF2 Python library for that. Below is a sample snippet using PyPDF2:

# importing required modules
import PyPDF2

# opening the PDF in binary mode; the with-block closes the file automatically
with open('example.pdf', 'rb') as pdf_file:
    # creating a pdf reader object
    # (PyPDF2 3.0 removed the older PdfFileReader, numPages, getPage and
    # extractText names; the calls below are their replacements)
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # printing number of pages in pdf file
    print(len(pdf_reader.pages))

    # creating a page object for the first page
    page = pdf_reader.pages[0]

    # extracting text from page
    print(page.extract_text())
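
The snippet above only prints the raw text of a page; isolating the text under 'Summary' still requires splitting that text on the headings. A minimal sketch, assuming each heading sits alone on its own line in the extracted text and that the heading names are known in advance (as in the question). The function name extract_section and the sample string are hypothetical, and real PDF text extraction may not preserve line breaks this cleanly.

```python
import re


def extract_section(text, heading, all_headings):
    """Return the text between `heading` and the next known heading."""

    def heading_pattern(name):
        # Assumption: a heading is a line holding only the heading name.
        return r"^[ \t]*" + re.escape(name) + r"[ \t]*$"

    start = re.search(heading_pattern(heading), text, flags=re.MULTILINE)
    if start is None:
        return None
    rest = text[start.end():]

    # The section ends at the earliest other heading, or at end of text.
    end = len(rest)
    for name in all_headings:
        if name == heading:
            continue
        hit = re.search(heading_pattern(name), rest, flags=re.MULTILINE)
        if hit:
            end = min(end, hit.start())
    return rest[:end].strip()


page_text = (
    "Introduction\n"
    "Some opening words.\n"
    "Summary\n"
    "The key findings.\n"
    "Contents\n"
    "1. First chapter\n"
)
headings = ["Introduction", "Summary", "Contents"]
print(extract_section(page_text, "Summary", headings))
```

For a multi-page document, the text of all pages would need to be joined into one string before calling the function, since a section can span a page boundary.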


Source: https://stackoverflow.com/questions/48107611/how-to-extract-text-under-specific-headings-from-a-pdf
