Question
I am working on a Python project where I need to process data from PDF research papers. I can parse the papers, extract their text, and identify sections using PyPDF2:
import PyPDF2

# Open the PDF in binary mode; the file is closed when the block ends
with open('fileName.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    pageCount = pdfReader.numPages
    text = ''
    # Concatenate the extracted text of every page
    for count in range(pageCount):
        pageObj = pdfReader.getPage(count)
        text += pageObj.extractText()
Every paper ends with a References section, which I am able to extract, but some papers also contain additional data after the references. It can be anything (text, images, tables) and may or may not start with a heading. See this and this paper for reference.
Here is a portion of how I'm getting the references and parsing them, but now I have all kinds of random data mixed into the references, and I'm stuck on how to separate the references from the extra content that follows.
Any kind of help will be appreciated.
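For illustration, here is a minimal sketch of one heuristic that might separate the references from trailing content. It rests on assumptions that are not stated in the question: the section starts with a line reading "References" or "Bibliography", entries are numbered [1], [2], ... in order, and a blank line separates the last entry from whatever follows. The function name extract_references and both patterns are hypothetical, not part of PyPDF2.

```python
import re

# Assumed layout: a heading line on its own, then entries numbered [1], [2], ...
HEADING_RE = re.compile(
    r"^\s*(?:references|bibliography)\s*$", re.IGNORECASE | re.MULTILINE
)
ENTRY_RE = re.compile(r"^\s*\[(\d+)\]")


def extract_references(text):
    """Return the numbered entries after the last 'References' heading.

    Parsing stops at the first non-entry text that follows a blank line,
    so appendices or tables after the list are ignored.
    """
    headings = list(HEADING_RE.finditer(text))
    if not headings:
        return []
    # Use the last heading so an earlier mention (e.g. a table of contents)
    # does not start the list too early.
    body = text[headings[-1].end():]

    entries = []
    entry_open = False  # True while continuation lines may still follow
    for line in body.splitlines():
        stripped = line.strip()
        match = ENTRY_RE.match(line)
        if match and int(match.group(1)) == len(entries) + 1:
            # Next entry in sequence: start a new reference
            entries.append(stripped)
            entry_open = True
        elif not stripped:
            # A blank line closes the current entry
            entry_open = False
        elif entry_open:
            # Wrapped line belonging to the current entry
            entries[-1] += " " + stripped
        elif entries:
            # Text after a closed entry that is not the next number: stop
            break
    return entries
```

Calling extract_references(text) on the string built above would return a list of reference strings. If the extracted text has no blank line after the last entry, that entry will absorb the trailing content, so papers with author-year references or no blank lines would need a different stopping rule.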
Source: https://stackoverflow.com/questions/62542857/ignore-all-data-after-references-python