Question
I have a PDF full of quotes:
https://www.pdf-archive.com/2017/03/22/test/
I can extract the text in Python using the following code:
import PyPDF2
pdfFileObj = open('example.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())
This returns all the quotes as one paragraph. Is it possible to split the PDF at the horizontal separators, so that each quote comes out separately?
Answer 1:
If you just want to extract the quotes from the PDF text, you can use a regular expression to find them all.
import PyPDF2
import re
pdfFileObj = open('test.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
text = str(pageObj.extractText())
quotes = re.findall(r'"[^"]*"',text)
for quote in quotes:
    print(quote)
    print()  # blank line between quotes
Or just:
quotes = re.findall(r'"[^"]*"',text)
print(quotes)
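To try the pattern without a PDF at hand, here is a minimal sketch on a made-up string. `sample_text` and the helper name `find_quotes` are illustrative, not part of the original answer, and the curly-quote handling is an assumption about what some PDFs contain.

```python
import re

def find_quotes(text):
    """Return every double-quoted span in text, quotation marks included."""
    # Assumption: some PDFs use curly quotes; map them to straight ones first.
    text = text.replace('\u201c', '"').replace('\u201d', '"')
    # Opening quote, any run of non-quote characters (newlines included),
    # then the closing quote.
    return re.findall(r'"[^"]*"', text)

# Made-up stand-in for the text returned by extractText().
sample_text = '"First quote."\n"Second quote,\nspanning two lines."\n'

for quote in find_quotes(sample_text):
    print(quote)
    print()
```

Because `[^"]` also matches newlines, a quote that wraps over several lines is still returned as a single item.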
Answer 2:
I could not find a way to split it at the horizontal separator, but I managed to do it another way:
import PyPDF2
quotes = []
pdfFileObj = open('test.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
# Each quote ends with a closing '"' followed by a newline, so split there.
for x in pageObj.extractText().split('"\n'):
    print(x + "\n" * 5)
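One thing to watch with this approach: `split('"\n')` removes the closing quotation mark from every piece. A small sketch that puts it back and collects the results in a list, again on a made-up string (`split_quotes` and `sample_text` are hypothetical names, not from the answer):

```python
def split_quotes(text):
    """Split text after each closing quote and restore the dropped '"'."""
    quotes = []
    for piece in text.split('"\n'):
        piece = piece.strip()
        if not piece:
            continue  # skip the empty piece left after the last separator
        # split() consumed the closing quote of every piece but possibly the last.
        if piece.startswith('"') and not piece.endswith('"'):
            piece += '"'
        quotes.append(piece)
    return quotes

# Made-up stand-in for the text returned by extractText().
sample_text = '"First quote."\n"Second quote,\nspanning two lines."\n'
for quote in split_quotes(sample_text):
    print(quote)
```

This only works if every quote really ends with `"` directly before a line break, which depends on how the text was extracted.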
Answer 3:
import pdfplumber
file_path = 'test.pdf'  # assumed path; the original snippet left file_path undefined
with pdfplumber.open(file_path) as pdf:
    p0 = pdf.pages[0]
    text = p0.extract_text()
print(text)
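pdfplumber also exposes the drawn lines and the positions of the words on a page, so splitting at the horizontal separators might be possible along these lines. This is only a sketch: it assumes the separators are stored as line objects in `page.lines` (in some PDFs they are thin rectangles in `page.rects` instead), and the helper names are made up for illustration.

```python
from bisect import bisect_right

def group_words_by_separators(words, separator_tops):
    """Group word dicts into text blocks delimited by horizontal rules.

    words: dicts with 'text' and 'top' keys, in reading order
           (the shape returned by pdfplumber's page.extract_words()).
    separator_tops: vertical positions of the separators.
    """
    cuts = sorted(separator_tops)
    blocks = [[] for _ in range(len(cuts) + 1)]
    for word in words:
        # Number of separators at or above this word = index of its block.
        blocks[bisect_right(cuts, word['top'])].append(word['text'])
    return [' '.join(block) for block in blocks if block]

def quotes_from_page(page):
    """Split a pdfplumber page at its (near-)horizontal line objects."""
    # Assumption: separators are line objects whose top and bottom coincide.
    tops = [ln['top'] for ln in page.lines if abs(ln['bottom'] - ln['top']) < 1]
    return group_words_by_separators(page.extract_words(), tops)

# Usage with a real file (not run here):
#   import pdfplumber
#   with pdfplumber.open('test.pdf') as pdf:
#       for quote in quotes_from_page(pdf.pages[0]):
#           print(quote)
```

Words are joined with single spaces, so line breaks inside a quote are not preserved.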
Source: https://stackoverflow.com/questions/42962811/extracting-text-from-pdf-in-python