Extracting text from PDF in Python

Submitted by 白昼怎懂夜的黑 on 2019-12-24 10:57:56

Question


I have a PDF full of quotes:

https://www.pdf-archive.com/2017/03/22/test/

I can extract the text in Python using the following code:

import PyPDF2

# PyPDF2 3.x names; older releases used PdfFileReader, getPage and extractText.
with open('example.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfReader(pdfFileObj)
    pageObj = pdfReader.pages[0]
    print(pageObj.extract_text())

This returns all the quotes as one paragraph. Is it possible to split the PDF at the horizontal separators, so that each quote is extracted separately?


Answer 1:


If you just want to extract the quotes from the PDF text, you can use a regex to find them all.

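import re

import PyPDF2

# PyPDF2 3.x names; older releases used PdfFileReader, getPage and extractText.
with open('test.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfReader(pdfFileObj)
    pageObj = pdfReader.pages[0]
    text = pageObj.extract_text()

# Match each run of text enclosed in straight double quotes.
quotes = re.findall(r'"[^"]*"', text)
for quote in quotes:
    print(quote)
    print()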

Or just:

quotes = re.findall(r'"[^"]*"', text)
print(quotes)
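
The pattern above only matches straight double quotes. As a small self-contained sketch of the same idea, the helper below also accepts curly quotes, which extracted PDF text often contains. The name extract_quotes and the sample text are made up for illustration, and whether the PDF in the question uses curly quotes is an assumption.

```python
import re

# Straight quotes, or an opening curly quote paired with a closing curly quote.
QUOTE_PATTERN = re.compile('"[^"]*"|“[^”]*”')


def extract_quotes(text):
    """Return every double-quoted passage in text, quotes included."""
    return QUOTE_PATTERN.findall(text)


# Made-up sample standing in for the output of the PDF extraction step.
sample = '"First quote."\nAuthor One\n“Second quote,\nover two lines.”\nAuthor Two'
for quote in extract_quotes(sample):
    print(quote)
    print()
```

Because the character classes exclude only the closing quote, a quote that wraps over several lines is still returned as one match.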



Answer 2:


I could not find a way to split it by the horizontal separator, but I managed to do it another way:

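import PyPDF2

# PyPDF2 3.x names; older releases used PdfFileReader, getPage and extractText.
with open('test.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfReader(pdfFileObj)
    pageObj = pdfReader.pages[0]
    # Split wherever a closing double quote is followed by a newline.
    for x in pageObj.extract_text().split('"\n'):
        print(x + "\n" * 5)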
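
One side effect of splitting on '"\n' is that str.split discards the delimiter, so each printed chunk loses its closing quote. A minimal sketch of the same approach that puts the quote back is shown here. The helper name split_quotes and the sample text are hypothetical, and it assumes, as the code above does, that every quote ends with a closing double quote at the end of a line.

```python
def split_quotes(text, delimiter='"\n'):
    """Split text after each closing quote that ends a line.

    str.split removes the delimiter, so the closing quote is added back
    to every chunk that was followed by one.
    """
    chunks = text.split(delimiter)
    # Every chunk except the last was terminated by the delimiter.
    restored = [chunk + '"' for chunk in chunks[:-1]]
    # Keep trailing text only if it is more than whitespace.
    if chunks[-1].strip():
        restored.append(chunks[-1])
    return [chunk.strip() for chunk in restored]


# Made-up sample standing in for the extracted page text.
sample = '"First quote."\n"Second quote,\nover two lines."\n'
for quote in split_quotes(sample):
    print(quote)
    print()
```

If the page also contains text between quotes, such as an attribution line, that text ends up at the start of the following chunk.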



Answer 3:


import pdfplumber

file_path = 'test.pdf'  # assumed path, matching the other answers

with pdfplumber.open(file_path) as pdf:
    p0 = pdf.pages[0]
    text = p0.extract_text()

print(text)
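
pdfplumber also exposes positions, which may make the split by horizontal separator from the question possible: page.extract_words() returns dicts with "text" and "top" keys, and drawn lines and rectangles are listed in page.lines and page.rects, each with a "top" key as well. The sketch below shows only the grouping step, on hand-made data so that it runs without a PDF. The function name group_words_by_separators is hypothetical, and it is an assumption that the separators in this particular PDF are drawn as lines or rectangles rather than embedded as an image.

```python
from bisect import bisect_right


def group_words_by_separators(words, separator_tops):
    """Group words into blocks delimited by horizontal separators.

    words: dicts with "text" and "top" keys (distance from the page top),
        assumed to arrive in reading order.
    separator_tops: vertical positions of the separators, in the same units.
    """
    boundaries = sorted(separator_tops)
    blocks = [[] for _ in range(len(boundaries) + 1)]
    for word in words:
        # The number of separators above the word selects its block.
        index = bisect_right(boundaries, word["top"])
        blocks[index].append(word["text"])
    # Drop blocks that contain no words, e.g. between adjacent separators.
    return [" ".join(block) for block in blocks if block]


# Hand-made stand-ins for page.extract_words() and the separator positions.
words = [
    {"text": "First", "top": 10.0},
    {"text": "quote.", "top": 10.0},
    {"text": "Second", "top": 60.0},
    {"text": "quote.", "top": 60.0},
]
separator_tops = [40.0]
for block in group_words_by_separators(words, separator_tops):
    print(block)
```

With a real page, words would come from p0.extract_words() and separator_tops from something like [line["top"] for line in p0.lines], after checking which object type the separators actually are.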


Source: https://stackoverflow.com/questions/42962811/extracting-text-from-pdf-in-python
