How to extract text from a PDF in Python 3.7

后端未结


I am trying to extract text from a PDF file using Python. My main goal is to create a program that reads a bank statement and extracts its text to update an exce

10 answers

猫巷女王i

2020-12-29 10:52

Try pdfreader. You can extract either plain text or decoded text containing "pdf markdown":

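from pdfreader import SimplePDFViewer, PageDoesNotExist

pdf_file_name = "your_file.pdf"  # placeholder: path to your PDF

plain_text = ""
pdf_markdown = ""

# Keep the file open while the viewer walks through the pages.
with open(pdf_file_name, "rb") as fd:
    viewer = SimplePDFViewer(fd)
    try:
        while True:
            viewer.render()  # parse the current page
            pdf_markdown += viewer.canvas.text_content  # decoded text with PDF markup
            plain_text += "".join(viewer.canvas.strings)  # plain strings only
            viewer.next()  # raises PageDoesNotExist after the last page
    except PageDoesNotExist:
        pass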


时光说笑

2020-12-29 10:58

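import PyPDF2

# A hyphen is not allowed in a Python identifier, so the file object
# is named pdf_file rather than pdf-file.
# This uses the PyPDF2 1.x/2.x API (PdfFileReader, numPages, getPage, extractText).
with open('January2019.pdf', 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfFileReader(pdf_file)
    count = pdf_reader.numPages
    for i in range(count):
        page = pdf_reader.getPage(i)
        print(page.extractText())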
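
For newer installs: PyPDF2 3.0 removed `PdfFileReader`, `numPages`, `getPage` and `extractText` in favour of `PdfReader`, `reader.pages` and `page.extract_text()`. Here is a minimal sketch of the same loop with those names, assuming PyPDF2 3.x is installed. The helper names `text_per_page` and `read_pdf` are made up for illustration.

```python
def text_per_page(reader):
    """Return one string per page for any reader object exposing .pages."""
    # Scanned or image-only pages may give no text at all,
    # so falsy results are normalised to an empty string.
    return [(page.extract_text() or "") for page in reader.pages]


def read_pdf(path):
    """Open a PDF with PyPDF2 3.x and return its text page by page."""
    # Imported lazily so text_per_page can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader

    with open(path, "rb") as pdf_file:
        return text_per_page(PdfReader(pdf_file))
```

Usage would look like `for text in read_pdf('January2019.pdf'): print(text)`.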


南笙

2020-12-29 10:59
Using tika worked for me!
```
from tika import parser

rawText = parser.from_file('January2019.pdf')

rawList = rawText['content'].splitlines()
```
This made it really easy to separate each line of the bank statement into a list.
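
One thing to watch for: as far as I know, `rawText['content']` can be `None` when Tika finds no text (a scanned PDF, for example), and `splitlines()` keeps blank lines. Here is a small sketch of a cleanup step. `clean_lines` is a hypothetical helper, not part of tika.

```python
def clean_lines(content):
    """Split extracted text into stripped, non-empty lines."""
    if not content:  # covers both None and the empty string
        return []
    # Strip surrounding whitespace and drop lines that end up empty.
    return [line.strip() for line in content.splitlines() if line.strip()]
```

It would replace the last line of the snippet: `rawList = clean_lines(rawText['content'])`.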
面向向阳花

2020-12-29 11:01
PyPDF2 is highly unreliable for extracting text from PDFs, as pointed out here too. It says:

While PyPDF2 has .extractText(), which can be used on its page objects (not shown in this example), it does not work very well. Some PDFs will return text and some will return an empty string. When you want to extract text from a PDF, you should check out the PDFMiner project instead. PDFMiner is much more robust and was specifically designed for extracting text from PDFs.
1. You could instead install and use pdfminer:
  
  pip install pdfminer
2. Or you can use another open-source utility named pdftotext by xpdfreader. Instructions for using the utility are given on its page.
You can download the command-line tools from here and call the pdftotext.exe utility using subprocess. A detailed explanation of using subprocess is given here.
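
A rough sketch of both options, with some assumptions. The first function uses `extract_text` from `pdfminer.high_level`, which ships with the maintained fork pdfminer.six (`pip install pdfminer.six`) and may not exist in the older pdfminer package. The second assumes a `pdftotext` executable is on your PATH (on Windows, pass the full path to pdftotext.exe as `exe`). The function names are my own.

```python
import shutil
import subprocess


def extract_with_pdfminer(pdf_path):
    """Return the text of a PDF using pdfminer.six (assumed to be installed)."""
    from pdfminer.high_level import extract_text  # lazy import

    return extract_text(pdf_path)


def build_pdftotext_command(pdf_path, exe="pdftotext", keep_layout=True):
    """Build the argument list for pdftotext; '-' sends the text to stdout."""
    cmd = [exe]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the physical layout of the page
    cmd += [pdf_path, "-"]
    return cmd


def extract_with_pdftotext(pdf_path, exe="pdftotext"):
    """Run pdftotext as a subprocess and return its output as a string."""
    if shutil.which(exe) is None:
        raise FileNotFoundError(f"{exe} was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, exe),
        capture_output=True,  # available from Python 3.7
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode("utf-8", errors="replace")
```

The `-layout` flag tends to help with bank statements because it keeps columns aligned, but that depends on the PDF.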
小鲜肉

2020-12-29 11:04
I tried many methods without success, including PyPDF2 and Tika. I finally found the pdfplumber module, which works for me; you can try it too.

Hope this will be helpful to you.
```
import pdfplumber
pdf = pdfplumber.open('pdffile.pdf')
page = pdf.pages[0]
text = page.extract_text()
print(text)
pdf.close()
```
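
The snippet above reads only the first page. Here is a sketch that walks every page instead, assuming the statement spans several pages. `join_pages` and `extract_pdf_text` are names made up here.

```python
def join_pages(pdf):
    """Concatenate the text of every page in an opened pdfplumber PDF."""
    texts = []
    for page in pdf.pages:
        # A page with no extractable text may give None or "",
        # depending on the pdfplumber version, so normalise to "".
        texts.append(page.extract_text() or "")
    return "\n".join(texts)


def extract_pdf_text(path):
    """Open the file with pdfplumber and return the text of all pages."""
    import pdfplumber  # lazy import

    # pdfplumber.open works as a context manager, so the file is closed for us.
    with pdfplumber.open(path) as pdf:
        return join_pages(pdf)
```

Then `print(extract_pdf_text('pdffile.pdf'))` prints the whole document.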

醉话见心

2020-12-29 11:04

Try this:

In a terminal: pip install PyPDF2

import PyPDF2
pdfFileObject = open('mypdf.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    print(page.extractText())
