How to extract text from a PDF in Python 3.7

后悔当初 2020-12-29 10:19

I am trying to extract text from a PDF file using Python. My main goal is to create a program that reads a bank statement and extracts its text to update an exce

10 answers
  • 2020-12-29 10:52

    Try pdfreader. You can extract either plain text or decoded text containing "pdf markdown":

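    from pdfreader import SimplePDFViewer, PageDoesNotExist

    pdf_file_name = "your_file.pdf"  # placeholder: path to your PDF

    plain_text = ""
    pdf_markdown = ""

    with open(pdf_file_name, "rb") as fd:
        viewer = SimplePDFViewer(fd)
        try:
            while True:
                viewer.render()  # render the current page
                pdf_markdown += viewer.canvas.text_content
                plain_text += "".join(viewer.canvas.strings)
                viewer.next()  # move on to the next page
        except PageDoesNotExist:
            pass  # raised after the last page, which ends the loop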
    
    
  • 2020-12-29 10:58
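    import PyPDF2

    # These names are from PyPDF2 before 3.0. From 3.0 on they are
    # PdfReader, len(reader.pages), reader.pages[i] and page.extract_text().
    pdf_file = open('January2019.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdf_file)
    count = pdfReader.numPages
    for i in range(count):
        page = pdfReader.getPage(i)
        print(page.extractText())
    pdf_file.close()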
    
  • 2020-12-29 10:59

    Using tika worked for me!

    from tika import parser
    
    rawText = parser.from_file('January2019.pdf')
    
    rawList = rawText['content'].splitlines()
    

    This made it really easy to split each line of the bank statement into a list.
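
    Once the text is split, the list usually still contains blank lines and stray whitespace. A minimal sketch of a cleanup step, assuming the text has already been extracted (for example `rawText['content']` above). The helper name `clean_lines` and the sample text are made up for illustration, and the `or ""` guard is a defensive assumption for the case where no text was extracted:

```python
def clean_lines(content):
    """Split extracted text into stripped, non-empty lines."""
    # Treat a missing result (None) as empty text.
    text = content or ""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Made-up sample text standing in for the extracted statement.
sample = "\n\nDate  Description  Amount\n  01/02  Coffee  3.50  \n\n"
lines = clean_lines(sample)
print(lines)
```

    With tika this would be called as `clean_lines(rawText['content'])`.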

  • 2020-12-29 11:01

    PyPDF2 is highly unreliable for extracting text from PDFs, as pointed out here too. It says:

    While PyPDF2 has .extractText(), which can be used on its page objects (not shown in this example), it does not work very well. Some PDFs will return text and some will return an empty string. When you want to extract text from a PDF, you should check out the PDFMiner project instead. PDFMiner is much more robust and was specifically designed for extracting text from PDFs.

    1. You could instead install and use pdfminer:

      pip install pdfminer

    2. Or you can use another open-source utility, pdftotext by xpdfreader. Instructions for using the utility are given on its page.

    You can download the command-line tools from here and run the pdftotext.exe utility with subprocess. A detailed explanation of using subprocess is given here.
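
    Calling pdftotext through subprocess could look roughly like this. It is a sketch, assuming the `pdftotext` executable is on your PATH (on Windows, pass the full path to pdftotext.exe instead). The function names are hypothetical; `-layout`, `-enc UTF-8` and `-` as the output file (write to stdout) are standard pdftotext options:

```python
import subprocess


def build_pdftotext_command(pdf_path, executable="pdftotext"):
    """Build the argument list; "-" asks pdftotext to write to stdout."""
    return [executable, "-layout", "-enc", "UTF-8", pdf_path, "-"]


def pdf_to_text(pdf_path, executable="pdftotext"):
    """Run pdftotext and return the extracted text as a string."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, executable),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode("utf-8", errors="replace")
```

    For example, `pdf_to_text('January2019.pdf')` would return the text of the whole statement, which can then be split into lines.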

  • 2020-12-29 11:04

    I tried many methods, including PyPDF2 and Tika, but they all failed. I finally found the pdfplumber module, which works for me; you can try it too.

    Hope this will be helpful to you.

    import pdfplumber
    pdf = pdfplumber.open('pdffile.pdf')
    page = pdf.pages[0]
    text = page.extract_text()
    print(text)
    pdf.close()
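
    The snippet above reads only the first page. To read a whole statement, something like the following sketch may work. The function names are hypothetical, and the `or ""` guard assumes that `extract_text()` can return `None` for a page with no text, which some pdfplumber versions do:

```python
def join_page_texts(page_texts):
    """Join per-page text, treating pages with no text (None) as empty."""
    return "\n".join(text or "" for text in page_texts)


def extract_all_text(path):
    """Extract text from every page of the PDF at `path`."""
    import pdfplumber  # imported here so the helper above works without it

    # The context manager closes the file even if extraction fails.
    with pdfplumber.open(path) as pdf:
        return join_page_texts(page.extract_text() for page in pdf.pages)
```

    Usage would be `print(extract_all_text('pdffile.pdf'))`.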
    
  • 2020-12-29 11:04

    Try this.

    In a terminal: pip install PyPDF2

    import PyPDF2
    pdfFileObject = open('mypdf.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
    count = pdfReader.numPages
    for i in range(count):
        page = pdfReader.getPage(i)
        print(page.extractText())
    