How to extract text from a PDF in Python 3.7

后悔当初 2020-12-29 10:19

I am trying to extract text from a PDF file using Python. My main goal is to create a program that reads a bank statement and extracts its text to update an exce

10 answers
  • 2020-12-29 11:07

    If you are looking for a maintained, bigger project, have a look at PyMuPDF. Install it with pip install pymupdf and use it like this:

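    import fitz  # PyMuPDF
    
    def get_text(filepath: str) -> str:
        """Return the text of every page, with pages separated by a newline."""
        with fitz.open(filepath) as doc:
            # Newer PyMuPDF releases renamed getText() to get_text()
            return "\n".join(page.get_text().strip() for page in doc)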
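
    Since the goal in the question is to get statement data into a spreadsheet, the extracted text still has to be split into rows. A minimal sketch, assuming (hypothetically) that each transaction line looks like `2020-12-01 Coffee shop -4.50`. The pattern, the `parse_transactions` and `write_csv` names, and the column layout are illustrative only and will need adapting to the real statement:

```python
import csv
import re
from typing import List, Tuple

# Hypothetical line layout: ISO date, free-text description, signed amount.
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(.+?)\s+(-?\d[\d,]*\.\d{2})$")


def parse_transactions(text: str) -> List[Tuple[str, str, float]]:
    """Return (date, description, amount) for every line matching LINE_RE."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            date, description, amount = match.groups()
            # Drop thousands separators before converting to a number
            rows.append((date, description, float(amount.replace(",", ""))))
    return rows


def write_csv(rows, out_path: str) -> None:
    """Write the rows to a CSV file, which Excel can open directly."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["date", "description", "amount"])
        writer.writerows(rows)
```

    Lines that do not match the pattern (headers, page footers) are skipped, so something like `write_csv(parse_transactions(get_text(path)), "statement.csv")` would keep only the transaction rows.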
    
  • 2020-12-29 11:07
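    import os
    
    import pdftables_api
    
    # Replace with the API key from your PDFTables account
    c = pdftables_api.Client('MY-API-KEY')
    
    file_path = "C:\\Users\\MyName\\Documents\\PDFTablesCode\\"
    
    # Convert every PDF in the folder to an .xlsx file in the working directory
    for file in os.listdir(file_path):
        if file.endswith(".pdf"):
            c.xlsx(os.path.join(file_path, file), file + '.xlsx')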
    

    Go to https://pdftables.com to get an API key.

    Available output formats:

    CSV: format=csv
    XML: format=xml
    HTML: format=html
    XLSX: format=xlsx-single or format=xlsx-multiple
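
    The folder loop above can be generalised to other output formats. A sketch, assuming the client offers `csv`, `xml`, `html` and `xlsx` methods that each take an input path and an output path, as `c.xlsx` does above; `convert_folder` is a hypothetical helper name, not part of the library:

```python
import os


def convert_folder(client, folder: str, fmt: str = "csv") -> list:
    """Convert every PDF in `folder` by calling client.<fmt>(input, output).

    `client` is expected to be a pdftables_api.Client; `fmt` is assumed to
    name one of its conversion methods (csv, xml, html, xlsx).
    """
    convert = getattr(client, fmt)  # e.g. client.csv
    outputs = []
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".pdf"):
            stem = os.path.splitext(name)[0]
            out_path = os.path.join(folder, f"{stem}.{fmt}")
            convert(os.path.join(folder, name), out_path)
            outputs.append(out_path)
    return outputs

# Usage (needs the pdftables_api package and a real API key):
#   import pdftables_api
#   convert_folder(pdftables_api.Client('MY-API-KEY'), "C:\\Path\\To\\PDFs", "csv")
```

    Unlike the original loop, this writes each result next to its source PDF and replaces the `.pdf` extension instead of appending to it.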

  • 2020-12-29 11:10

    Here is an alternative solution for Windows 10 and Python 3.8:

    Example test pdf: https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing

    #pip install pdfminer.six
    import io
    
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    
    
    def convert_pdf_to_txt(path):
        '''Convert the content of a PDF file to text.
    
        :param path: path to the PDF file
        :return: the extracted text as a single string
        '''
        rsrcmgr = PDFResourceManager()
        codec = 'utf-8'
        laparams = LAParams()
    
        with io.StringIO() as retstr:
            with TextConverter(rsrcmgr, retstr, codec=codec,
                               laparams=laparams) as device:
                with open(path, 'rb') as fp:
                    interpreter = PDFPageInterpreter(rsrcmgr, device)
                    password = ""
                    maxpages = 0
                    caching = True
                    pagenos = set()
    
                    for page in PDFPage.get_pages(fp,
                                                  pagenos,
                                                  maxpages=maxpages,
                                                  password=password,
                                                  caching=caching,
                                                  check_extractable=True):
                        interpreter.process_page(page)
    
                    return retstr.getvalue()
    
    
    if __name__ == "__main__":
        print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))
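
    `TextConverter` writes a form feed character (`\f`) at the end of each page, so the single string returned above can be split back into pages. A small sketch; `split_pages` is a hypothetical helper, and the form-feed behaviour is worth checking against your installed pdfminer.six version:

```python
from typing import List


def split_pages(text: str) -> List[str]:
    """Split converter output into per-page strings on form feeds."""
    pages = [page.strip() for page in text.split("\f")]
    # The trailing form feed leaves an empty final chunk; this also drops
    # pages that contain no text at all.
    return [page for page in pages if page]
```

    For example, `split_pages(convert_pdf_to_txt(path))[0]` would give the text of the first non-empty page.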
    
  • 2020-12-29 11:13

    PyPDF2 does not read the whole PDF correctly. Use pdftotext instead:

    import pdftotext
    
    # Open in binary mode; the with block closes the file afterwards
    with open("January2019.pdf", 'rb') as pdf_file:
        pdf = pdftotext.PDF(pdf_file)
    
    # Iterate over all the pages
    for page in pdf:
        print(page)
    