How to extract text from a PDF file?

孤城傲影 2020-11-22 14:05

I'm trying to extract the text included in this PDF file using Python.

I'm using the PyPDF2 module, and have the following script:

imp         


        
24 Answers
  • 2020-11-22 14:26

    After trying textract (which seemed to have too many dependencies), PyPDF2 (which could not extract text from the PDFs I tested with), and tika (which was too slow), I ended up using pdftotext from xpdf (as already suggested in another answer) and called the binary from Python directly. You may need to adapt the path to pdftotext:

    import os, subprocess
    SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
    args = ["/usr/local/bin/pdftotext",
            '-enc',
            'UTF-8',
            "{}/my-pdf.pdf".format(SCRIPT_DIR),
            '-']
    res = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    output = res.stdout.decode('utf-8')
    

    There is a pdftotext package that does basically the same thing, but it assumes pdftotext is in /usr/local/bin, whereas I am running this in AWS Lambda and wanted to use the binary from the current directory.

    By the way, to use this on Lambda you need to bundle the binary and its libstdc++.so dependency into your Lambda function. I personally needed to compile xpdf. Since the instructions for this would blow up this answer, I put them on my personal blog.
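
    A minimal sketch of the "use the binary from the current directory" idea, assuming a pdftotext executable sits next to the script and falling back to PATH otherwise. The helper names find_pdftotext and pdf_to_text are hypothetical; the command-line arguments are the ones from the snippet above.

```python
import os
import shutil
import subprocess


def find_pdftotext(search_dir):
    """Return a path to pdftotext, preferring an executable copy in search_dir."""
    local = os.path.join(search_dir, "pdftotext")
    if os.path.isfile(local) and os.access(local, os.X_OK):
        return local
    # Fall back to PATH; shutil.which returns None when nothing is found.
    return shutil.which("pdftotext")


def pdf_to_text(binary, pdf_path):
    """Run pdftotext on pdf_path and return the extracted text as str."""
    # A lone '-' as the output file makes pdftotext write to stdout.
    args = [binary, "-enc", "UTF-8", pdf_path, "-"]
    res = subprocess.run(args, stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE, check=True)
    return res.stdout.decode("utf-8")
```

    With check=True a non-zero exit status raises subprocess.CalledProcessError, so a failed conversion is not silently turned into an empty string.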

  • 2020-11-22 14:28

    I've got a better workaround than OCR that maintains the page alignment while extracting the text from a PDF. It should be of help:

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    from io import StringIO
    
    def convert_pdf_to_txt(path):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0     # 0 means no page limit
        caching = True
        pagenos = set()  # empty set means all pages
    
        # The with block closes the PDF file even if a page fails to parse.
        with open(path, 'rb') as fp:
            for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                          password=password, caching=caching,
                                          check_extractable=True):
                interpreter.process_page(page)
    
        text = retstr.getvalue()
    
        device.close()
        retstr.close()
        return text
    
    text = convert_pdf_to_txt('test.pdf')
    print(text)
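
    The function above always processes every page, because pagenos is an empty set and maxpages is 0. As far as I can tell, pdfminer compares pagenos against zero-based page indexes, so a small helper could turn a human-friendly range into the set to pass in. This is only a sketch: parse_page_ranges and its '1-3,5' input format are hypothetical, not part of pdfminer.

```python
def parse_page_ranges(spec):
    """Turn a 1-based spec such as '1-3,5' into a set of 0-based page indexes."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # 'first-last' is inclusive on both ends, counted from page 1.
            first, last = part.split("-", 1)
            pages.update(range(int(first) - 1, int(last)))
        else:
            pages.add(int(part) - 1)
    return pages


# Hypothetical usage with the get_pages call shown above:
# pagenos = parse_page_ranges("1-3,5")   # pass this instead of set()
```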
    
  • 2020-11-22 14:34

    Look at this code:

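    import PyPDF2
    
    # Legacy PyPDF2 (< 3.0) API; later releases renamed these calls
    # to PdfReader, reader.pages and page.extract_text().
    with open('sample.pdf', 'rb') as pdf_file:
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        number_of_pages = read_pdf.getNumPages()
        page = read_pdf.getPage(0)  # first page only
        page_content = page.extractText()
    print(page_content)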
    

    The output is:

    !"#$%#$%&%$&'()*%+,-%./01'*23%4
    5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)
    %
    

    Using the same code to read a different PDF, 201308FCR.pdf, the output is normal.

    Its documentation explains why:

    def extractText(self):
        """
        Locate all text drawing commands, in the order they are provided in the
        content stream, and extract the text.  This works well for some PDF
        files, but poorly for others, depending on the generator used.  This will
        be refined in the future.  Do not rely on the order of text coming out of
        this function, as it will change if this function is made more
        sophisticated.
        :return: a unicode string object.
        """
    
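    Since extraction works for some PDFs and produces symbol soup for others, one option is to check the result before trusting it. The sketch below is a naive heuristic of my own, not something PyPDF2 provides: looks_garbled and its 0.5 threshold are assumptions, and digit-heavy pages such as tables would be flagged too.

```python
def looks_garbled(text, min_letter_ratio=0.5):
    """Heuristic: True when too few of the non-space characters are letters."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return True  # nothing was extracted at all
    letters = sum(ch.isalpha() for ch in chars)
    return letters / len(chars) < min_letter_ratio


# Hypothetical usage: fall back to another extractor when this returns True.
# if looks_garbled(page_content): ...
```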
  • 2020-11-22 14:34

    A multi-page PDF can be extracted as text in a single pass, instead of passing each page number individually, using the code below:

    import PyPDF2
    
    # Legacy PyPDF2 (< 3.0) API.
    with open('samples.pdf', 'rb') as pdf_file:
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        number_of_pages = read_pdf.getNumPages()
        # Visit every page index in order instead of hard-coding one.
        for i in range(number_of_pages):
            page = read_pdf.getPage(i)
            page_content = page.extractText()
            print(page_content)
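
    If the text is needed as one string rather than printed page by page, the loop can be wrapped in a function. This is a sketch written against the legacy PyPDF2 (< 3.0) method names used in this answer; extract_all_text and its separator parameter are hypothetical.

```python
def extract_all_text(reader, separator="\n"):
    """Join the text of every page of a legacy PyPDF2 (< 3.0) reader object."""
    parts = []
    for i in range(reader.getNumPages()):
        parts.append(reader.getPage(i).extractText())
    return separator.join(parts)


# Hypothetical usage, with the file still open:
# text = extract_all_text(PyPDF2.PdfFileReader(pdf_file))
```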
    
  • 2020-11-22 14:34

    I found a solution here: PDFLayoutTextStripper.

    It's good because it can keep the layout of the original PDF.

    It's written in Java but I have added a Gateway to support Python.

    Sample code:

    from py4j.java_gateway import JavaGateway
    
    gw = JavaGateway()
    result = gw.entry_point.strip('samples/bus.pdf')
    
    # result is a dict of {
    #   'success': 'true' or 'false',
    #   'payload': pdf file content if 'success' is 'true'
    #   'error': error message if 'success' is 'false'
    # }
    
    print(result['payload'])
    

    [Figure: sample output from PDFLayoutTextStripper]

    You can see more details here: Stripper with Python.
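
    Because 'success' comes back as the string 'true' or 'false' rather than a Python bool, it is easy to misread in an if statement. A small wrapper, sketched here under the result layout described in the comment above, could make that explicit. StripperError and unwrap_result are hypothetical names, not part of the gateway.

```python
class StripperError(Exception):
    """Raised when the gateway reports that extraction failed."""


def unwrap_result(result):
    """Return the payload of a gateway result, or raise with its error message."""
    # 'success' is the string 'true' or 'false', so compare against the string.
    if result["success"] == "true":
        return result["payload"]
    raise StripperError(result["error"])


# Hypothetical usage:
# text = unwrap_result(gw.entry_point.strip('samples/bus.pdf'))
```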

  • 2020-11-22 14:35

    You can download the latest tika-app-xxx.jar from here.

    Then put this .jar file in the same folder as your Python script.

    Then insert the following code into the script:

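    import os.path
    import subprocess
    
    # The Tika jar placed next to this script; '<tika-app-xxx>' is a placeholder.
    tika_dir = os.path.join(os.path.dirname(__file__), '<tika-app-xxx>.jar')
    
    def extract_pdf(source_pdf: str, target_txt: str):
        # Run Tika in plain-text mode (-t) and write its stdout to target_txt.
        # An argument list avoids shell quoting problems with spaces in paths.
        with open(target_txt, 'wb') as out:
            subprocess.run(['java', '-jar', tika_dir, '-t', source_pdf],
                           stdout=out, check=True)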
    

    The advantages of this method:

    Fewer dependencies. A single .jar file is easier to manage than a Python package.

    Multi-format support. The source_pdf argument can be the path of any kind of document (.doc, .html, .odt, etc.).

    Up to date. tika-app.jar is always released earlier than the corresponding version of the tika Python package.

    Stable. It is far more stable and better maintained (powered by Apache) than PyPDF.

    Disadvantage:

    A headless JRE is necessary.
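
    Given that a Java runtime is the one hard requirement, it may be worth checking for it before shelling out. The helpers below are a sketch with hypothetical names (java_available, tika_text_command, default_target). The only Tika option used is the -t flag from the code above.

```python
import os.path
import shutil


def java_available():
    """True when a 'java' executable can be found on PATH."""
    return shutil.which("java") is not None


def tika_text_command(jar_path, source_path):
    """Argument list that asks tika-app to print plain text (-t) to stdout."""
    return ["java", "-jar", jar_path, "-t", source_path]


def default_target(source_path):
    """Derive a .txt file name from any source document name."""
    root, _ext = os.path.splitext(source_path)
    return root + ".txt"
```

    The same command works for .doc, .html or .odt inputs, so default_target only swaps the extension and does not look at the format.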
