How to extract text from a PDF file?

孤城傲影 2020-11-22 14:05

I'm trying to extract the text included in this PDF file using Python.

I'm using the PyPDF2 module, and have the following script:

imp         


        
24 Answers
  • 2020-11-22 14:35

    PyPDF2 does work, but results may vary. I am seeing quite inconsistent results from its text extraction.

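    import PyPDF2

    # PdfReader, .pages and extract_text() replace PdfFileReader, getPage() and
    # extractText(), which were removed in PyPDF2 3.0.
    reader = PyPDF2.PdfReader(self._path)  # self._path: the PDF path held by the enclosing class
    eachPageText = []
    for page in reader.pages:
        pageText = page.extract_text()
        print(pageText)
        eachPageText.append(pageText)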
    
  • 2020-11-22 14:36

    In some cases PyPDF2 ignores whitespace and turns the extracted text into a mess. I use PyMuPDF instead and I'm really satisfied with it. You can use this link for more info.
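
    The answer above does not include code, so here is a minimal sketch of what extraction with PyMuPDF might look like. It assumes a recent PyMuPDF release, where the page method is named get_text() (older releases called it getText()). The function names join_page_texts and extract_text_pymupdf are made up for illustration.

```python
def join_page_texts(pages):
    """Concatenate the plain text of each page, separated by newlines."""
    # get_text() with no arguments returns the page content as plain text
    return "\n".join(page.get_text() for page in pages)


def extract_text_pymupdf(pdf_path):
    """Open pdf_path with PyMuPDF and return the text of all pages."""
    # Imported here so join_page_texts() can be used without PyMuPDF installed.
    import fitz  # PyMuPDF's import name; install with: pip install PyMuPDF

    with fitz.open(pdf_path) as doc:  # iterating a Document yields its pages
        return join_page_texts(doc)
```

    Calling extract_text_pymupdf('filename.pdf') should return the whole document as one string. Whether the spacing comes out better than with PyPDF2 depends on the PDF.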

  • 2020-11-22 14:38

    pdftotext is the best and simplest one! It also preserves the structure of the text.

    I tried PyPDF2, PDFMiner and a few others but none of them gave a satisfactory result.
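
    The answer does not say how pdftotext was invoked, so the following is only a sketch under one assumption: that it refers to the pdftotext command-line tool from Poppler, installed and on the PATH (there is also a Python package with the same name, which is not used here). The helper names are hypothetical. The -layout option asks the tool to keep the original physical layout, which is probably the structure-preserving behaviour mentioned above.

```python
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Build the argument list for Poppler's pdftotext CLI."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the original physical layout
        command.append("-layout")
    # "-" as the output file sends the extracted text to stdout
    command += [pdf_path, "-"]
    return command


def extract_text_pdftotext(pdf_path, keep_layout=True):
    """Run pdftotext and return its output as a string."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        encoding="utf-8",  # pdftotext writes UTF-8 by default
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

    Building the argument list in a separate function keeps the command easy to inspect without running it, and passing a list instead of a shell string avoids quoting problems with file names that contain spaces.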

  • 2020-11-22 14:38

    Here is the simplest code for extracting text:

    # importing required modules
    import PyPDF2
    
    # opening the pdf file in binary mode; the with block closes it afterwards
    with open('filename.pdf', 'rb') as pdfFileObj:
        # creating a pdf reader object (PdfFileReader was removed in PyPDF2 3.0)
        pdfReader = PyPDF2.PdfReader(pdfFileObj)
    
        # printing number of pages in pdf file
        print(len(pdfReader.pages))
    
        # creating a page object; index 5 is the sixth page, so the file
        # needs at least six pages or this raises IndexError
        pageObj = pdfReader.pages[5]
    
        # extracting text from page
        print(pageObj.extract_text())
    
  • 2020-11-22 14:39

    Use textract.

    • http://textract.readthedocs.io/en/latest/
    • https://github.com/deanmalmgren/textract

    It supports many types of files, including PDFs:

    import textract
    # process() returns bytes, so decode it to get a str
    text = textract.process("path/to/file.extension").decode("utf-8")
    
  • 2020-11-22 14:41

    The code below is a Python 3 solution to the question. Before running it, make sure the PyPDF2 library is installed in your environment. If it is not, open the command prompt and run the following command:

    pip3 install PyPDF2
    

    Solution Code:

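    import PyPDF2

    # open in binary mode; the with block closes the file when done
    with open('sample.pdf', 'rb') as pdfFileObject:
        # PdfReader and extract_text() replace PdfFileReader and extractText(),
        # which were removed in PyPDF2 3.0
        pdfReader = PyPDF2.PdfReader(pdfFileObject)
        for page in pdfReader.pages:
            print(page.extract_text())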
    