How to extract text from a PDF file?

孤城傲影 2020-11-22 14:05

I'm trying to extract the text included in this PDF file using Python.

I'm using the PyPDF2 module, and have the following script:

imp         


        
24 Answers
  • 2020-11-22 14:22

    If you run this in Anaconda on Windows, PyPDF2 may fail on PDFs with a non-standard structure or Unicode characters. If you need to open and read many PDF files, I recommend the following code: it stores the text of every PDF file in the pdfs folder (relative to the current directory) in the list pdf_text_list.

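    from tika import parser  # pip install tika
    import glob
    import os
    
    def read_pdf(filename):
        # Returns a dict; the extracted text is under the 'content' key
        return parser.from_file(filename)
    
    
    # os.path.join keeps the pattern portable across Windows and POSIX
    all_files = glob.glob(os.path.join(".", "pdfs", "*.pdf"))
    pdf_text_list = []
    for file in all_files:
        text = read_pdf(file)
        pdf_text_list.append(text['content'])
    
    print(pdf_text_list)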
    
  • 2020-11-22 14:23

    You may want to use the time-proven Xpdf and its derived tools to extract text instead, as PyPDF2 still seems to have various issues with text extraction.
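
    As a rough sketch of the Xpdf route: the pdftotext command-line tool (shipped with Xpdf, and also with Poppler) can be called from Python through subprocess. The helper names below are made up for illustration, and the example assumes pdftotext is installed and on PATH.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, keep_layout=True):
    """Build the argument list for pdftotext; '-' sends the text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [pdf_path, "-"]
    return cmd


def pdftotext(pdf_path, keep_layout=True):
    """Run pdftotext and return the extracted text as a str."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install Xpdf or Poppler first")
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext exits with an error
    )
    return result.stdout.decode("utf-8", errors="replace")
```

    Calling pdftotext("my.pdf") would then return the text of the whole file; pass keep_layout=False if the column alignment gets in the way.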

    The long answer is that text can be encoded inside a PDF in many different ways: an extractor may have to decode the PDF string itself, then map it through a CMap, then analyze the distances between words and letters, and so on.

    If the PDF is damaged (i.e. it displays the correct text but copying it yields garbage) and you really need to extract the text, consider converting the PDF into images (using ImageMagick) and then using Tesseract to get the text from the images via OCR.
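
    The OCR fallback could look roughly like the sketch below, which shells out to the ImageMagick and Tesseract command-line tools. The function names are hypothetical, both tools are assumed to be on PATH, and ImageMagick usually needs Ghostscript installed to read PDFs.

```python
import glob
import os
import shutil
import subprocess
import tempfile


def build_render_cmd(pdf_path, out_pattern, dpi=300, magick="magick"):
    """ImageMagick command that rasterises each PDF page to an image file."""
    return [magick, "-density", str(dpi), pdf_path, out_pattern]


def build_ocr_cmd(image_path, lang="eng"):
    """Tesseract command; the output base 'stdout' prints the text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Render every page with ImageMagick, OCR each one, return the joined text."""
    # ImageMagick 7 ships "magick"; version 6 only provides "convert"
    magick = "magick" if shutil.which("magick") else "convert"
    with tempfile.TemporaryDirectory() as tmp:
        pattern = os.path.join(tmp, "page-%04d.png")  # zero-padded so sorting works
        subprocess.run(build_render_cmd(pdf_path, pattern, dpi, magick), check=True)
        pages = []
        for image in sorted(glob.glob(os.path.join(tmp, "page-*.png"))):
            done = subprocess.run(build_ocr_cmd(image, lang),
                                  capture_output=True, check=True)
            pages.append(done.stdout.decode("utf-8", errors="replace"))
    return "\n".join(pages)
```

    A higher dpi tends to improve recognition at the cost of speed; 300 is only a starting point.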

  • 2020-11-22 14:23

    In 2020 the solutions above did not work for the particular PDF I was working with. Below is what did the trick. I am on Windows 10 and Python 3.8.

    Test pdf file: https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing

    #pip install pdfminer.six
    import io
    
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    
    
    def convert_pdf_to_txt(path):
        '''Convert PDF content from a file path to text.
    
        :param path: the file path
        :return: the extracted text as a string
        '''
        rsrcmgr = PDFResourceManager()
        codec = 'utf-8'
        laparams = LAParams()
    
        with io.StringIO() as retstr:
            with TextConverter(rsrcmgr, retstr, codec=codec,
                               laparams=laparams) as device:
                with open(path, 'rb') as fp:
                    interpreter = PDFPageInterpreter(rsrcmgr, device)
                    password = ""
                    maxpages = 0
                    caching = True
                    pagenos = set()
    
                    for page in PDFPage.get_pages(fp,
                                                  pagenos,
                                                  maxpages=maxpages,
                                                  password=password,
                                                  caching=caching,
                                                  check_extractable=True):
                        interpreter.process_page(page)
    
                    return retstr.getvalue()
    
    
    if __name__ == "__main__":
        print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf')) 
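
    For simple cases, pdfminer.six also offers a high-level helper, pdfminer.high_level.extract_text, which wraps most of the setup above. A minimal sketch follows; the wrapper name pdf_to_text is my own, and it has not been tested against the PDF linked above.

```python
import os


def pdf_to_text(path, page_numbers=None):
    """Return the text of a PDF using pdfminer.six's high-level helper.

    page_numbers is an optional iterable of zero-based page indexes;
    None means all pages.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    # Imported lazily so this module loads even without pdfminer.six installed
    from pdfminer.high_level import extract_text
    return extract_text(path, page_numbers=page_numbers)
```

    For example, pdf_to_text('C:\\Path\\To\\Test_PDF.pdf', page_numbers=[0]) would return only the first page.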
    
  • 2020-11-22 14:24

    I recommend using pymupdf or pdfminer.six.

    At the time of writing, these packages were not maintained:

    • PyPDF2, PyPDF3, PyPDF4
    • pdfminer (without .six)

    How to read plain text with pymupdf

    There are different options which will give different results, but the most basic one is:

    import fitz  # this is pymupdf
    
    with fitz.open("my.pdf") as doc:
        text = ""
        for page in doc:
            text += page.get_text()  # named getText() in older PyMuPDF releases
    
    print(text)
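
    One of the other options is the "blocks" mode, where get_text("blocks") returns one tuple per block instead of a single string. The sketch below assumes each tuple is shaped like (x0, y0, x1, y1, text, block_no, block_type), with block_type 0 meaning text; the helper names are hypothetical.

```python
def blocks_to_text(blocks):
    """Join text blocks in top-to-bottom, left-to-right order.

    Each block is assumed to be a tuple shaped like
    (x0, y0, x1, y1, text, block_no, block_type), with block_type 0 for text.
    """
    text_blocks = [b for b in blocks if b[6] == 0]  # drop image blocks
    # Sort by rounded top edge first, then by left edge
    text_blocks.sort(key=lambda b: (round(b[1]), b[0]))
    return "\n".join(b[4].strip() for b in text_blocks)


def read_pdf_blocks(path):
    """Extract text page by page using PyMuPDF's "blocks" mode."""
    import fitz  # PyMuPDF, imported lazily; needs `pip install pymupdf`
    with fitz.open(path) as doc:
        return "\n\n".join(blocks_to_text(page.get_text("blocks")) for page in doc)
```

    Sorting by position is a crude reading-order heuristic; it can help with pages whose blocks are stored out of order, but it will interleave multi-column layouts.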
    
  • 2020-11-22 14:24

    A more robust way, whether there are several PDFs or just one:

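    import os
    from PyPDF2 import PdfFileReader  # PyPDF2 < 3.0 names; 3.x renames this to PdfReader
    
    mydir = "path/to/pdfs"  # placeholder: directory holding one or more PDFs
    
    for arch in os.listdir(mydir):
        if not arch.lower().endswith(".pdf"):
            continue  # skip anything that is not a PDF
        archpath = os.path.join(mydir, arch)
        # Open in binary mode; the with block closes the file for us
        with open(archpath, "rb") as pdfFileObj:
            pdfReader = PdfFileReader(pdfFileObj)
            # Extract every page, not only page 0
            ley = "".join(pdfReader.getPage(n).extractText()
                          for n in range(pdfReader.numPages))
        # One .txt per PDF, so later files do not overwrite earlier ones
        txtpath = os.path.splitext(archpath)[0] + ".txt"
        with open(txtpath, "w", encoding="utf-8") as file1:
            file1.write(ley)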
                
    
  • 2020-11-22 14:25

    I was looking for a simple solution for Python 3.x on Windows. textract does not seem to support that setup, which is unfortunate, but if you want a simple solution for Windows/Python 3, check out the tika package; it is really straightforward for reading PDFs.

    Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

    from tika import parser # pip install tika
    
    raw = parser.from_file('sample.pdf')
    print(raw['content'])
    

    Note that Tika is written in Java, so you will need a Java runtime installed.
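
    In my experience the 'content' value can be None (for example for scanned PDFs with no text layer) and often carries many blank lines. A small, hypothetical helper to tidy it up might look like this:

```python
def tika_text(parsed):
    """Return cleaned text from a Tika result dict, or "" when there is none."""
    content = (parsed or {}).get("content") or ""  # 'content' may be None
    lines = (line.strip() for line in content.splitlines())
    return "\n".join(line for line in lines if line)  # drop blank lines
```

    It would be used as tika_text(parser.from_file('sample.pdf')).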
