How to extract text from a PDF file?

孤城傲影 2020-11-22 14:05

I'm trying to extract the text included in this PDF file using Python.

I'm using the PyPDF2 module, and have the following script:

imp         


        
24 Answers
  •  情歌与酒
    2020-11-22 14:24

    A more robust approach, which works whether the directory holds one PDF or several:

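    import os
    # PyPDF2 < 3.0 API; 3.0 renamed these to PdfReader, reader.pages and extract_text()
    from PyPDF2 import PdfFileReader

    mydir = "path/to/pdf_dir"  # placeholder: the directory that holds your PDF(s)

    for arch in os.listdir(mydir):
        if not arch.lower().endswith(".pdf"):
            continue  # skip anything that is not a PDF
        archpath = os.path.join(mydir, arch)
        with open(archpath, "rb") as f:  # PDFs must be opened in binary mode
            pdfReader = PdfFileReader(f)
            # Collect the text of every page, not only the first one
            ley = [pdfReader.getPage(i).extractText()
                   for i in range(pdfReader.numPages)]
        # One .txt per PDF, so a later PDF does not overwrite an earlier result
        outpath = os.path.splitext(archpath)[0] + ".txt"
        with open(outpath, "w", encoding="utf-8") as file1:
            file1.write("\n".join(ley))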
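
The same directory loop can also be sketched against pypdf, the maintained successor of PyPDF2. This is a minimal sketch and not part of the answer above: it assumes pypdf is installed, and the names `extract_pdf_text` and `extract_directory` are hypothetical helpers introduced here for illustration. The extractor is passed in as a parameter so the directory logic can be tried without any real PDF.

```python
import os


def extract_pdf_text(path):
    """Return the text of every page of one PDF, joined by newlines."""
    # Assumption: pypdf is installed (pip install pypdf). It is imported
    # lazily so the directory helper below works without it.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # Pages with no extractable text (e.g. scanned images) contribute "".
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_directory(mydir, extract=extract_pdf_text):
    """Write <name>.txt next to each <name>.pdf in mydir.

    Returns the list of text files written, in sorted order.
    """
    written = []
    for arch in sorted(os.listdir(mydir)):
        base, ext = os.path.splitext(arch)
        if ext.lower() != ".pdf":
            continue  # ignore non-PDF files
        out_path = os.path.join(mydir, base + ".txt")
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(extract(os.path.join(mydir, arch)))
        written.append(out_path)
    return written
```

Calling `extract_directory("path/to/pdf_dir")` (a placeholder path) would write one text file per PDF. Extraction quality still depends on how each PDF encodes its text, so scanned documents may come back empty.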
                
    
