Pdfminer python 3.5

误落风尘 2020-12-14 17:54

I have followed a few tutorials, but I am not able to get this code block to run. I made the necessary switches from StringIO to BytesIO (I believe?).

I am unsur

4 answers
  • 2020-12-14 18:03

    There is a solution for Python 3.5: you need pdfminer.six. On Windows 10 I could easily install it with

    pip install pdfminer.six
    

    You can check the installed version with

    import pdfminer
    pdfminer.__version__
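
The version check assumes the package is importable. A minimal sketch that also reports when it is missing; the helper name `installed_pdfminer_version` is hypothetical, and the `"unknown"` fallback is an assumption for builds that expose no `__version__` attribute:

```python
import importlib
import importlib.util


def installed_pdfminer_version():
    """Return pdfminer's version string, or None if the package is not importable."""
    # find_spec returns None when no top-level package named "pdfminer" is installed
    if importlib.util.find_spec("pdfminer") is None:
        return None
    module = importlib.import_module("pdfminer")
    # Fall back to a placeholder if the module does not define __version__
    return getattr(module, "__version__", "unknown")


print(installed_pdfminer_version())
```

A result of `None` suggests that `pip install pdfminer.six` went into a different interpreter than the one running the script.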
    

    I haven't tested it intensively yet, but I could run the following code for the pdf→text and pdf→html conversions

  • 2020-12-14 18:17

    The original pdfminer doesn't support Python 3.5. It works only on Python 2 (2.6 or newer). I faced the same issue; running the code under Python 2 should solve your problem.

  • 2020-12-14 18:23

    In my case, on Python 3.7, it worked like a charm.

    Here is the code I used:

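    from io import StringIO
    
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    
    def convert_pdf_to_txt(path_to_file):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()  # in-memory buffer that collects the extracted text
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0     # 0 means no page limit
        caching = True
        pagenos = set()  # empty set means all pages
    
        # PDFs must be opened in binary mode
        with open(path_to_file, 'rb') as fp:
            for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                          caching=caching, check_extractable=True):
                interpreter.process_page(page)
    
        text = retstr.getvalue()
    
        device.close()
        retstr.close()
        return text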
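
In the function above, the empty `pagenos` set combined with `maxpages = 0` requests every page. To restrict extraction, `PDFPage.get_pages` expects zero-based page indices, as far as I can tell from pdfminer.six, so page 1 of the document is index 0. A small sketch of a converter from a human-style page spec; the helper `to_pagenos` and its `"1-3,5"` spec format are hypothetical and not part of pdfminer:

```python
def to_pagenos(spec):
    """Convert a 1-based page spec such as "1-3,5" into a set of 0-based indices."""
    pagenos = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate empty pieces, e.g. an empty spec
        if "-" in part:
            first, last = part.split("-", 1)
            first, last = int(first), int(last)
        else:
            first = last = int(part)
        if first < 1 or last < first:
            raise ValueError("invalid page range: %r" % part)
        # Shift from 1-based (inclusive) to 0-based indices
        pagenos.update(range(first - 1, last))
    return pagenos


print(sorted(to_pagenos("1-3,5")))  # [0, 1, 2, 4]
```

An empty spec yields an empty set, which keeps the "all pages" behaviour.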
    
  • 2020-12-14 18:26

    Improved solution (Dec 2016)

    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import HTMLConverter,TextConverter,XMLConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfpage import PDFPage
    import io
    
    def convert(case, fname, pages=None):
        # Zero-based page indices; an empty set means all pages
        pagenums = set(pages) if pages else set()
        manager = PDFResourceManager()
        codec = 'utf-8'
        caching = True
    
        if case == 'text':
            output = io.StringIO()   # str buffer for plain-text output
            converter = TextConverter(manager, output, codec=codec, laparams=LAParams())
        elif case == 'HTML':
            output = io.BytesIO()    # bytes buffer: HTMLConverter encodes its output with codec
            converter = HTMLConverter(manager, output, codec=codec, laparams=LAParams())
        else:
            raise ValueError("case must be 'text' or 'HTML'")
    
        interpreter = PDFPageInterpreter(manager, converter)
    
        with open(fname, 'rb') as infile:
            for page in PDFPage.get_pages(infile, pagenums, caching=caching, check_extractable=True):
                interpreter.process_page(page)
    
        # Close the converter before reading the buffer:
        # HTMLConverter writes its closing markup in close()
        converter.close()
        convertedPDF = output.getvalue()
        output.close()
        return convertedPDF
    
    #//////////// main ///////////////////////
    filePDF  = 'myDir/myPDF.pdf'     # input
    fileHTML = 'myDir/myHTML.html'   # output
    fileTXT  = 'myDir/myTXT.txt'     # output
    
    case = "HTML"
    
    if case == 'HTML':
        convertedPDF = convert('HTML', filePDF, pages=[0,1])
        # Binary mode takes no encoding argument; the data is already UTF-8 bytes
        fileConverted = open(fileHTML, "wb")
    elif case == 'text':
        convertedPDF = convert('text', filePDF, pages=[0,1])
        fileConverted = open(fileTXT, "w", encoding="utf-8")
    
    fileConverted.write(convertedPDF)
    fileConverted.close()
    #print(convertedPDF)
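
The StringIO/BytesIO split in this answer is the usual stumbling block when moving such code to Python 3: `io.StringIO` holds `str`, `io.BytesIO` holds `bytes`, and neither accepts the other type. A minimal stdlib-only sketch of that behaviour, independent of pdfminer (the variable names are illustrative):

```python
import io

# Text buffer: accepts and returns str
text_buffer = io.StringIO()
text_buffer.write("caf\u00e9")
text_value = text_buffer.getvalue()

# Binary buffer: accepts and returns bytes, so text must be encoded first
byte_buffer = io.BytesIO()
byte_buffer.write("caf\u00e9".encode("utf-8"))
byte_value = byte_buffer.getvalue()

# Writing str into a BytesIO raises TypeError in Python 3
try:
    io.BytesIO().write("plain str")
    str_rejected = False
except TypeError:
    str_rejected = True

print(type(text_value).__name__, type(byte_value).__name__, str_rejected)
```

The same rule applies to the output files: a `bytes` result goes to a file opened with `"wb"`, a `str` result to one opened with `"w"` and an explicit encoding.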
    