How to extract text from a PDF file?

孤城傲影 2020-11-22 14:05

I'm trying to extract the text included in this PDF file using Python.

I'm using the PyPDF2 module, and have the following script:

imp         


        
24 Answers
  •  伪装坚强ぢ
    2020-11-22 14:35

    You can download the latest tika-app-xxx.jar from the Apache Tika download page.

    Then put this .jar file in the same folder as your Python script.

    Then insert the following code into the script:

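    import os.path
    import subprocess
    
    # Replace 'tika-app-xxx.jar' with the file name of the jar you downloaded.
    tika_jar = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'tika-app-xxx.jar')
    
    def extract_pdf(source_pdf: str, target_txt: str) -> None:
        # '-t' makes tika-app print the document's plain text to stdout,
        # which is redirected into target_txt. An argument list (instead of
        # a shell string) keeps paths with spaces intact.
        with open(target_txt, 'wb') as out:
            subprocess.run(['java', '-jar', tika_jar, '-t', source_pdf], stdout=out, check=True)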
    

    The advantages of this method:

    Fewer dependencies. A single .jar file is easier to manage than a Python package.

    Multi-format support. The source_pdf argument can be the path of any kind of document (.doc, .html, .odt, etc.).

    Up to date. tika-app.jar is always released earlier than the corresponding version of the tika Python package.

    Stable. It is far more stable and better maintained (it is an Apache project) than PyPDF.
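
    As a rough illustration of the multi-format point, the sketch below builds the same kind of "java -jar ... -t" command for any input document and derives a .txt output name from it. The helper names tika_command and text_target are hypothetical, and tika-app-xxx.jar is only the placeholder jar name, not a real file name.

```python
from pathlib import Path
from typing import List


def tika_command(jar: str, source: str) -> List[str]:
    # Argument list for running tika-app in plain-text mode ("-t") on one document.
    return ["java", "-jar", jar, "-t", source]


def text_target(source: str) -> str:
    # Swap the document's extension for ".txt", whatever the input format is.
    return str(Path(source).with_suffix(".txt"))


if __name__ == "__main__":
    # Hypothetical file names, used only to show that the command has the same shape.
    for doc in ["report.pdf", "notes.html", "letter.odt"]:
        print(tika_command("tika-app-xxx.jar", doc), "->", text_target(doc))
```

    The command has the same shape for every format; only the input path changes, so nothing format-specific is needed on the Python side.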

    Disadvantage:

    A Java runtime is required (a headless JRE is enough).
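
    Since the approach depends on a Java runtime, it may help to check for one before calling the jar. A minimal sketch, assuming the runtime is exposed as a "java" executable on PATH; the function name java_available is hypothetical:

```python
import shutil


def java_available() -> bool:
    # shutil.which returns the executable's full path, or None if it is not on PATH.
    return shutil.which("java") is not None


if __name__ == "__main__":
    if java_available():
        print("java found; the tika-app jar can be run")
    else:
        print("java not found; install a JRE first")
```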
