How to extract text from a PDF file?

孤城傲影 2020-11-22 14:05

I'm trying to extract the text included in this PDF file using Python.

I'm using the PyPDF2 module, and have the following script:

imp         


        
24 answers
  •  醉酒成梦
    2020-11-22 14:25

    I was looking for a simple solution for Python 3.x on Windows. textract doesn't seem to support that setup, which is unfortunate, but if you want something simple for Windows/Python 3, check out the tika package; it is really straightforward for reading PDFs.

    Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

    from tika import parser # pip install tika
    
    raw = parser.from_file('sample.pdf')
    print(raw['content'])
    

    Note that Tika is written in Java, so you will need a Java runtime installed.
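
    In case it helps, here is a slightly more defensive sketch of the same idea. It assumes the dictionary returned by parser.from_file may have 'content' set to None (for example, for a scanned PDF with no text layer) and that the extracted text often comes with extra blank lines. The helper names clean_content and extract_pdf_text are my own, not part of tika.

```python
from typing import Optional


def clean_content(parsed: dict) -> str:
    """Return the extracted text from a Tika result dict, or '' if there is none."""
    # 'content' may be missing or None when nothing could be extracted
    content: Optional[str] = parsed.get("content")
    if not content:
        return ""
    # Strip each line and drop the empty ones
    lines = (line.strip() for line in content.splitlines())
    return "\n".join(line for line in lines if line)


def extract_pdf_text(path: str) -> str:
    """Parse a PDF with Tika and return cleaned text (needs tika and a Java runtime)."""
    from tika import parser  # pip install tika; imported lazily on first call
    return clean_content(parser.from_file(path))


# Usage (hypothetical file name):
# print(extract_pdf_text('sample.pdf'))
```

    Keeping the cleanup in its own function means you can check it against a plain dictionary without starting the Tika server.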
