I'm trying to extract the text included in this PDF file using Python.
I'm using the PyPDF2 module, and have the following script:
imp
You can download the latest tika-app-xxx.jar from here.
Then put this .jar file in the same folder as your Python script file,
and insert the following code in the script:
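import os
import os.path

# Path to the Tika jar next to this script; replace the placeholder with the real file name.
tika_dir = os.path.join(os.path.dirname(__file__), 'tika-app-xxx.jar')

def extract_pdf(source_pdf: str, target_txt: str):
    # Quote every path so that spaces in file names do not break the shell command.
    os.system('java -jar "{}" -t "{}" > "{}"'.format(tika_dir, source_pdf, target_txt))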
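The same call can also be made without going through the shell. The following is only a sketch of that idea, not part of the original answer: the helper names `build_tika_command` and `extract_text` are made up for illustration, and the jar path is passed in as an argument instead of being derived from `__file__`. It uses the same `-t` (plain text) option as the command above.

```python
import subprocess


def build_tika_command(tika_jar, source_path):
    """Return the argument list for `java -jar <jar> -t <source>`."""
    return ["java", "-jar", str(tika_jar), "-t", str(source_path)]


def extract_text(tika_jar, source_path, target_txt):
    """Run Tika and write the extracted plain text to target_txt."""
    command = build_tika_command(tika_jar, source_path)
    # Passing a list (no shell) means spaces and special characters in
    # file names are not interpreted by a shell.
    with open(target_txt, "wb") as out:
        # check=True raises CalledProcessError if java exits with an error.
        subprocess.run(command, stdout=out, check=True)
```

A call would look like `extract_text("tika-app-xxx.jar", "example.pdf", "example.txt")`, where all three file names are placeholders. Compared with `os.system`, a failed run raises an exception instead of silently leaving an empty output file.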
Advantages of this approach:
Fewer dependencies: a single .jar file is easier to manage than a Python package.
Multi-format support: the source_pdf argument can be the path of any kind of document (.doc, .html, .odt, etc.).
Up to date: tika-app.jar is always released earlier than the corresponding version of the tika Python package.
Stable: it is far more stable and better maintained (it is an Apache project) than PyPDF.
A headless JRE (jre-headless) is required to run the .jar file.
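Since the approach depends on Java being installed and accepts documents other than PDFs, it can help to check both up front. The sketch below is an illustration under assumptions, not part of the original answer: `java_available` and `target_for` are hypothetical helper names, and the check only confirms that some `java` executable is on the PATH, not which version it is.

```python
import shutil
from pathlib import Path


def java_available():
    """Return True if a `java` executable is found on PATH."""
    return shutil.which("java") is not None


def target_for(source_path):
    """Map any document path (.pdf, .doc, .odt, ...) to a .txt path."""
    return Path(source_path).with_suffix(".txt")
```

For example, `target_for("report.odt")` gives `Path("report.txt")`, which could be used as the output file for the extraction command regardless of the input format.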