I'm trying to parse a few PDF files that contain engineering drawings in order to extract the text data in them. I tried using Tika as a JAR with Python and using it with the jnius
The solution given by Chris Mattmann is correct, but I would like to add a couple of points. The following snippet writes the text parsed from the PDF to a text file. Open the output file with UTF-8 encoding so that non-ASCII text (for example, Chinese or Japanese characters) is written correctly.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
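import tika
tika.initVM()
from tika import parser

# Parse the PDF; the result is a dict with "metadata" and "content" keys.
parsed_pdf = parser.from_file('file.pdf')

# "content" can be None when no text is extracted (e.g. scanned drawings), so fall back to "".
with open('file.txt', 'w', encoding='utf-8') as file:
    file.write(parsed_pdf["content"] or "")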
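
Since the question mentions several PDF files, the same idea can be wrapped in a small helper. This is only a sketch: `save_pdf_text` and its `parse` parameter are names I made up for illustration, not part of Tika. By default it calls `parser.from_file` as in the snippet above, and it assumes the output should go next to the PDF with a `.txt` extension.

```python
from pathlib import Path


def save_pdf_text(pdf_path, parse=None):
    """Write the text extracted from pdf_path to a sibling .txt file.

    `parse` is a hypothetical hook: any callable that takes a path string
    and returns a dict with a "content" key. It defaults to Tika's
    parser.from_file.
    """
    if parse is None:
        # Imported lazily so the helper can be tried without Tika installed.
        from tika import parser
        parse = parser.from_file

    parsed = parse(str(pdf_path))
    # "content" may be None when no text could be extracted.
    text = parsed.get("content") or ""

    out_path = Path(pdf_path).with_suffix(".txt")
    # Explicit UTF-8 so Chinese/Japanese characters are written correctly.
    out_path.write_text(text, encoding="utf-8")
    return out_path
```

It could then be called in a loop, for example `for pdf in Path("drawings").glob("*.pdf"): save_pdf_text(pdf)`, where `drawings` is a placeholder directory name.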