I'm trying to parse a few PDF files that contain engineering drawings to obtain text data in the files. I tried using Tika as a jar with Python and using it with the jnius
Can you please share the file you are looking at? The easiest way to do this would probably be to attach it to a GitHub issue in my repository.
That said, if you are trying to use OCR and Tika, you need to run through the Tika OCR guide (http://wiki.apache.org/tika/TikaOCR) and get Tesseract installed. Once Tesseract is installed, then you need to double check whether or not you have an instance of tika-server running (e.g., ps aux | grep tika). If you do, kill it (tika-python runs the Tika REST server in the background as its main interface to Tika; having a fresh running version of it after Tesseract OCR is installed helps to eliminate any odd possibilities).
Once Tesseract OCR is installed and no tika-server is running, start your python2.7 interpreter (or script), and then do something like:
from tika import parser
parsed = parser.from_file('/path/to/file')
print parsed["content"] # should be the text returned from OCR
HTH! --Chris
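Before restarting tika-server, it can help to confirm that the Tesseract binary is visible to the process that will launch Tika. This is only a minimal sketch: it assumes the executable is called `tesseract` and is found through PATH, it needs Python 3.3+ for `shutil.which`, and the helper name `tesseract_available` is made up for illustration.

```python
import shutil


def tesseract_available(binary="tesseract"):
    """Return True if an executable named `binary` is found on PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH.")
    else:
        print("Tesseract not found on PATH; install it before retrying.")
```

If this reports that Tesseract is missing, the OCR step described above has nothing to call, so it is worth fixing that first.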
You need to download the Tika Server Jar and run it first. Check this link: http://wiki.apache.org/tika/TikaJAXRS
1. Download the jar.
2. Run it with Java:
java -jar tika-server-x.x.jar --port xxxx
3. In your code, add
tika.TikaClientOnly = True
instead of tika.initVM()
4. Change
parsed = parser.from_file('/path/to/file')
to
parsed = parser.from_file('/path/to/file', '/path/to/server')
You will get the server path in Step 2, when the Tika server starts; just plug that in here.
Good luck!
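Put together, the client-only setup might look roughly like the sketch below. The host `localhost` and port `9998` are assumptions (use whatever your server prints when it starts), and the helper names `server_endpoint` and `parse_via_server` are hypothetical, not part of tika-python.

```python
def server_endpoint(host="localhost", port=9998):
    """Build the base URL of a running tika-server (host and port are assumptions)."""
    return "http://{}:{}".format(host, port)


def parse_via_server(path, endpoint):
    """Parse `path` through an already-running tika-server at `endpoint`."""
    import tika
    tika.TikaClientOnly = True  # ask tika-python not to start its own server
    from tika import parser
    return parser.from_file(path, serverEndpoint=endpoint)


# Example usage (requires a running server and the tika package):
# parsed = parse_via_server("/path/to/file", server_endpoint())
# print(parsed["content"])
```

The imports live inside `parse_via_server` so that `TikaClientOnly` is set before the parser module is used.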
The solution given by Chris Mattmann is right. However, I would like to add a couple of points. Use the following code snippet to write the text extracted from the PDF file into a text file. Open the output file with an appropriate encoding such as UTF-8 (needed, for example, for Chinese/Japanese characters).
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import tika
tika.initVM()
from tika import parser
parsed_pdf = parser.from_file('file.pdf')
with open('file.txt', 'w', encoding='utf-8') as file:
    file.write(parsed_pdf["content"])
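One thing to watch for: if the "content" entry comes back as None (which I believe can happen for scanned pages when no OCR is available), `write()` raises a TypeError. A small guard along these lines avoids that; `content_or_empty` and `save_text` are hypothetical helper names, and `parsed` is assumed to be the dict returned by `parser.from_file`.

```python
def content_or_empty(parsed):
    """Return the extracted text, or "" when the result has no content."""
    text = parsed.get("content")
    return text if text is not None else ""


def save_text(parsed, out_path):
    """Write the extracted text to `out_path` as UTF-8."""
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(content_or_empty(parsed))
```

With this, an empty result produces an empty text file instead of a crash.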
I have never tried the Python tika package, but pyjnius is working fine for me. Here is my code:
def parse_file(filename):
    """
    Import TIKA classes and parse input filename
    """
    import os
    # The classpath must be set before jnius is imported
    os.environ['CLASSPATH'] = "/path/to/tika-app.jar"
    from jnius import autoclass
    from jnius import JavaException
    # Import the Java classes
    Tika = autoclass('org.apache.tika.Tika')
    Metadata = autoclass('org.apache.tika.metadata.Metadata')
    FileInputStream = autoclass('java.io.FileInputStream')
    tika = Tika()
    # Allow up to 10 MB of extracted text
    tika.setMaxStringLength(10*1024*1024)
    meta = Metadata()
    # Report the error and return None if parsing fails
    try:
        text = tika.parseToString(FileInputStream(filename), meta)
        return text
    except (JavaException, UnicodeDecodeError) as e:
        print("ERROR: %s" % (e))
        return None
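Since the question is about several PDF files, here is one way the function above could be applied to a batch. This is just a sketch: `parse_many` is a made-up helper, and it only assumes that the `parse` callable behaves like `parse_file`, returning the text or None on failure.

```python
def parse_many(paths, parse):
    """Apply `parse` to each path and keep only the files that yielded text."""
    results = {}
    for path in paths:
        text = parse(path)
        if text is not None:  # None signals a failed parse
            results[path] = text
    return results


# Example usage with the function from this answer:
# texts = parse_many(["drawing1.pdf", "drawing2.pdf"], parse_file)
```

Passing the parser in as an argument keeps the loop independent of whether you use pyjnius or the tika package.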
Install tika with the following pip command:
pip install tika
The following code works fine for extracting data:
import io
import os
from tika import parser
def extract_text(file):
    parsed = parser.from_file(file)
    parsed_text = parsed['content']
    parsed_text = parsed_text.lower()
    return parsed_text
file_name_with_extension = input("Enter File Name:")
text = extract_text(file_name_with_extension)
print(text)
It will print only the text content of the file, converted to lowercase. The supported file formats are listed in the Apache Tika documentation.
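If you also want to know what kind of file was parsed, the dict returned by `parser.from_file` normally carries a "metadata" entry next to "content". A rough sketch follows; `describe` is a hypothetical helper, and the "Content-Type" key is an assumption about what Tika reports for your files.

```python
def describe(parsed):
    """Summarise a parse result: reported content type and length of the text."""
    metadata = parsed.get("metadata") or {}
    content = parsed.get("content") or ""
    return {
        "content_type": metadata.get("Content-Type"),
        "characters": len(content),
    }


# Example usage (requires the tika package):
# from tika import parser
# print(describe(parser.from_file("file.pdf")))
```

A character count of zero is a quick hint that the file produced no text, for example a scanned drawing without OCR.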