How can I use the Tika package (https://github.com/chrismattmann/tika-python) in Python 2.7 to parse PDF files?

南旧 2021-01-01 07:48

I'm trying to parse a few PDF files that contain engineering drawings to obtain the text data in the files. I tried using Tika as a jar with Python and using it with the jnius

5 answers
  • 2021-01-01 08:12

    Can you please share the file you are looking at? The easiest way would probably be to attach it to a GitHub issue in my repository.

    That said, if you are trying to use OCR and Tika, you need to run through the Tika OCR guide (http://wiki.apache.org/tika/TikaOCR) and get Tesseract installed. Once Tesseract is installed, then you need to double check whether or not you have an instance of tika-server running (e.g., ps aux | grep tika). If you do, kill it (tika-python runs the Tika REST server in the background as its main interface to Tika; having a fresh running version of it after Tesseract OCR is installed helps to eliminate any odd possibilities).
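
    The process check described above can also be scripted. The sketch below is not part of tika-python: `find_tika_server_pids` and `running_tika_server_pids` are hypothetical helpers that scan `ps aux`-style output for lines mentioning tika-server. It assumes a Unix-like system and the usual `ps aux` layout, with the PID in the second column.

```python
import subprocess


def find_tika_server_pids(ps_output):
    """Return PIDs of lines in `ps aux`-style output that mention tika-server.

    Hypothetical helper: assumes the PID is the second whitespace-separated
    column, as in typical `ps aux` output.
    """
    pids = []
    for line in ps_output.splitlines():
        # Ignore a grep process that merely searched for the name.
        if "tika-server" in line and "grep" not in line:
            fields = line.split()
            if len(fields) > 1 and fields[1].isdigit():
                pids.append(int(fields[1]))
    return pids


def running_tika_server_pids():
    """Run `ps aux` and return the PIDs of any tika-server processes."""
    output = subprocess.check_output(["ps", "aux"])
    return find_tika_server_pids(output.decode("utf-8", "replace"))
```

    `running_tika_server_pids()` returns an empty list when nothing matches; any PIDs it does return are the processes to kill before starting a fresh parse.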

    Once Tesseract OCR is installed and no tika-server is running, start your Python 2.7 interpreter (or script) and do something like:

    from tika import parser
    parsed = parser.from_file('/path/to/file')
    print parsed["content"] # should be the text returned from OCR
    

    HTH! --Chris

  • 2021-01-01 08:23

    You need to download the Tika Server Jar and run it first. Check this link: http://wiki.apache.org/tika/TikaJAXRS

    1. Download the Jar
    2. Store it somewhere and run it as java -jar tika-server-x.x.jar --port xxxx
    3. In your code, you no longer need tika.initVM(); set tika.TikaClientOnly = True instead.
    4. Change parsed = parser.from_file('/path/to/file') to parsed = parser.from_file('/path/to/file', '/path/to/server'). The server path is printed when the Tika server starts in step 2; plug it in here.
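
    Put together, steps 3 and 4 might look like the sketch below. It assumes a Tika server was started by hand as in step 2. The helper names `server_endpoint` and `parse_with_running_server` are hypothetical and not part of tika-python, and the host and port in the usage comment are placeholders for whatever your server reports on startup.

```python
def server_endpoint(host, port):
    """Build the base URL of a manually started Tika server."""
    return "http://%s:%d" % (host, port)


def parse_with_running_server(path, endpoint):
    """Parse `path` through an already running Tika server at `endpoint`."""
    # Imported lazily so this module loads even where tika is not installed.
    import tika
    tika.TikaClientOnly = True  # step 3: replaces tika.initVM()
    from tika import parser
    # Step 4: pass the server address as the second argument.
    return parser.from_file(path, endpoint)


# Example usage (placeholder host and port):
# parsed = parse_with_running_server("/path/to/file", server_endpoint("localhost", 9998))
# print(parsed["content"])
```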

    Good luck!

  • 2021-01-01 08:23

    The solution given by Chris Mattmann is right, but I would like to add a couple of points. Use the following snippet to write the text extracted from the PDF to a text file, and open the output file with UTF-8 encoding so that non-ASCII text (for example, Chinese or Japanese characters) is preserved.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    
    import io
    
    import tika
    
    tika.initVM()
    from tika import parser
    
    parsed_pdf = parser.from_file('file.pdf')
    
    # io.open accepts encoding= on Python 2.7 as well as Python 3
    with io.open('file.txt', 'w', encoding='utf-8') as file:
        file.write(parsed_pdf["content"])
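
    One case the snippet above does not cover: `parsed_pdf["content"]` can come back as `None` when Tika extracts no text (for example, from a scanned PDF without OCR), and writing `None` to a file fails. A minimal sketch of a guard follows. `save_text` is a hypothetical helper name, and it relies on `io.open` so the same code runs on Python 2.7 and Python 3.

```python
import io


def save_text(content, out_path):
    """Write extracted text to out_path as UTF-8; treat missing content as empty.

    Returns the number of characters written.
    """
    text = content if content is not None else u""
    with io.open(out_path, "w", encoding="utf-8") as out:
        out.write(text)
    return len(text)
```

    With the snippet above, the call would be `save_text(parsed_pdf["content"], 'file.txt')`.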
    
  • 2021-01-01 08:24

    I never tried tika-python, but pyjnius is working fine for me. Here is my code:

    def parse_file(filename):
       """
       Import TIKA classes and parse input filename
       """
    
       import os
       os.environ['CLASSPATH'] = "/path/to/tika-app.jar"
       from jnius import autoclass
       from jnius import JavaException
    
       # Import the Java classes 
       Tika = autoclass('org.apache.tika.Tika')
       Metadata = autoclass('org.apache.tika.metadata.Metadata')
       FileInputStream = autoclass('java.io.FileInputStream')
    
       tika = Tika()
       tika.setMaxStringLength(10*1024*1024)
       meta = Metadata()
    
       # Report the error and return None if parsing fails
       try:
           text = tika.parseToString(FileInputStream(filename), meta)
           return text
       except (JavaException, UnicodeDecodeError) as e:
           print("ERROR: %s" % (e))
       return None
    
  • 2021-01-01 08:33

    Install tika with the following pip command:

    pip install tika
    

    The following code works fine for extracting data:

    from tika import parser
    
    def extract_text(file):
        parsed = parser.from_file(file)
        # content may be None if no text was extracted
        parsed_text = parsed['content'] or ''
        parsed_text = parsed_text.lower()
        return parsed_text
    
    # On Python 2.7, use raw_input() here instead of input()
    file_name_with_extension = input("Enter File Name:")
    text = extract_text(file_name_with_extension)
    print(text)
    

    This prints only the text content of the file, converted to lowercase. Supported file formats are listed here.
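
    Besides "content", the dictionary returned by parser.from_file also has a "metadata" entry. The sketch below shows one way to read both safely. `summarize` is a hypothetical helper, and it assumes the metadata uses the "Content-Type" key that Tika normally reports. Depending on the file, a metadata value may be a list instead of a string.

```python
def summarize(parsed):
    """Return (content_type, character_count) from a tika-python result dict."""
    # Either entry may be missing or None, so fall back to empty values.
    metadata = parsed.get("metadata") or {}
    content = parsed.get("content") or ""
    return metadata.get("Content-Type"), len(content)
```

    For example, `summarize(parser.from_file(file))` gives the detected type and the length of the extracted text without printing the whole document.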
