How can I use the Tika package (https://github.com/chrismattmann/tika-python) in Python 2.7 to parse PDF files?

南旧 2021-01-01 07:48

I'm trying to parse a few PDF files that contain engineering drawings to extract the text data in them. I tried using Tika as a jar with Python and using it with the jnius

5 answers
  •  生来不讨喜
    2021-01-01 08:23

    The solution given by Chris Mattmann is right, but I would like to add a couple of points. Use the following code snippet to write the text extracted from the PDF to a text file. Specify UTF-8 encoding so that non-ASCII text (for example, Chinese or Japanese characters) is written correctly.

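    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    
    import io
    
    import tika
    
    tika.initVM()
    from tika import parser
    
    # from_file returns a dict with "metadata" and "content" keys
    parsed_pdf = parser.from_file('file.pdf')
    
    # The built-in open() has no encoding argument on Python 2.7; io.open does
    with io.open('file.txt', 'w', encoding='utf-8') as out:
        # "content" can be None when no text was extracted
        out.write(parsed_pdf["content"] or u"")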
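
Since the question mentions several PDF files, the same idea can be wrapped in a small helper and called once per file. This is only a sketch: the names `content_to_text` and `pdf_to_txt` and the `parse` parameter are made up for illustration and are not part of tika-python; only `parser.from_file` comes from the snippet above. It assumes that `content` may come back as `None`, which can happen for drawings that have no text layer.

```python
# -*- coding: utf-8 -*-
import io
import os


def content_to_text(parsed):
    """Return the extracted text from a Tika result dict, or u'' if there is none."""
    content = parsed.get("content")  # may be None or missing
    return content.strip() if content else u""


def pdf_to_txt(pdf_path, parse=None):
    """Parse one PDF and write its text next to it as a UTF-8 .txt file.

    `parse` is a hypothetical hook: any callable that takes a path and
    returns a dict with a "content" key. It defaults to tika's from_file.
    """
    if parse is None:
        # Imported lazily so the helper above can be used without tika.
        from tika import parser
        parse = parser.from_file
    text = content_to_text(parse(pdf_path))
    txt_path = os.path.splitext(pdf_path)[0] + ".txt"
    # io.open accepts encoding= on both Python 2.7 and Python 3.
    with io.open(txt_path, "w", encoding="utf-8") as out:
        out.write(text)
    return txt_path
```

To process a few files, call `pdf_to_txt(path)` in a loop over the PDF paths. An empty output file then suggests that the drawing contains no extractable text and may need OCR instead.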
    
