How can I use the Tika package (https://github.com/chrismattmann/tika-python) in Python 2.7 to parse PDF files?

南旧 2021-01-01 07:48

I'm trying to parse a few PDF files that contain engineering drawings to obtain the text data in the files. I tried using Tika as a JAR with Python and using it with the jnius

5 answers

生来不讨喜 (original poster)

2021-01-01 08:23
The solution given by Chris Mattmann is right, but I would like to add a couple of points. The following snippet writes the text extracted from the PDF to a text file. Open the output file with UTF-8 encoding so that non-ASCII text (for example, Chinese or Japanese characters) is written correctly.
```
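#!/usr/bin/env python
# -*- coding: utf-8 -*-

import io

import tika

tika.initVM()
from tika import parser

# from_file() returns a dict with "metadata" and "content" keys.
parsed_pdf = parser.from_file('file.pdf')

# On Python 2.7 the built-in open() has no `encoding` argument, so use io.open.
# "content" can be None when no text was extracted, hence the fallback.
with io.open('file.txt', 'w', encoding='utf-8') as out_file:
    out_file.write(parsed_pdf["content"] or u"")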
```
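Since the question mentions several PDF files, it may help to wrap the writing step in a small function. The sketch below is only an illustration: `save_content` is a hypothetical helper name, not part of tika-python, and it assumes its argument is a dict shaped like the result of `parser.from_file`, with a `"content"` key whose value is text or `None`. It treats missing content as an empty string, which can matter for drawings that contain little or no extractable text.

```python
import io


def save_content(parsed, out_path):
    """Write the text from a Tika-style result dict to a UTF-8 file.

    `parsed` is assumed to look like the dict returned by
    tika.parser.from_file(): it has a "content" key holding text or None.
    Returns the number of characters written.
    """
    # Treat a missing or None "content" as empty text instead of failing.
    text = parsed.get("content") or u""
    # Drop leading and trailing whitespace around the extracted text.
    text = text.strip()
    # io.open accepts `encoding` on both Python 2.7 and Python 3.
    with io.open(out_path, "w", encoding="utf-8") as out_file:
        out_file.write(text)
    return len(text)
```

It could then be called as `save_content(parser.from_file('file.pdf'), 'file.txt')` for each file. A return value of 0 would suggest that no text was extracted from that PDF.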