PDFMiner: extract text with its font information


伪装坚强ぢ 2021-02-08 03:26

I found this question, but it uses the command line. I do not want to call a Python script from the command line via subprocess and then parse HTML files to get the font information.

6 answers

有刺的猬 (OP)

2021-02-08 04:05
This approach does not use PDFMiner but does the trick.

First, convert the PDF document to docx. Using python-docx, you can then retrieve font information. Here's an example that gets all the bold text:
```
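from docx import Document

# Open the .docx file produced from the PDF
document = Document('/path/to/file.docx')

# A run is a span of text that shares the same character formatting
for para in document.paragraphs:
    for run in para.runs:
        # run.bold is True only when bold is set directly on the run;
        # None means the value is inherited from the style
        if run.bold:
            print(run.text)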
```
If you really want to use PDFMiner, you can try this. Passing `-t html` to its `pdf2txt.py` tool converts the PDF into HTML with all the font information.
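Since the question asks for a way to avoid the command line and HTML parsing, here is a minimal sketch of reading font information directly from PDFMiner's layout objects. It assumes the pdfminer.six fork is installed; the helper names `iter_chars`, `group_runs`, and `bold_text` are made up for this example, and treating a font as bold when its name contains "Bold" is only a heuristic, since PDFs do not store a bold flag per character.

```python
from itertools import groupby


def iter_chars(pdf_path):
    """Yield (text, fontname, size) for every laid-out character in the PDF."""
    # Imported here so group_runs() below can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if not isinstance(element, LTTextContainer):
                continue  # skip figures, lines, rectangles, ...
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    # LTAnno objects (inserted spaces/newlines) carry no font
                    # information, so only LTChar objects are yielded.
                    if isinstance(obj, LTChar):
                        yield obj.get_text(), obj.fontname, round(obj.size, 1)


def group_runs(chars):
    """Merge consecutive characters sharing a font name and size into runs."""
    runs = []
    for (fontname, size), group in groupby(chars, key=lambda c: (c[1], c[2])):
        runs.append(("".join(c[0] for c in group), fontname, size))
    return runs


def bold_text(pdf_path):
    """Heuristic (assumption): a run is bold if its font name contains 'bold'."""
    return [
        text
        for text, fontname, _ in group_runs(iter_chars(pdf_path))
        if "bold" in fontname.lower()
    ]
```

Calling `bold_text('/path/to/file.pdf')` should then give roughly the same kind of result as the python-docx loop above, without the intermediate docx conversion. Font names in PDFs often carry a subset prefix such as `ABCDEF+`, so matching on a substring is more robust than comparing full names.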