PDFminer: extract text with its font information

伪装坚强ぢ 2021-02-08 03:26

I found this question, but its answer relies on the command line. I do not want to call a Python script through subprocess and then parse the resulting HTML files just to get the font information.

6 answers
  •  有刺的猬
    2021-02-08 04:05

    This approach does not use PDFMiner but does the trick.

    First, convert the PDF document to docx. With python-docx you can then retrieve the font information. Here's an example that prints all the bold text:

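    from docx import Document  # provided by the python-docx package

    document = Document('/path/to/file.docx')

    # Print the text of every run that is explicitly marked bold.
    for para in document.paragraphs:
        for run in para.runs:
            if run.bold:
                print(run.text)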
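
    Bold is only one of the attributes a run exposes. As a rough sketch (the helper names describe_runs and docx_font_info are made up for illustration, not part of python-docx), the same loop can also collect each run's font name and size. Note that python-docx reports None for any value the run inherits from its style, so missing names and sizes are expected.

```python
def describe_runs(paragraphs):
    """Yield (text, font_name, size_pt, bold, italic) for each non-empty run.

    `paragraphs` is anything shaped like python-docx's Document.paragraphs.
    """
    for para in paragraphs:
        for run in para.runs:
            if not run.text:
                continue
            size = run.font.size  # a Length object, or None when inherited
            yield (
                run.text,
                run.font.name,  # None when inherited from the style
                size.pt if size is not None else None,
                bool(run.bold),    # None (inherited) is reported as False
                bool(run.italic),
            )


def docx_font_info(path):
    """Hypothetical convenience wrapper: read a .docx and list its runs."""
    from docx import Document  # imported lazily; requires python-docx
    return list(describe_runs(Document(path).paragraphs))
```

    Calling docx_font_info('/path/to/file.docx') would return one tuple per run. How faithful the fonts are depends on the PDF-to-docx converter used in the first step.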
    

    If you really want to use PDFMiner, you can try its command-line tool: passing '-t html' to pdf2txt.py converts the PDF into HTML with all the font information.
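
    For the original requirement (no subprocess, no HTML parsing), the font data can also be read from PDFMiner's layout objects in Python. The following is a minimal sketch, assuming the pdfminer.six fork is installed; the function names are invented for this example. It relies on extract_pages and on the fontname and size attributes of LTChar, and it skips the LTAnno objects (spaces and line breaks inferred by the layout analysis), since those carry no font.

```python
from itertools import groupby


def group_by_font(chars):
    """Merge consecutive (text, fontname, size) triples that share a font."""
    runs = []
    for (fontname, size), group in groupby(chars, key=lambda c: (c[1], c[2])):
        runs.append(("".join(c[0] for c in group), fontname, size))
    return runs


def iter_chars(pdf_path):
    """Yield (text, fontname, size) for each character in text boxes."""
    # Imported lazily so group_by_font works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue  # skip figures, lines, rectangles, images
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):  # LTAnno has no font info
                        yield obj.get_text(), obj.fontname, round(obj.size, 1)


def pdf_font_runs(pdf_path):
    """Hypothetical entry point: text runs of a PDF with font name and size."""
    return group_by_font(iter_chars(pdf_path))
```

    Each item returned by pdf_font_runs('/path/to/file.pdf') is a (text, fontname, size) tuple. Bold or italic is not a separate flag here; it usually has to be inferred from the font name, which is a heuristic rather than something PDFMiner guarantees.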
