PDFminer: extract text with its font information

I found this question, but its answer uses the command line. I do not want to run a Python script through subprocess and then parse the resulting HTML files just to get the font information.

6 Answers

执笔经年

2021-02-08 03:47

Have a look at PDFlib. It can extract the font information you need, and it has a Python library that you can import and use in your scripts.

小蘑菇

2021-02-08 03:49

Some of this information lives at a lower level, in the LTChar class. That seems logical, because font size, italic, bold, etc. can be applied to a single character.

More information here: https://github.com/euske/pdfminer/blob/master/pdfminer/layout.py#L222

But I'm still confused about why font color is not in this class.
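
As far as I can tell, LTChar exposes the font as a name string (fontname) rather than as separate bold or italic flags, so style usually has to be inferred from that name. Here is a minimal sketch of such a heuristic. `describe_font` is a hypothetical helper, not part of PDFMiner, and the keyword matching is an assumption that will misfire on fonts whose names do not follow the usual "Bold" / "Italic" / "Oblique" convention.

```python
import re

# Subset fonts embedded in a PDF carry a tag of six uppercase letters
# followed by "+", for example "QTBAAA+Arial-BoldMT".
_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def describe_font(fontname):
    """Return (base_name, is_bold, is_italic) guessed from a font name.

    This is a naming heuristic only (an assumption, not PDFMiner API):
    it looks for common style keywords in the font name.
    """
    base = _SUBSET_PREFIX.sub("", fontname)
    lowered = base.lower()
    is_bold = "bold" in lowered
    is_italic = "italic" in lowered or "oblique" in lowered
    return base, is_bold, is_italic
```

You would call it with the fontname attribute of any LTChar, for example `describe_font(c.fontname)`.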

旧巷少年郎

2021-02-08 03:56

I hope this could help you :)

Get the font-family:

if isinstance(c, pdfminer.layout.LTChar):
    print (c.fontname)

Get the font-size:

if isinstance(c, pdfminer.layout.LTChar):
    print (c.size)

Get the font position (bounding box):

if isinstance(c, pdfminer.layout.LTChar):
    print (c.bbox)

Get the image information:

if isinstance(obj, pdfminer.layout.LTImage):
    outputImg = "<Image>\n"
    outputImg += ("name: %s, " % obj.name)
    outputImg += ("x: %f, " % obj.bbox[0])
    outputImg += ("y: %f\n" % obj.bbox[1])
    outputImg += ("width1: %f, " % obj.width)
    outputImg += ("height1: %f, " % obj.height)
    outputImg += ("width2: %f, " % obj.stream.attrs['Width'])
    outputImg += ("height2: %f\n" % obj.stream.attrs['Height'])
    print (outputImg)
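
The snippets above assume you already have `c` and `obj` in hand. One way to get them, sketched here under the assumption that you use pdfminer.six (which provides `pdfminer.high_level.extract_pages`), is to walk the layout tree recursively. `iter_chars` and `chars_in_pdf` are hypothetical helper names; the walker relies only on duck typing (anything with a `fontname` attribute counts as a character), so it does not import PDFMiner itself.

```python
def iter_chars(obj):
    """Yield every item that carries a fontname, walking containers depth first."""
    if hasattr(obj, "fontname"):
        yield obj
        return
    try:
        children = iter(obj)
    except TypeError:
        # Leaf without font information (for example an LTAnno or LTImage).
        return
    for child in children:
        yield from iter_chars(child)


def chars_in_pdf(path):
    """Yield the characters of every page of the PDF at `path`.

    Assumes pdfminer.six is installed; the import is kept local so the
    walker above stays usable without it.
    """
    from pdfminer.high_level import extract_pages

    for page_layout in extract_pages(path):
        yield from iter_chars(page_layout)


# Example use (needs a real PDF file):
#     for c in chars_in_pdf("/tmp/simple.pdf"):
#         print(c.get_text(), c.fontname, c.size, c.bbox)
```

Because the walker descends into any iterable container, it also reaches characters nested inside figures, not only those inside text boxes.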

孤街浪徒

2021-02-08 03:56

To get the font size or font name from a PDF file with the PDFMiner library, you have to interpret the whole PDF page. First decide which word or phrase you want the font size and font name for, since a single page can contain words with different font sizes.

The structure PDFMiner gives you for a page is: PDFPageInterpreter -> LTTextBox -> LTTextLine -> LTChar.

Once you have found the word, read the size attribute for the font size (which is actually the glyph height) and the fontname attribute for the font name. The code should look like this: you pass the PDF file path, the word whose font size you want, and the number of the page that contains the word:

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTChar, LTTextBox
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def get_fontsize_and_fontname_for_word(pdf_path, word, page_number):
    resource_manager = PDFResourceManager()
    layout_params = LAParams()
    device = PDFPageAggregator(resource_manager, laparams=layout_params)
    pdf_page_interpreter = PDFPageInterpreter(resource_manager, device)
    # both stay None if the word is not found on the requested page
    actual_font_size_pt = actual_font_name = None

    # file() only exists in Python 2; open() works in both versions
    with open(pdf_path, 'rb') as pdf_file:
        for current_page_number, page in enumerate(PDFPage.get_pages(pdf_file)):
            if current_page_number == int(page_number) - 1:
                pdf_page_interpreter.process_page(page)
                layout = device.get_result()
                for textbox_element in layout:
                    if isinstance(textbox_element, LTTextBox):
                        for line in textbox_element:
                            word_from_textbox = line.get_text().strip()
                            if word in word_from_textbox:
                                for char in line:
                                    if isinstance(char, LTChar):
                                        # pixels-to-points scaling kept from the original answer;
                                        # PDF units are usually already points, so verify this factor
                                        actual_font_size_pt = int(char.size) * 72 / 96
                                        # remove a subset prefix such as QTBAAA+, if present
                                        actual_font_name = char.fontname.split('+', 1)[-1]
    device.close()
    return actual_font_size_pt, actual_font_name

You can check which other properties the LTChar class supports.
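
Note that the function above reports the font of the last character on the matching line, which is not necessarily the font of the word itself. Here is a minimal sketch of narrowing it down to the characters that actually spell the word. `fonts_for_word` is a hypothetical helper; it assumes each item of a line has get_text(), and that characters (but not inserted spaces such as LTAnno) have fontname and size.

```python
from collections import Counter


def fonts_for_word(line_items, word):
    """Count (fontname, size) pairs over the characters that spell `word`.

    `line_items` is the sequence you get by iterating over a text line.
    Returns an empty Counter if the word does not occur in the line.
    """
    owners = []  # owners[i] is the item that produced character i of the text
    for item in line_items:
        owners.extend([item] * len(item.get_text()))
    text = "".join(item.get_text() for item in line_items)

    start = text.find(word)
    if start < 0:
        return Counter()

    counts = Counter()
    for item in owners[start:start + len(word)]:
        # Skip items without font information (inserted spaces, newlines).
        if hasattr(item, "fontname"):
            counts[(item.fontname, round(item.size, 1))] += 1
    return counts
```

With `counts = fonts_for_word(list(line), word)`, the expression `counts.most_common(1)` gives the dominant font of the word, and more than one key means the word mixes fonts.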

刺人心

2021-02-08 03:57

#!/usr/bin/env python
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer


def createPDFDoc(fpath):
    fp = open(fpath, 'rb')
    parser = PDFParser(fp)
    document = PDFDocument(parser, password='')
    # Check if the document allows text extraction. If not, abort.
    if not document.is_extractable:
        raise ValueError("Not extractable")
    else:
        return document


def createDeviceInterpreter():
    rsrcmgr = PDFResourceManager()
    laparams = LAParams()
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    return device, interpreter


def parse_obj(objs):
    for obj in objs:
        if isinstance(obj, pdfminer.layout.LTTextBox):
            for o in obj._objs:
                if isinstance(o,pdfminer.layout.LTTextLine):
                    text=o.get_text()
                    if text.strip():
                        for c in  o._objs:
                            if isinstance(c, pdfminer.layout.LTChar):
                                print("fontname %s" % c.fontname)
        # if it's a container, recurse
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj._objs)
        else:
            pass


document=createPDFDoc("/tmp/simple.pdf")
device,interpreter=createDeviceInterpreter()
pages=PDFPage.create_pages(document)
interpreter.process_page(next(pages))
layout = device.get_result()


parse_obj(layout._objs)
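
The script above prints one font name per character, which gets noisy on real documents. Here is a minimal sketch of collapsing consecutive characters that share a font into runs. `font_runs` is a hypothetical helper that relies only on the get_text(), fontname and size members seen on LTChar; items without a fontname (such as the spaces PDFMiner inserts) are simply appended to the current run.

```python
def font_runs(items):
    """Group consecutive characters with the same font into runs.

    Returns a list of (fontname, size, text) tuples in reading order.
    """
    runs = []  # each entry is [(fontname, size), text]
    for item in items:
        text = item.get_text()
        fontname = getattr(item, "fontname", None)
        if fontname is None:
            # No font information: keep the text with the run in progress.
            if runs:
                runs[-1][1] += text
            continue
        key = (fontname, round(item.size, 1))
        if runs and runs[-1][0] == key:
            runs[-1][1] += text
        else:
            runs.append([key, text])
    return [(fontname, size, text) for (fontname, size), text in runs]
```

Inside parse_obj you could call `font_runs(o._objs)` once per text line instead of printing inside the innermost loop.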

有刺的猬

2021-02-08 04:05
This approach does not use PDFMiner, but it does the trick.

First, convert the PDF document to docx. With python-docx you can then retrieve the font information. Here is an example that prints all the bold text:
```
from docx import Document

document = Document('/path/to/file.docx')

for para in document.paragraphs:
    for run in para.runs:
        if run.bold:
            print(run.text)
```
If you really want to use PDFMiner, you can try its pdf2txt.py tool. Passing '-t html' converts the PDF into HTML with all the font information.
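
If you go the HTML route, the font information ends up in the style attribute of span elements. Here is a minimal sketch of reading it back with the standard library only. `FontSpanParser` is a hypothetical class, and the markup shape (spans with `font-family` and `font-size` declarations) is an assumption about the HTML output, so check it against a file you generated.

```python
from html.parser import HTMLParser


class FontSpanParser(HTMLParser):
    """Collect (font-family, font-size, text) for text inside styled spans."""

    def __init__(self):
        super().__init__()
        self.stack = []   # style declarations of the currently open spans
        self.pieces = []  # collected (font-family, font-size, text) tuples

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        style = dict(attrs).get("style") or ""
        decls = {}
        for decl in style.split(";"):
            if ":" in decl:
                name, value = decl.split(":", 1)
                decls[name.strip().lower()] = value.strip()
        self.stack.append(decls)

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Only record non-blank text that sits inside a span.
        if self.stack and data.strip():
            style = self.stack[-1]
            self.pieces.append(
                (style.get("font-family"), style.get("font-size"), data.strip())
            )
```

Feed it the HTML text with `parser.feed(html_text)` and read `parser.pieces` afterwards.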