PDFminer: extract text with its font information

伪装坚强ぢ 2021-02-08 03:26

I found this question, but it uses the command line, and I do not want to call a Python script from the command line via subprocess and then parse the resulting HTML files just to get the font information.

6 answers
  •  孤街浪徒
    2021-02-08 03:56

    To get the font size or font name from a PDF file with the PDFMiner library, you have to interpret the whole PDF page. First decide which word or phrase you want the font size and font name for, since a single page can contain words in several different fonts and sizes. The layout structure PDFMiner produces for a page is: PDFPageInterpreter -> LTTextBox -> LTTextLine -> LTChar. Once you have found the characters of the word you are interested in, read the size attribute for the font size (which is actually the character height) and the fontname attribute for the font name. The code should look like this; you pass the PDF file path, the word whose font size you want, and the number of the page on which that word appears:

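    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTChar, LTTextBox
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    def get_fontsize_and_fontname_for_word(pdf_path, word, page_number):
        """Return (font size, font name) for `word` on a 1-based page, or (None, None)."""
        resource_manager = PDFResourceManager()
        device = PDFPageAggregator(resource_manager, laparams=LAParams())
        pdf_page_interpreter = PDFPageInterpreter(resource_manager, device)
        actual_font_size_pt = actual_font_name = None

        # open() replaces the Python 2-only file() builtin
        with open(pdf_path, 'rb') as pdf_file:
            for current_page_number, page in enumerate(PDFPage.get_pages(pdf_file)):
                if current_page_number != int(page_number) - 1:
                    continue
                pdf_page_interpreter.process_page(page)
                layout = device.get_result()
                for textbox_element in layout:
                    if not isinstance(textbox_element, LTTextBox):
                        continue
                    for line in textbox_element:
                        if word not in line.get_text().strip():
                            continue
                        # the values of the last character on the last matching line win
                        for char in line:
                            if isinstance(char, LTChar):
                                # the answer treats size as 96-dpi pixels and converts to
                                # points; verify this scaling against your own documents
                                actual_font_size_pt = int(char.size) * 72 / 96
                                # drop a subset prefix such as QTBAAA+ when one is present
                                name = char.fontname
                                actual_font_name = name[7:] if name[6:7] == '+' else name
                break  # the requested page has been processed
        device.close()
        return actual_font_size_pt, actual_font_name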
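
The font-name cleanup above depends on where the subset prefix (such as QTBAAA+) sits in the string. In PDF files a subset tag is six uppercase letters followed by a plus sign, so a slightly stricter sketch can match that pattern explicitly and leave every other name untouched. The helper name strip_subset_tag is hypothetical and not part of PDFMiner.

```python
import re

# A PDF font subset tag: exactly six uppercase ASCII letters, then "+".
_SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")


def strip_subset_tag(fontname: str) -> str:
    """Return fontname without a leading subset tag, if one is present."""
    # Names without a matching tag are returned unchanged.
    return _SUBSET_TAG.sub("", fontname)
```

You could then call strip_subset_tag(char.fontname) wherever the font name is read.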
    

    You can also check which other attributes the LTChar class exposes.
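
Because one line can mix several fonts, it may help to see every font run rather than a single value. The following is a minimal sketch, not part of the original answer: it groups consecutive characters that share a font name and size, relying only on the fontname and size attributes and the get_text() method that LTChar provides. The function name font_runs is hypothetical, and objects without font attributes (for example PDFMiner's LTAnno spacing objects) are simply skipped, so spaces inserted by the layout analysis are dropped from the text.

```python
from itertools import groupby


def font_runs(chars):
    """Group consecutive characters that share (fontname, size).

    `chars` is any iterable of character-like objects. Items lacking
    `fontname` or `size` are skipped. Returns a list of
    (text, fontname, size) tuples, with size rounded to 2 decimals.
    """
    styled = [c for c in chars if hasattr(c, "fontname") and hasattr(c, "size")]
    runs = []
    # groupby only merges neighbours, so a font that reappears later
    # in the line produces a separate run.
    for (fontname, size), group in groupby(
        styled, key=lambda c: (c.fontname, round(c.size, 2))
    ):
        text = "".join(c.get_text() for c in group)
        runs.append((text, fontname, size))
    return runs
```

With PDFMiner you would pass it a text line from the layout, for example font_runs(line) inside the loop over a text box.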
