PDFminer: extract text with its font information

伪装坚强ぢ 2021-02-08 03:26

I found this question, but its answer uses the command line. I do not want to call a Python script from the command line via subprocess and then parse the resulting HTML files to get the font information.

6 answers
  •  旧巷少年郎
    2021-02-08 03:56

    I hope this could help you :)

    Get the font-family:

    # c is an object reached while walking the page layout
    if isinstance(c, pdfminer.layout.LTChar):
        print(c.fontname)
    

    Get the font-size:

    if isinstance(c, pdfminer.layout.LTChar):
        print(c.size)
    

    Get the character position (bounding box):

    if isinstance(c, pdfminer.layout.LTChar):
        print(c.bbox)  # (x0, y0, x1, y1) in page coordinates
    

    Get the image information:

    if isinstance(obj, pdfminer.layout.LTImage):
        # Position and size as placed on the page
        outputImg = "\n"
        outputImg += ("name: %s, " % obj.name)
        outputImg += ("x: %f, " % obj.bbox[0])
        outputImg += ("y: %f\n" % obj.bbox[1])
        outputImg += ("width1: %f, " % obj.width)
        outputImg += ("height1: %f, " % obj.height)
        # Pixel dimensions stored in the image stream itself
        outputImg += ("width2: %f, " % obj.stream.attrs['Width'])
        outputImg += ("height2: %f\n" % obj.stream.attrs['Height'])
        print(outputImg)
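
The snippets above do not show where `c` and `obj` come from. Here is a minimal sketch of one way to get them without subprocess or HTML, assuming the pdfminer.six fork is installed (it provides `pdfminer.high_level.extract_pages`). The helper names `iter_chars`, `group_font_runs` and `font_runs_from_pdf` are made up for this example and are not part of the library.

```python
from itertools import groupby
from typing import Any, Iterable, Iterator, List, Tuple


def iter_chars(layout_obj: Any) -> Iterator[Any]:
    """Recursively yield every LTChar found beneath a pdfminer layout object."""
    # Imported here so group_font_runs() stays usable without pdfminer.
    from pdfminer.layout import LTChar

    if isinstance(layout_obj, LTChar):
        yield layout_obj
    elif hasattr(layout_obj, "__iter__"):
        # Pages, text boxes, text lines and figures are iterable containers.
        for child in layout_obj:
            yield from iter_chars(child)


def group_font_runs(chars: Iterable[Any]) -> List[Tuple[str, float, str]]:
    """Merge consecutive characters that share font name and size into runs."""
    runs = []
    by_font = groupby(chars, key=lambda c: (c.fontname, round(c.size, 2)))
    for (fontname, size), group in by_font:
        runs.append((fontname, size, "".join(c.get_text() for c in group)))
    return runs


def font_runs_from_pdf(path: str) -> List[Tuple[str, float, str]]:
    """Return (fontname, size, text) runs for the whole PDF at `path`."""
    # Assumption: pdfminer.six is installed.
    from pdfminer.high_level import extract_pages

    chars = (c for page in extract_pages(path) for c in iter_chars(page))
    return group_font_runs(chars)
```

Calling `font_runs_from_pdf("example.pdf")` (the file name is a placeholder) should give a list of tuples you can loop over and print. Note that this sketch keeps only `LTChar` objects, so spaces and line breaks that pdfminer inserts as `LTAnno` objects are dropped. If you need them, handle `LTAnno` in `iter_chars` as well.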
    
