Extract headings as "key" and content as "value" from a PDF and store them as a dictionary in Python

Question

I want to extract each heading in a PDF file as a "key" and the content below it as the "value", and store the pairs in a dictionary using Python.

I have tried converting the PDF to HTML, reading the font name of each heading and content span, and building the dictionary from that, but it does not give the expected output. I have also tried using the coordinates of the text, which did not help either.

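from bs4 import BeautifulSoup

# Assumption: "output.html" is the HTML produced by the PDF-to-HTML step.
with open("output.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

final_json = {}
key = ""
value = ""

for data in soup.select('span'):
    span_html = str(data)
    # No trailing space after the font name, so the match does not depend
    # on what follows it in the style attribute.
    if "b'TrebuchetMS-Bold'" in span_html:
        # Heading span: store the previous heading/content pair first.
        if key != "":
            final_json[key] = value
        key = ""
        value = ""
        for d in data.contents:
            if str(d) != "<br/>":  # skip line breaks
                key = key + str(d)
        key = key.strip()
    elif "b'TimesNewRomanPSMT'" in span_html and key != "":
        # Body span: append its text to the current heading's content.
        for d in data.contents:
            if str(d) != "<br/>":
                value = value + str(d)

# Store the last pair, which the loop itself never writes.
if key != "":
    final_json[key] = value

print(final_json)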
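
The grouping step can also be separated from the HTML parsing, which makes it easier to test. The following is only a minimal sketch, assuming the spans have already been reduced to (font name, text) pairs in reading order. The function name `group_by_heading` and the sample text are hypothetical; only the two font names come from the snippet above.

```python
def group_by_heading(spans, heading_font, body_font):
    """Group (font_name, text) pairs into a {heading: content} dictionary.

    Assumes spans arrive in reading order. Consecutive heading spans are
    joined into one key; spans in any other font are ignored.
    """
    result = {}
    key_parts, value_parts = [], []
    in_heading = False

    def flush():
        # Store the pair collected so far, if a heading has been seen.
        if key_parts:
            result[" ".join(key_parts)] = " ".join(value_parts)

    for font, text in spans:
        text = " ".join(text.split())  # collapse whitespace and line breaks
        if not text:
            continue
        if font == heading_font:
            if not in_heading:
                # A new heading starts: store the previous pair and reset.
                flush()
                key_parts.clear()
                value_parts.clear()
            key_parts.append(text)
            in_heading = True
        elif font == body_font and key_parts:
            # Body text counts only once a heading has been seen.
            value_parts.append(text)
            in_heading = False

    flush()  # store the final pair after the loop ends
    return result


# Made-up sample data for illustration only.
spans = [
    ("TimesNewRomanPSMT", "Page 1"),  # body text before any heading is skipped
    ("TrebuchetMS-Bold", "Introduction"),
    ("TimesNewRomanPSMT", "This report covers\n"),
    ("TimesNewRomanPSMT", "two topics."),
    ("TrebuchetMS-Bold", "Scope"),
    ("TimesNewRomanPSMT", "Limited to PDF input."),
]
sections = group_by_heading(spans, "TrebuchetMS-Bold", "TimesNewRomanPSMT")
print(sections)
```

Note that a repeated heading overwrites the earlier entry, because dictionary keys are unique. If the PDF can contain duplicate headings, a list of (heading, content) pairs may be a safer container.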

Source: https://stackoverflow.com/questions/57903701/extract-heading-as-key-and-content-as-value-and-store-it-as-dictionary-in-py

Tags

python

dictionary

pdf-extraction
