Question
I want to extract each heading from a PDF file as a key and the content below it as the value, and store the pairs in a dictionary using Python.
I have tried converting the PDF to HTML, reading the font names of the headings and of the content, and building a dictionary from them, but this does not give the expected output. I have also tried using the coordinates of the text, but that does not help either.
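# soup is assumed to be a BeautifulSoup object built from the converted HTML
final_json = {}
key = ""
value = ""

for data in soup.select('span'):
    print("--", data)
    if "b'TrebuchetMS-Bold' " in str(data):
        # A heading span: store the previous heading/content pair first
        if key != "":
            final_json[key] = value
            key = ""
            value = ""
        #print("++", data.contents)
        for d in data.contents:
            if str(d) == "<br/>":
                pass
            else:
                key = key + str(d)
        key = key.strip()
        print("***key", key)
    elif "b'TimesNewRomanPSMT'" in str(data) and key != "":
        # A content span: append its text to the current value
        for d in data.contents:
            if str(d) == "<br/>":
                pass
            else:
                value = value + str(d)
        print("value", value)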
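For reference, here is a minimal, self-contained sketch of the font-based grouping idea using only the standard library. It rests on assumptions: that the converted HTML puts each run of text in a `<span>` whose `style` attribute contains `font-family: ...`, and that the two font names from the code above identify headings and body text. The names `SpanCollector` and `headings_to_dict` are hypothetical helpers, not part of any library. Unlike the loop above, the sketch also stores the last heading after the loop ends and merges headings that are split across consecutive spans.

```python
import re
from html.parser import HTMLParser

# Font names taken from the question; adjust them to match the actual PDF.
HEADING_FONT = "TrebuchetMS-Bold"
BODY_FONT = "TimesNewRomanPSMT"


class SpanCollector(HTMLParser):
    """Collect (font_family, text) pairs from <span style="font-family: ..."> tags."""

    def __init__(self):
        super().__init__()
        self.spans = []     # finished (font, text) pairs
        self._font = None   # font of the span currently open, if any
        self._parts = []    # text fragments of the current span

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = dict(attrs).get("style") or ""
            match = re.search(r"font-family:\s*([^;]+)", style)
            self._font = match.group(1).strip() if match else ""
            self._parts = []

    def handle_data(self, data):
        if self._font is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._font is not None:
            self.spans.append((self._font, "".join(self._parts)))
            self._font = None


def headings_to_dict(spans, heading_font=HEADING_FONT, body_font=BODY_FONT):
    """Group (font, text) pairs into {heading: content}."""
    result = {}
    key_parts, value_parts = [], []
    for font, text in spans:
        text = " ".join(text.split())  # collapse line breaks and extra spaces
        if not text:
            continue
        if heading_font in font:
            if value_parts:
                # The previous section is complete: store it and start a new one.
                result[" ".join(key_parts)] = " ".join(value_parts)
                key_parts, value_parts = [], []
            key_parts.append(text)
        elif body_font in font and key_parts:
            value_parts.append(text)
    if key_parts:
        # Store the last section, which no later heading would trigger.
        result[" ".join(key_parts)] = " ".join(value_parts)
    return result


# Made-up sample input, shaped like the assumed converter output.
sample_html = (
    "<div>"
    "<span style=\"font-family: b'TrebuchetMS-Bold'; font-size:12px\">Introduction\n<br/></span>"
    "<span style=\"font-family: b'TimesNewRomanPSMT'; font-size:10px\">First line\n<br/>second line\n<br/></span>"
    "<span style=\"font-family: b'TrebuchetMS-Bold'; font-size:12px\">Scope<br/></span>"
    "<span style=\"font-family: b'TimesNewRomanPSMT'; font-size:10px\">Only text.</span>"
    "</div>"
)

collector = SpanCollector()
collector.feed(sample_html)
result = headings_to_dict(collector.spans)
print(result)
# {'Introduction': 'First line second line', 'Scope': 'Only text.'}
```

Separating the span collection from the grouping step makes it possible to inspect the `(font, text)` pairs first and confirm which font names the PDF really uses before building the dictionary.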
Source: https://stackoverflow.com/questions/57903701/extract-heading-as-key-and-content-as-value-and-store-it-as-dictionary-in-py