I am trying to extract data as HTML elements in python using pdfminer

Submitted by 半世苍凉 on 2021-02-11 13:33:59

Question


I am trying to extract data as HTML from a PDF using pdfminer. I was able to extract plain text from the same PDF, but now I get an error when extracting the data as HTML. I have to filter the data further to categorize it into CSV. This is the script:

from io import StringIO
from pdfminer.layout import LAParams
from pdfminer.high_level import extract_text_to_fp

output_string = StringIO

with open('mini.pdf', 'rb') as fn:
    extract_text_to_fp(fn, output_string, laparams=LAParams(), output_type='html', codec=None)

And this is the error I am getting (the error output was linked from the original post).


Answer 1:


This code works for me.

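from io import StringIO
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams

def convert_html(filename):
    # StringIO() creates an in-memory text buffer instance that receives the HTML.
    output = StringIO()
    with open(filename, 'rb') as fin:
        # codec=None keeps the output as str, which is what StringIO accepts.
        extract_text_to_fp(fin, output, laparams=LAParams(),
                           output_type='html', codec=None)
    return output.getvalue()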
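
The visible difference from the script in the question is `StringIO()` versus `StringIO`: the question passes the class itself, not a buffer instance. The original error text is not shown here, so this is only a likely cause. The sketch below uses just the standard library; `try_write` is a hypothetical helper that imitates a converter calling `write()` on its output object.

```python
from io import StringIO

def try_write(target):
    """Attempt the kind of write() call a converter would make on its output."""
    try:
        target.write("<html></html>")
    except TypeError as exc:
        return f"TypeError: {exc}"
    return "ok"

# Passing the class itself, as in the question: write() has no instance to act on.
class_result = try_write(StringIO)

# Passing an instance, as in the answer: the text is stored in the buffer.
instance_result = try_write(StringIO())

print(class_result)
print(instance_result)
```

If this is indeed the cause, changing `output_string = StringIO` to `output_string = StringIO()` in the question's script and reading the result with `output_string.getvalue()` should be enough.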


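The question also mentions filtering the HTML further to categorize it into CSV. A minimal sketch of one way to do that with the standard library follows. It assumes the generated HTML carries text in `<span>` tags with an inline `font-size` style. The sample markup is made up for illustration, so the real output of `convert_html` should be inspected first. `SpanCollector` and `html_to_csv` are hypothetical names, not part of pdfminer.

```python
import csv
import re
from html.parser import HTMLParser
from io import StringIO

class SpanCollector(HTMLParser):
    """Collect (font_size, text) pairs from <span style="...font-size:Npx"> tags."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._sizes = []  # font sizes of the currently open <span> tags

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            style = dict(attrs).get("style") or ""
            match = re.search(r"font-size:\s*(\d+)px", style)
            self._sizes.append(int(match.group(1)) if match else None)

    def handle_endtag(self, tag):
        if tag == "span" and self._sizes:
            self._sizes.pop()

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that sits inside a <span>.
        if text and self._sizes:
            self.rows.append((self._sizes[-1], text))

def html_to_csv(html):
    """Return CSV text with one row per span: font size, then the span text."""
    parser = SpanCollector()
    parser.feed(html)
    parser.close()
    buffer = StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["font_size", "text"])
    writer.writerows(parser.rows)
    return buffer.getvalue()

# Hypothetical sample markup, standing in for the string returned by convert_html().
sample_html = (
    '<div><span style="font-family: Helvetica; font-size:18px">Invoice</span>'
    '<span style="font-family: Helvetica; font-size:10px">Total: 42</span></div>'
)
csv_text = html_to_csv(sample_html)
print(csv_text)
```

Font size is used here only as an example of a feature to categorize on, for instance to tell headings from body text. Position values in the surrounding `style` attributes could be collected the same way.
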
Source: https://stackoverflow.com/questions/63523909/i-am-trying-to-extract-data-as-html-elements-in-python-using-pdfminer
