Python module for converting PDF to text [closed]

陌清茗 2020-11-22 08:59

13 answers

花落未央 (original poster)

2020-11-22 09:19

I have used pdftohtml with the -xml argument and read the result with subprocess.Popen(). That gives you the x coordinate, y coordinate, width, height, and font of every snippet of text in the PDF. I think 'evince' probably uses the same code underneath, because it spews out the same error messages.
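For anyone who wants to try this, here is a minimal sketch of the approach, not the exact code I used. It assumes the poppler build of pdftohtml is on your PATH and that it accepts -i (skip images) and -stdout alongside -xml. The function names and SAMPLE_XML are made up for illustration; the sample only imitates the shape of the real output (page elements containing text elements with top, left, width, height, and font attributes).

```python
import subprocess
import xml.etree.ElementTree as ET


def run_pdftohtml(pdf_path):
    """Run pdftohtml in XML mode and return the raw XML as bytes.

    Assumes the poppler pdftohtml binary is installed and on PATH.
    """
    proc = subprocess.Popen(
        ["pdftohtml", "-xml", "-i", "-stdout", pdf_path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(err.decode(errors="replace"))
    return out


def parse_snippets(xml_data):
    """Turn pdftohtml XML into a list of dicts, one per text snippet."""
    root = ET.fromstring(xml_data)
    snippets = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for el in page.iter("text"):
            snippets.append({
                "page": page_no,
                "left": int(el.get("left")),      # x coordinate
                "top": int(el.get("top")),        # y coordinate
                "width": int(el.get("width")),
                "height": int(el.get("height")),
                "font": el.get("font"),           # id of a <fontspec> element
                # itertext() also picks up text nested in <b>, <i>, <a>
                "text": "".join(el.itertext()),
            })
    return snippets


# Hypothetical sample that imitates the output format, so the parser
# can be tried without a PDF at hand.
SAMPLE_XML = b"""<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="12" family="Times" color="#000000"/>
<text top="100" left="72" width="120" height="14" font="0">Hello <b>world</b></text>
</page>
</pdf2xml>"""

snippets = parse_snippets(SAMPLE_XML)
```

With a real file you would call parse_snippets(run_pdftohtml("some.pdf")), where the file name is a placeholder.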

If you need to process columnar data, it gets slightly more complicated, because you have to invent an algorithm that suits your PDF file. The problem is that the programs that generate PDF files don't necessarily lay out the text in any logical order. Simple sorting algorithms work sometimes, but there can be little 'stragglers' and 'strays': pieces of text that don't end up in the order you expected. So you have to get creative.

It took me about 5 hours to work out one for the PDFs I was working on, but it works pretty well now. Good luck.
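To give an idea of what a "simple sorting algorithm" for two-column pages might look like, here is a rough sketch. It is not the algorithm I ended up with. It assumes each snippet is a dict with "page", "left", "top", and "text" keys, that you already know the x position where the second column starts (column_split, a hypothetical parameter you would have to measure for your own files), and that snippets on the same line have nearly the same top value.

```python
def reading_order(snippets, column_split, line_tolerance=3):
    """Sort snippets page by page, left column first, top to bottom.

    column_split and line_tolerance are illustrative parameters that
    need tuning per document.
    """
    def sort_key(s):
        column = 0 if s["left"] < column_split else 1
        # Bucket the y coordinate so snippets on the same visual line
        # compare equal and then fall back to left-to-right order.
        line = s["top"] // line_tolerance
        return (s["page"], column, line, s["left"])

    return sorted(snippets, key=sort_key)


# Made-up snippets in the scrambled order a PDF might store them.
scrambled = [
    {"page": 1, "left": 320, "top": 100, "text": "right-1"},
    {"page": 1, "left": 72, "top": 120, "text": "left-2"},
    {"page": 1, "left": 72, "top": 100, "text": "left-1"},
    {"page": 1, "left": 320, "top": 121, "text": "right-2"},
]

ordered = [s["text"] for s in reading_order(scrambled, column_split=300)]
```

The bucketing is crude: two snippets on the same line can land in different buckets if their top values straddle a bucket boundary, which is exactly the kind of stray that forces you to adapt the logic to your own PDFs.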
