I have used pdftohtml
with the -xml
argument, read the result with subprocess.Popen()
, that will give you x coord, y coord, width, height, and font, of every snippet of text in the pdf. I think this is what 'evince' probably uses too because the same error messages spew out.
If you need to process columnar data, it gets slightly more complicated as you have to invent an algorithm that suits your pdf file. The problem is that the programs that make PDF files don't really necessarily lay out the text in any logical format. You can try simple sorting algorithms and it works sometimes, but there can be little 'stragglers' and 'strays', pieces of text that don't get put in the order you thought they would. So you have to get creative.
It took me about 5 hours to figure out one for the pdf's I was working on. But it works pretty good now. Good luck.