I have been trying for a while to use the PoDoFo C++ library to extract text and lines (with their respective coordinates), but I have not found a way to do this.
This is what I
Use the PoDoFo tool "podofotxtextract" (in the tools folder of the PoDoFo package). It extracts text from a PDF and gives you the x,y coordinates.
This answer will show you how to extract the text.
To get text positioning information, you will also have to process the following commands: Tc, Tw, Tz, TL, T*, Tr and Tm.
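As a rough illustration of what processing these commands means, here is a minimal sketch of a text state that gets updated as operators arrive. It does not use any PoDoFo API: TextState and applyTextOperator are hypothetical names, and I am assuming your tokenizer already hands you the operator name together with its numeric operands. Tf, Td, TD, Ts and the reset of both matrices at BT are left out.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

// PDF matrix [a b c d e f], using the row-vector convention of the PDF spec.
using Matrix = std::array<double, 6>;

// Hypothetical container for the text state parameters named above.
struct TextState {
    double charSpace = 0.0;     // Tc: extra space after each glyph
    double wordSpace = 0.0;     // Tw: extra space after each space character
    double horizScale = 100.0;  // Tz: horizontal scaling, in percent
    double leading = 0.0;       // TL: line distance used by T*
    int renderMode = 0;         // Tr: fill, stroke, invisible, ...
    Matrix textMatrix{1, 0, 0, 1, 0, 0};  // set by Tm, advanced while showing text
    Matrix lineMatrix{1, 0, 0, 1, 0, 0};  // start of the current line
};

// Applies one operator to the state. Returns false for operators this
// sketch does not handle.
bool applyTextOperator(TextState& ts, const std::string& op,
                       const std::vector<double>& args) {
    if (op == "Tc") {
        assert(args.size() == 1);
        ts.charSpace = args[0];
    } else if (op == "Tw") {
        assert(args.size() == 1);
        ts.wordSpace = args[0];
    } else if (op == "Tz") {
        assert(args.size() == 1);
        ts.horizScale = args[0];
    } else if (op == "TL") {
        assert(args.size() == 1);
        ts.leading = args[0];
    } else if (op == "Tr") {
        assert(args.size() == 1);
        ts.renderMode = static_cast<int>(args[0]);
    } else if (op == "Tm") {
        assert(args.size() == 6);
        for (std::size_t i = 0; i < 6; ++i) ts.lineMatrix[i] = args[i];
        ts.textMatrix = ts.lineMatrix;
    } else if (op == "T*") {
        assert(args.empty());
        // T* behaves like "0 -TL Td": translate the line matrix by
        // (0, -leading) in text space, then restart the text matrix there.
        ts.lineMatrix[4] += -ts.leading * ts.lineMatrix[2];
        ts.lineMatrix[5] += -ts.leading * ts.lineMatrix[3];
        ts.textMatrix = ts.lineMatrix;
    } else {
        return false;
    }
    return true;
}
```

You would call applyTextOperator for every operator inside a BT ... ET block and read the current position from elements 4 and 5 of textMatrix whenever a text-showing operator comes up.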
You definitely need to download the PDF spec from Adobe to get all the details. There is a chapter devoted entirely to text processing. It is well worth your time to print out that chapter as you will be referring to it a lot. Everything you need to know is in there, but it's not always obvious.
You will also need to use a bit of Linear Algebra. Nothing too complicated, though.
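The linear algebra in question is mostly multiplying 3x3 affine matrices that PDF writes as six numbers [a b c d e f]. Below is a small sketch of the two operations you end up needing: concatenating two matrices and mapping a point through one. The names multiply, transform and textOriginInUserSpace are mine, not part of PoDoFo, and the sketch assumes you track the text matrix and the current transformation matrix (CTM) yourself.

```cpp
#include <array>

// PDF matrix [a b c d e f], standing for the 3x3 matrix
//   | a b 0 |
//   | c d 0 |
//   | e f 1 |
// Points are row vectors [x y 1] multiplied from the left.
using Matrix = std::array<double, 6>;

struct Point {
    double x;
    double y;
};

// Returns m1 x m2: the transformation that applies m1 first, then m2.
Matrix multiply(const Matrix& m1, const Matrix& m2) {
    return {
        m1[0] * m2[0] + m1[1] * m2[2],
        m1[0] * m2[1] + m1[1] * m2[3],
        m1[2] * m2[0] + m1[3] * m2[2],
        m1[2] * m2[1] + m1[3] * m2[3],
        m1[4] * m2[0] + m1[5] * m2[2] + m2[4],
        m1[4] * m2[1] + m1[5] * m2[3] + m2[5]
    };
}

// Maps a point through a matrix.
Point transform(const Point& p, const Matrix& m) {
    return {p.x * m[0] + p.y * m[2] + m[4],
            p.x * m[1] + p.y * m[3] + m[5]};
}

// Where the text-space origin lands in user space: go through the text
// matrix first, then through the CTM.
Point textOriginInUserSpace(const Matrix& textMatrix, const Matrix& ctm) {
    return transform({0, 0}, multiply(textMatrix, ctm));
}
```

The order of multiplication matters. With the row-vector convention used in the spec, the matrix closest to the text (the text matrix) goes on the left and the CTM on the right.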
Since there are many ways to achieve the same results, it is important to implement all the commands thoroughly, even if the documents you are going to process might not seem to need certain features. For example, I ran across a document that set all text sizes to one point, which threw off all my calculations until I realized it was using the text scaling factor to set the actual font sizes.
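To make the one-point example concrete, here is a sketch of how you might compute the size the text actually appears at, instead of trusting the size operand alone. It assumes the scaling sits in the text matrix (possibly combined with the CTM) and in the Tz percentage. GlyphScale and effectiveFontSize are hypothetical names, and the matrix passed in is assumed to be the text matrix already multiplied by the CTM.

```cpp
#include <array>
#include <cmath>

// PDF matrix [a b c d e f], row-vector convention.
using Matrix = std::array<double, 6>;

// Rendered size of one em along the text's own x and y axes.
struct GlyphScale {
    double horizontal;
    double vertical;
};

// fontSize:          the size operand given to Tf
// horizScalePercent: the Tz value (100 means unscaled)
// m:                 text matrix multiplied by the CTM
GlyphScale effectiveFontSize(double fontSize, double horizScalePercent,
                             const Matrix& m) {
    const double th = horizScalePercent / 100.0;
    // Lengths of the transformed unit vectors. Using hypot keeps the
    // result correct for rotated or skewed text as well.
    const double xScale = std::hypot(m[0], m[1]);
    const double yScale = std::hypot(m[2], m[3]);
    return {fontSize * th * xScale, fontSize * yScale};
}
```

For a document that selects a 1 point font and then uses a text matrix of [12 0 0 12 72 700], this reports a size of 12 in both directions, which is what you would want to use for layout calculations.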