We get a large amount of data from our clients in pdf files in varying formats [layout-wise], these files are typically report output, and are typically properly annotated [
We use Xpdf in one of our applications. Its a c++ library which is primarily used for pdf rendering, although it does have a text extractor which could be useful for this project.