Extracting table contents from a collection of PDF files [closed]

江枫思渺然 submitted on 2019-11-28 03:12:46
  1. From its inception (more than 20 years ago), the PDF format was never intended to host extractable, meaningfully structured data.

  2. Its purpose was to be a reliable visual representation of text, images and diagrams in a document -- a kind of digital paper (that could also be reliably transferred to real paper via printing). Only later in its development were features added that are meant to help with extracting data again (search for "Tagged PDF").

  3. For some examples of the problems that arise when scraping tables from PDFs, see this article:

  4. Contradicting my point '1.' above, I'll now say this: for an amazing family of tools that keeps getting better from week to week at extracting tabular data from PDFs (unless the pages are scanned images), see these links:

So: go look for Tabula. If any tool can do what you want, Tabula is probably among the best for the job at this time!
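
Since the question is about a whole collection of PDF files, here is a rough sketch of how one might drive Tabula's command line tool (tabula-java) from Python, once per file. It is an illustration under assumptions, not a tested recipe for your documents: the jar file name and the helper names `build_tabula_command` and `extract_all` are hypothetical, Java must be installed, and the `--pages`, `--format` and `--outfile` options should be checked against the help output of the Tabula version you download.

```python
import subprocess
from pathlib import Path

# Assumption: a locally downloaded tabula-java "jar-with-dependencies" file.
# The file name below is a placeholder, not an actual release name.
TABULA_JAR = Path("tabula-jar-with-dependencies.jar")


def build_tabula_command(pdf_path, csv_path, jar_path=TABULA_JAR, pages="all"):
    """Return the argument list for one Tabula run (nothing is executed here)."""
    return [
        "java", "-jar", str(jar_path),
        "--pages", pages,            # which pages to scan for tables
        "--format", "CSV",           # write the tables out as CSV
        "--outfile", str(csv_path),  # where the CSV should go
        str(pdf_path),               # the PDF to read
    ]


def extract_all(pdf_dir, out_dir, jar_path=TABULA_JAR):
    """Run Tabula once per *.pdf in pdf_dir, writing <name>.csv into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        csv_path = out_dir / (pdf_path.stem + ".csv")
        # check=True raises CalledProcessError if Tabula exits with an error.
        subprocess.run(build_tabula_command(pdf_path, csv_path, jar_path), check=True)
        written.append(csv_path)
    return written
```

Calling `extract_all("pdfs", "csv_out")` would then produce one CSV per PDF in the (hypothetical) `pdfs` folder. Expect to inspect the results by hand: as said above, scanned pages will not work, and oddly laid-out tables may need different Tabula options.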


Update

I've recently created an asciinema screencast demonstrating the use of the Tabula command line interface to extract a big table from a PDF as CSV:

(Click on the image above to see it running. If it runs too fast for you to read all the text, use the "Pause" button (the || symbol).)

It is hosted here:
