Scraping large PDF tables that span multiple pages

野的像风 2021-02-04 07:14

I am trying to scrape PDF tables that span multiple pages. I have tried many things, but the best option so far seems to be pdftotext -layout, as advised here. The problem is t
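For reference, here is a minimal sketch of the pdftotext -layout approach mentioned above. It assumes that columns are separated by runs of two or more spaces and that the header row is repeated at the top of each page; neither is guaranteed for a given PDF. pdftotext ends each page with a form feed character, which is what lets the pages be stitched back together. The function names and the file name in the usage note are hypothetical.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` and return its output as a string.

    Assumes the pdftotext binary is installed and on PATH.
    The trailing "-" sends the text to stdout instead of a file.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def rows_from_layout_text(text):
    """Split layout text into rows of cells, merging all pages.

    Assumption: columns are separated by 2+ spaces, and any later row
    identical to the first row seen (the header) is a repeated header.
    """
    rows = []
    header = None
    for page in text.split("\f"):  # pdftotext ends each page with a form feed
        for line in page.splitlines():
            if not line.strip():
                continue  # skip blank lines
            cells = re.split(r"\s{2,}", line.strip())
            if header is None:
                header = cells  # first non-blank row is treated as the header
            elif cells == header:
                continue  # repeated header on a later page
            rows.append(cells)
    return rows
```

Usage would look like rows_from_layout_text(pdf_to_layout_text("report.pdf")). This heuristic breaks when a cell is empty or itself contains two consecutive spaces, so the result still needs checking against the original PDF.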

7 answers
  •  一向 (original poster)
    2021-02-04 07:23

    If you're wary of diving too deeply into Python or other code-based solutions, a completely different approach, and a quick-and-dirty one for a small number of PDFs, is to outsource the task to Amazon Mechanical Turk.

    Assigning multiple workers to each column lets you double-check the submitted answers. You can also publish the resulting .csv table and pay a large amount (say, $5) for every error that a worker can find. This often ends up being far cheaper than the time you or others would spend programming a solution.
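
The double-checking step could be sketched as follows. This is a minimal sketch, assuming each worker's transcription of a column arrives as a list of strings and that all lists have the same length. The function name and the sample data are hypothetical, and nothing here uses a Mechanical Turk API.

```python
from collections import Counter


def consensus_column(submissions):
    """Majority-vote a column transcribed independently by several workers.

    Returns (values, disputed): the most common value for each row, plus
    the indices of rows where the workers did not all agree.
    """
    values, disputed = [], []
    for i, cells in enumerate(zip(*submissions)):
        counts = Counter(cell.strip() for cell in cells)
        values.append(counts.most_common(1)[0][0])
        if len(counts) > 1:
            disputed.append(i)  # at least one worker disagreed on this row
    return values, disputed


# Hypothetical sample: three workers transcribe the same three-row column.
workers = [
    ["12.5", "7", "103"],
    ["12.5", "7", "108"],
    ["12.5", "7", "103"],
]
values, disputed = consensus_column(workers)
print(values)    # ['12.5', '7', '103']
print(disputed)  # [2]
```

Only the disputed rows then need a manual look against the PDF, which keeps the checking effort small.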
