Scraping large PDF tables that span multiple pages


野的像风 2021-02-04 07:14

I am trying to scrape PDF tables which span across multiple pages. I tried many things but the best seems to be pdftotext -layout as advised here. The problem is t
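For reference, the `pdftotext -layout` approach mentioned above could be sketched roughly as follows. This is a minimal sketch, not the asker's actual code: it assumes the Poppler `pdftotext` binary is on the PATH, that columns are separated by at least two spaces, and that the header row repeats at the top of every page. The function names `pdf_to_layout_text` and `layout_text_to_rows` are hypothetical.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` and return the text it writes to stdout ("-")."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def layout_text_to_rows(text):
    """Merge the rows of a multi-page table into a single list of rows.

    Assumption: cells are separated by two or more spaces, and the first
    non-blank line is a header that may repeat on later pages.
    """
    rows = []
    header = None
    for page in text.split("\f"):  # pdftotext separates pages with form feeds
        for line in page.splitlines():
            if not line.strip():
                continue  # skip blank lines
            cells = re.split(r"\s{2,}", line.strip())
            if header is None:
                header = cells  # first row seen is treated as the header
            elif cells == header:
                continue  # drop the header repeated on a later page
            rows.append(cells)
    return rows
```

Cells that themselves contain double spaces, or empty cells, would break this splitting rule, so real documents may need column positions instead.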

7 answers

一向 (original poster)

2021-02-04 07:23

If you're wary of diving too deeply into Python or other code-based solutions, a completely different approach, as a quick-and-dirty solution for a small number of PDFs, is to outsource the task to Amazon Mechanical Turk.

Assigning several workers to each column lets you cross-check the submitted answers. You can also publish the resulting .csv table and pay a large bounty (say, $5) for every error that a worker can find. This often ends up being far cheaper than your own or someone else's time spent programming a solution.
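The cross-checking step could itself be automated once the workers' tables come back. The following is a minimal sketch, assuming each worker returns a table of the same shape as a list of rows of strings; the function name `reconcile` and the majority-vote rule are illustrative choices, not part of any Mechanical Turk API.

```python
from collections import Counter


def reconcile(submissions):
    """Majority-vote each cell across workers.

    `submissions` is a list of tables (one per worker), each a list of rows
    with the same shape. Returns the consensus table and a list of
    (row, column) positions where no value won a strict majority.
    """
    consensus, disputed = [], []
    for r, rows in enumerate(zip(*submissions)):      # row r from every worker
        out_row = []
        for c, values in enumerate(zip(*rows)):       # cell (r, c) from every worker
            counts = Counter(v.strip() for v in values)
            value, votes = counts.most_common(1)[0]
            if votes * 2 <= len(values):              # no strict majority
                disputed.append((r, c))
            out_row.append(value)
        consensus.append(out_row)
    return consensus, disputed
```

Cells listed in `disputed` are the ones worth re-checking by hand or sending out for another round.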
