Scraping large PDF tables that span multiple pages


野的像风 2021-02-04 07:14

I am trying to scrape PDF tables that span multiple pages. I tried many things, but the best option seems to be pdftotext -layout, as advised here. The problem is t

7 answers

粉色の甜心 (original poster)

2021-02-04 07:21

Although the layout differs across pages when using pdftotext, note that the column headings on each page (COMARCA, CODI, etc.) seem to line up with the data on that page.

Also, your PDF contains many different types of data: wind direction, wind strength, humidity, precipitation, etc. So the layout differs not only across pages for the same data, but also because the pages hold different data sets.

And just for completeness: the data that appears to be missing for "Solsonès" (as one example) is also missing in the original PDF. It seems pdftotext did a reasonable job, since the gap is whitespace in its output, just as in the original PDF.

As a result, it may make sense to stay with pdftotext and treat the pages (which are separated by form feeds) as columnar data and parse using struct as documented here:

How to efficiently parse fixed width files?
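As a rough illustration of the struct approach, here is a minimal sketch. The column widths in WIDTHS and the helper name parse_fixed are hypothetical, chosen only for the example; real widths would have to come from each page's layout. Because struct operates on bytes, the sketch assumes the text fits in a single-byte encoding (latin-1) so that one character is one byte; names such as "Solsonès" would otherwise shift the offsets under UTF-8.

```python
import struct

# Hypothetical column widths, for illustration only; real widths would
# come from the layout of each page.
WIDTHS = (12, 6, 8)

# Builds the format "12s6s8s": three fixed-width byte fields.
RECORD = struct.Struct("".join(f"{w}s" for w in WIDTHS))


def parse_fixed(line):
    """Split one fixed-width text line into stripped string fields."""
    # Pad short lines and cut long ones so the length matches the format.
    padded = line.ljust(RECORD.size)[:RECORD.size]
    # Assumption: every character is representable in latin-1, so one
    # character maps to exactly one byte. Other characters raise an error.
    raw = padded.encode("latin-1")
    return [field.decode("latin-1").strip() for field in RECORD.unpack(raw)]
```

A field that is blank in the pdftotext output comes back as an empty string, which matches the whitespace gaps described above.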

One way to make this work would be to detect the form feed, look for the next line starting with "COMARCA", and use the spacing in that line to set up the columns for struct.
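The steps above could be sketched roughly as follows. This version uses plain string slicing instead of struct, which sidesteps the bytes-versus-characters issue with accented names. The function names are made up for the example, and it rests on several assumptions that may not hold for your PDF: each heading is a single word, there is one table per page, and every value starts at or after its heading's start column (right-aligned numbers could break this).

```python
import re


def column_starts(header):
    """Start offset of every whitespace-separated heading on the header line."""
    return [m.start() for m in re.finditer(r"\S+", header)]


def split_row(line, starts):
    """Slice a line from each heading's start to the next heading's start."""
    # The first column is taken from offset 0 so leading text is not lost.
    begins = [0] + starts[1:]
    ends = starts[1:] + [None]
    return [line[a:b].strip() for a, b in zip(begins, ends)]


def parse_pages(text, marker="COMARCA"):
    """Yield (headers, rows) for every page that contains a header line."""
    # pdftotext separates pages with form feeds ("\f").
    for page in text.split("\f"):
        starts = None
        headers = []
        rows = []
        for line in page.splitlines():
            if starts is None:
                # Skip everything until the header line for this page.
                if line.lstrip().startswith(marker):
                    starts = column_starts(line)
                    headers = split_row(line, starts)
                continue
            if line.strip():
                rows.append(split_row(line, starts))
        if starts is not None:
            yield headers, rows
```

Since the column offsets are recomputed from each page's own header line, pages with different layouts or different data sets are handled independently.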
