Extract table data from PDF [closed]

↘锁芯ラ submitted on 2019-11-30 14:00:21

If the PDF document lacks the information that marks content as a table, row, cell, etc. (known as tags), there is no consistent way to extract tables from it. Most PDF documents do not contain these tags. Tags typically serve to make a PDF accessible, for example so that it can be read aloud, and they are not required for a PDF to be valid.

What you can do, however, is run pdftotext -layout input.pdf output.txt. It writes the PDF's text to a text file while preserving the original layout. There are still no tags, but with a bit of nifty scripting (Perl / PHP / whatever), you can recover the data from the tables.
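As a rough illustration of that kind of scripting, here is a minimal Python sketch. It assumes pdftotext is installed and on the PATH, and that cells in the layout output are separated by at least two spaces; the function names and the sample text are made up for the example and will not fit every document.

```python
import re
import subprocess


def pdf_to_layout_text(pdf_path):
    """Run `pdftotext -layout` and return its output as a string.

    Passing "-" as the output file makes pdftotext write to stdout.
    Requires the pdftotext binary to be installed.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def split_rows(layout_text, min_gap=2):
    """Split every non-blank line into cells on runs of >= min_gap spaces.

    Assumption: single spaces occur only inside a cell ("Unit price"),
    while wider gaps separate columns.
    """
    gap = re.compile(r" {%d,}" % min_gap)
    rows = []
    for line in layout_text.splitlines():
        if line.strip():
            rows.append(gap.split(line.strip()))
    return rows


# Hypothetical sample of what -layout output might look like.
sample = (
    "Item          Qty    Unit price\n"
    "Steel bolt     12          0.35\n"
    "\n"
    "Copper wire     3         14.00\n"
)
rows = split_rows(sample)
for row in rows:
    print(row)
```

For a real file you would call split_rows(pdf_to_layout_text("input.pdf")) and then filter out the lines that are not part of the table.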

If you're working on a single page, you're probably better off doing it manually, but if you (like me) have to work on hundreds or thousands of pages, it's about the best you can get. I've been looking around for a long time and haven't found a better PDF-to-text tool than pdftotext.

The output is somewhat inconsistent: similar-looking PDF tables do not always produce similar-looking text output, but that makes your scripting a little more interesting.
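One way to cope with some of that inconsistency, sketched here in Python under the assumption that the columns of a table stay vertically aligned in the text output, is to look for character positions that are blank in every row and cut the lines there. This keeps empty cells in place, where a plain whitespace split would shift the remaining values to the left. The function names and sample data are invented for the example.

```python
def column_spans(padded_lines):
    """Return (start, end) spans of character columns that hold text.

    A position counts as a separator only if it is a space in every line.
    All lines must already be padded to the same width.
    """
    width = len(padded_lines[0])
    spans, start = [], None
    for i in range(width):
        blank = all(line[i] == " " for line in padded_lines)
        if not blank and start is None:
            start = i                      # a text column begins here
        elif blank and start is not None:
            spans.append((start, i))       # the text column ended
            start = None
    if start is not None:
        spans.append((start, width))
    return spans


def slice_table(lines):
    """Cut aligned text lines into cells, keeping empty cells as ''."""
    lines = [line for line in lines if line.strip()]
    if not lines:
        return []
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    spans = column_spans(padded)
    return [[line[a:b].strip() for a, b in spans] for line in padded]


# Build an aligned sample with missing cells, similar to -layout output.
records = [("Name", "Qty", "Note"), ("Bolt", "12", "zinc"),
           ("Wire", "3", ""), ("Nut", "", "M6")]
sample_lines = [f"{a:<8}{b:>3}  {c}".rstrip() for a, b, c in records]
table = slice_table(sample_lines)
for row in table:
    print(row)
```

This is only a heuristic: a space that happens to line up in every row inside a multi-word cell would be mistaken for a column boundary, so it works best when applied to one table at a time rather than to a whole page.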
