Extract table from a PDF

北恋 2021-02-11 02:04

I am trying to extract a table from a PDF document.

I tried the route PDF -> HTML -> extract table. The PDF mentioned above, when converted to HTML, produces gar

1 answer
  • 2021-02-11 02:31

    The PDF does not contain explicit table data. It contains only lines and character glyphs, which we humans tend to interpret as tables. Your task therefore amounts to putting our human table recognition capabilities into code, which is quite a task.
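    To illustrate what "putting table recognition into code" means at its simplest, here is a minimal sketch that groups positioned text fragments into rows by their vertical position. It assumes you already have `(x, y, text)` tuples from some PDF library; the function name `group_into_rows`, the tuple layout, and the tolerance value are illustrative assumptions, not part of any particular API.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows, each ordered left to right.

    Assumption: y grows upward, as in the default PDF coordinate system,
    so fragments with larger y are nearer the top of the page.
    """
    rows = []  # each entry: [y of the first fragment in the row, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        # A fragment joins the current row if it is close enough vertically.
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    # Within each row, order the cells by their x position.
    return [[text for _, text in sorted(cells, key=lambda c: c[0])]
            for _, cells in rows]


# Hypothetical sample fragments, as a text extractor might report them.
sample = [
    (200.0, 700.2, "Qty"), (50.0, 700.0, "Item"),
    (50.0, 685.1, "Bolt"), (200.0, 684.9, "12"),
]
table = group_into_rows(sample)
```

    Real documents need considerably more than this (multi-line cells, merged cells, ruling lines), which is why the general problem is hard.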

    Generally speaking, if you are reasonably sure that future PDFs will be generated by the same software in a very similar manner, it may be worth the time to inspect the file for easy-to-follow hints that let you recognize the contents of individual fields.
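    One such hint could be fixed column positions. The sketch below assumes that inspecting a sample file showed every column starting at a stable x coordinate; the boundaries in `COLUMN_STARTS`, the names in `COLUMN_NAMES`, and the function `row_to_record` are made up for illustration and would have to be replaced by values measured in your own documents.

```python
from bisect import bisect_right

# Hypothetical left edges of the columns, measured once in a sample file.
COLUMN_STARTS = [40.0, 180.0, 320.0]
COLUMN_NAMES = ["item", "quantity", "price"]


def row_to_record(cells, starts=COLUMN_STARTS, names=COLUMN_NAMES):
    """Map the (x, text) cells of one row onto named columns by x position."""
    record = {name: "" for name in names}
    for x, text in sorted(cells, key=lambda c: c[0]):
        # Index of the last column whose left edge is at or before x.
        index = bisect_right(starts, x) - 1
        if index < 0:
            continue  # fragment lies left of the first column; ignore it
        name = names[index]
        # Fragments falling into the same column are joined with a space.
        record[name] = (record[name] + " " + text).strip()
    return record


record = row_to_record([(185.0, "12"), (45.0, "Hex"), (70.0, "bolt"), (325.0, "0.40")])
```

    This only works as long as the generating software keeps the layout stable, which is exactly the precondition stated above.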

    Your specific document, though, has an additional shortcoming: It does not contain the required information for direct text extraction! You can try copying & pasting from Adobe Reader and you'll get (at least I do) semi-random characters from the WinAnsi range.

    This is because all fonts in the document claim to use WinAnsiEncoding, even though the characters referenced this way are definitely not from the WinAnsi character set.

    Thus reliable text extraction from your document without OCR is impossible after all!

    (Copying and pasting from Adobe Reader is generally a good first test of whether text extraction is feasible at all; the Reader's text extraction methods have been refined over many years and have become quite good. If you cannot extract anything sensible with Adobe Reader, text extraction will be a very difficult task indeed.)
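    If you want to run the same kind of sanity check on many files, a rough sketch is to compare the extracted text against a few words you can read on the rendered page. The function `missing_keywords` and the sample strings are hypothetical; the extracted text is assumed to come from whatever extraction tool you use.

```python
def missing_keywords(extracted_text, expected_keywords):
    """Return the expected keywords that do not occur in the extracted text.

    Comparison ignores case and collapses runs of whitespace.
    """
    haystack = " ".join(extracted_text.casefold().split())
    return [
        keyword for keyword in expected_keywords
        if " ".join(keyword.casefold().split()) not in haystack
    ]


# Words visibly present on the page (hypothetical).
expected = ["invoice no.", "total"]

good = missing_keywords("Invoice  No. 17\nTotal: 12.00", expected)
# Stand-in for mis-encoded output: letters, but not the words on the page.
bad = missing_keywords("Lqyrlfh Qr. 17 Wrwdo: 12.00", expected)
```

    An empty result suggests the extraction is usable; if none of the words you can plainly see on the page turn up, the fonts are probably mis-encoded as described above and OCR is the remaining option.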
