pdf-extraction

If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?

半城伤御伤魂 submitted on 2019-11-26 22:34:54
Question: I have been trying to write a simple console application or PowerShell script to extract the text from a large number of PDF documents. Several libraries and CLI tools offer to do this, but it turns out that none of them can reliably identify document structure. In particular, I am concerned with the recognition of text columns. Even the very expensive PDFLib TET tool frequently jumbles the content of two adjacent columns of text. It is frequently noted that the PDF format does