Extract text of PDF without tool

Submitted by 你说的曾经没有我的故事 on 2020-01-25 10:18:06

Question


Currently I'm extracting the text of PDFs with the iTextSharp library (in VB.NET). I'd like to be independent of other tools / libraries, as I can't distribute them to others along with my program.

Is there a solution (no .dll etc) in any programming language to quickly extract the text of a PDF?


Answer 1:


Short answer:

Of course there is a way of doing this. iText (like many other PDF libraries) is capable of doing it, so an algorithm for extracting text clearly exists.

Long answer:

PDF is not a WYSIWYG format. A PDF document is sort of an ungodly marriage between "objects that reference each other" and "programming language".

Let me explain. A PDF document has a graphics state. So whenever you see text in a PDF document (in a viewer like Adobe Reader), you are essentially seeing the result of some 'code' in the PDF document that says

Go to position 50, 720
Set the active font to Helvetica, fontsize 12
Set the active drawing color to black
draw the glyph that corresponds to the character 'H'
Go to position 53, 720
draw the glyph that corresponds to the character 'e'
etc
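To make this concrete, here is a minimal sketch of how such instructions look in a page's content stream and how a toy interpreter could pick the text out of them. The operators `BT`/`ET`, `Tf`, `rg`, `Td` and `Tj` are standard PDF operators; everything else is an assumption made for illustration: the stream is hand-written and uncompressed, strings contain no escapes or nested parentheses, and character codes are treated as Latin-1, which real fonts do not guarantee. The function name `show_text_ops` is made up.

```python
import re

# Hand-written, uncompressed content stream (real ones are usually compressed).
CONTENT = b"""BT
/F1 12 Tf
0 0 0 rg
50 720 Td
(Hello) Tj
0 -14 Td
(World) Tj
ET"""

NUMBER = rb"[-+]?\d*\.?\d+"
# Tokens: literal strings "(...)", names "/F1", numbers, then operators.
TOKEN = re.compile(rb"\((?:[^()\\]|\\.)*\)|/[^\s()/]+|" + NUMBER + rb"|[A-Za-z*'\"]+")

def show_text_ops(stream):
    """Return (x, y, font, size, text) for each Tj operator (toy interpreter)."""
    operands, out = [], []
    x = y = 0.0              # start of the current text line
    font, size = None, 0.0   # the part of the graphics state tracked here
    for tok in TOKEN.findall(stream):
        if tok[:1] in b"(/" or re.fullmatch(NUMBER, tok):
            operands.append(tok)  # PDF is postfix: operands precede the operator
            continue
        if tok == b"BT":          # begin text object: reset the position
            x = y = 0.0
        elif tok == b"Tf":        # set font and size
            font, size = operands[0].decode(), float(operands[1])
        elif tok == b"Td":        # move relative to the start of the current line
            x += float(operands[0])
            y += float(operands[1])
        elif tok == b"Tj":        # show a string
            out.append((x, y, font, size, operands[0][1:-1].decode("latin-1")))
        operands = []             # every other operator is simply ignored
    return out

for op in show_text_ops(CONTENT):
    print(op)
```

Even this toy version has to carry state (font, position) from one operator to the next, which is the "programming language" half of the format.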

Instructions and resources (like fonts, images, vector graphics) can be grouped together in objects.

Each object is assigned a number, and is mentioned explicitly in the cross-reference table (at the end of the PDF document).
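As a rough sketch of what "looking an object up in the cross-reference table" means, the snippet below builds a tiny PDF-like byte string in memory and then reads its table back. It assumes the simplest possible case: one classic `xref` section with a single subsection. It does not handle cross-reference streams (PDF 1.5 and later), incremental updates, or compressed objects. `build_toy_pdf` and `read_xref` are hypothetical helper names.

```python
def build_toy_pdf():
    """Assemble a tiny PDF-like file with a classic xref table (illustration only)."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R >>",
    ]
    data = b"%PDF-1.4\n"
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(data))              # byte offset where "N 0 obj" starts
        data += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_pos = len(data)
    data += b"xref\n0 %d\n" % (len(objects) + 1)
    data += b"0000000000 65535 f \n"           # object 0 is always a free entry
    for off in offsets:
        data += b"%010d 00000 n \n" % off      # fixed-width entries, 'n' = in use
    data += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    data += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return data

def read_xref(data):
    """Map object number -> byte offset for a single classic xref section."""
    # The last "startxref" keyword tells us where the table begins.
    pos = data.rfind(b"startxref")
    start = int(data[pos + len(b"startxref"):].split()[0])
    lines = data[start:].split(b"\n")
    if lines[0].strip() != b"xref":
        raise ValueError("not a classic xref table")
    first, count = map(int, lines[1].split())
    table = {}
    for i in range(count):
        offset, _generation, kind = lines[2 + i].split()
        if kind == b"n":
            table[first + i] = int(offset)
    return table

pdf = build_toy_pdf()
for num, offset in read_xref(pdf).items():
    # Show the first line of each object found through the table.
    print(num, offset, pdf[offset:].split(b"\n", 1)[0])
```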

So, in order to read the text from a PDF document you would need to:

  1. read the XREF table
  2. figure out where (byte location) the /Page objects start
  3. parse each /Page object and all its sub-objects (again using the XREF table to figure out where in the file each of these sub-objects is)
  4. parse geometrical instructions (the graphics state does not need to flow in the same direction as the text)
  5. sort all visible characters (comparing background and foreground color, occlusion by other objects such as images, etc) according to the direction you expect the text to be written in
  6. build the return string
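The last two steps can be sketched as follows, assuming the earlier steps have already produced a list of positioned glyphs. This is a simplified illustration, not how iText or any other library does it: it assumes horizontal, left-to-right text, skips the visibility checks entirely, and uses made-up tolerances (`line_tol`, `space_gap`) to decide what counts as the same line and what counts as a word gap.

```python
def build_text(glyphs, line_tol=2.0, space_gap=3.0):
    """Turn (x, y, width, char) tuples into a string.

    Coordinates are in PDF user space, where y grows upward, so the line
    with the largest y comes first.
    """
    lines = []  # each entry: [baseline_y, [glyph, ...]]
    for g in sorted(glyphs, key=lambda g: -g[1]):
        if lines and abs(lines[-1][0] - g[1]) <= line_tol:
            lines[-1][1].append(g)        # close enough: same line
        else:
            lines.append([g[1], [g]])     # start a new line
    out = []
    for _, row in lines:
        row.sort(key=lambda g: g[0])      # left to right within the line
        text, prev_end = "", None
        for x, _, width, ch in row:
            # PDFs often contain no space characters at all; a horizontal
            # gap between glyphs is the only hint of a word break.
            if prev_end is not None and x - prev_end > space_gap:
                text += " "
            text += ch
            prev_end = x + width
        out.append(text)
    return "\n".join(out)

# Glyphs deliberately listed out of reading order, as a PDF is allowed to do.
glyphs = [
    (56, 720, 6, "i"), (50, 706, 6, "o"), (50, 720, 6, "H"),
    (70, 720, 6, "y"), (82, 720, 6, "u"), (76, 720, 6, "o"),
    (56, 706, 6, "k"),
]
print(build_text(glyphs))
```

Choosing the two tolerances is where real documents get hard: superscripts, justified text, rotated text and multi-column layouts all break simple thresholds like these.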

And that is probably why other people use libraries. Don't get me wrong, I'm a huge fan of doing it yourself (it's the best way to gain deep knowledge of how certain things work).

But look at it from the point of view of one of your users. What would you trust more?

  • A program that uses 'self written' code to handle PDF documents (total experience in parsing PDF documents < 1 year),
  • or a program that simply calls a PDF library (total experience in parsing PDF documents > 20 years)


Source: https://stackoverflow.com/questions/54407080/extract-text-of-pdf-without-tool
