If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?


I once wrote an algorithm that did exactly what you mentioned for a PDF editor product that is still the number one PDF editor used today. There are a couple of reasons for the difference you describe (I think), but the most important one is focus.

You are correct that PDF (usually) doesn't contain any structure information. PDF is interested in the visual representation of a page, not necessarily in what the page "means". In its purest form, it doesn't need information about lines, paragraphs, columns or anything like that. Actually, it doesn't even need information about the text itself: there are plenty of PDF files where you can't even copy and paste the text without ending up with gibberish, usually because the fonts are embedded without the mapping that ties character codes back to Unicode.

So if you want to be able to extract formatted text, you indeed have to look at all of the pieces of text on the page, perhaps taking some of the line-art information into account as well, and you have to piece them back together. Usually that happens by writing an engine that looks at white-space and then decides first what the lines are, then the paragraphs, the columns and so on. Tables, for example, are notoriously difficult because they are so diverse.
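To make that white-space step concrete, here is a minimal sketch, assuming the page has already been parsed into positioned text chunks (the `Chunk` tuples, the tolerance values, and the crude width estimate are my own illustrative choices, not taken from any particular engine):

```python
from typing import List, Tuple

# (x, y, text) in PDF user space, y increasing upward.
Chunk = Tuple[float, float, str]

def group_into_lines(chunks: List[Chunk], y_tolerance: float = 2.0) -> List[List[Chunk]]:
    """Cluster chunks whose baselines lie within y_tolerance into lines."""
    lines: List[List[Chunk]] = []
    # Sort top-to-bottom (descending y), then left-to-right.
    for chunk in sorted(chunks, key=lambda c: (-c[1], c[0])):
        if lines and abs(lines[-1][0][1] - chunk[1]) <= y_tolerance:
            lines[-1].append(chunk)  # same baseline: same line
        else:
            lines.append([chunk])    # new baseline: start a new line
    return lines

def lines_to_text(lines: List[List[Chunk]], gap_threshold: float = 4.0) -> str:
    """Join each line's chunks, treating a wide horizontal gap as a word break."""
    out = []
    for line in lines:
        parts = [line[0][2]]
        for prev, cur in zip(line, line[1:]):
            # Crude width estimate; a real engine would use actual glyph metrics.
            gap = cur[0] - (prev[0] + len(prev[2]) * 5.0)
            parts.append((" " if gap > gap_threshold else "") + cur[2])
        out.append("".join(parts))
    return "\n".join(out)
```

A production engine does the same thing with real glyph widths, and then repeats the idea at a larger scale: vertical white-space for paragraphs, horizontal white-space for columns.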

Alternative strategies could be to:

  • Look at some of the structure information that is available in some PDF files. Some PDF/A files and all PDF/UA files (PDF for archival and PDF for Universal Accessibility) must have structure information that can very well be used to retrieve the logical structure. Other PDF files may have that information as well (see the sketch after this list).
  • Look at the creator of the PDF document and have specific algorithms to handle those PDFs well. If you know you're only interested in Word, or if you know that 99% of the PDFs you will ever handle will come out of Word 2011, it might be worth using that knowledge (also sketched below).
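A quick sketch of both probes, here using pikepdf (any library that exposes the document catalog and info dictionary would do; the file name and the Word check are placeholders):

```python
import pikepdf

with pikepdf.open("input.pdf") as pdf:
    # Strategy 1: a tagged PDF (required for PDF/UA) carries a structure
    # tree under /StructTreeRoot in the document catalog.
    if "/StructTreeRoot" in pdf.Root:
        print("Structure tree present: logical structure can be read directly.")

    # Strategy 2: sniff the producing application from the info dictionary
    # and dispatch to creator-specific heuristics.
    info = pdf.docinfo
    creator = str(info.get("/Creator", ""))
    producer = str(info.get("/Producer", ""))
    if "Word" in creator or "Word" in producer:
        print("Likely produced by Microsoft Word: use Word-tuned heuristics.")
```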

So why are some products better at this than others? Focus, I guess. The PDF specification is very broad, and some tools focus more on lower-level PDF tasks, some on higher-level ones. Some are oriented towards "office" use, some towards "graphic arts" use. Depending on your focus, you may decide that a certain feature is worth a lot of attention or not.

Additionally (and that may seem like a lousy answer, but I believe it's actually true), this is an algorithmically difficult problem, and it takes only one genius developer to implement an algorithm that is much better than the average product on the market. It's one of those areas where, if you are clever, have enough focus to put some of your attention on it, and especially have a good idea of the target market you are writing for, you'll get it right while everybody else gets it mediocre.

(And no, I didn't get it right back then when I was writing that code; we never had enough focus to follow through and make something that was really good.)

To properly extract formatted text, a library/utility should:

  1. Retrieve correct information about the properties of the fonts used in the PDF (glyph sizes, hinting information, etc.).
  2. Maintain the graphics state (i.e. non-font parameters like text and page scaling; a sketch of why this matters follows after this list).
  3. Implement some algorithm to decide which symbols on a page should be treated as words, lines or columns.
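To illustrate point 2: the size at which text actually appears depends not just on the font-size operand but on the whole chain of transformations. A toy sketch with invented numbers (horizontal scaling and text rise are ignored for brevity):

```python
def matmul(a, b):
    """Compose two PDF matrices given as [a b c d e f], i.e. the 3x3
    matrix [[a, b, 0], [c, d, 0], [e, f, 1]]."""
    return [
        a[0] * b[0] + a[1] * b[2],
        a[0] * b[1] + a[1] * b[3],
        a[2] * b[0] + a[3] * b[2],
        a[2] * b[1] + a[3] * b[3],
        a[4] * b[0] + a[5] * b[2] + b[4],
        a[4] * b[1] + a[5] * b[3] + b[5],
    ]

font_size = 12.0                      # operand of the Tf operator
tm = [2.0, 0, 0, 2.0, 100.0, 700.0]   # text matrix: text scaled 2x
ctm = [0.75, 0, 0, 0.75, 0, 0]        # page scaled to 75%

trm = matmul(tm, ctm)                 # text rendering matrix
print(font_size * trm[3])             # 18.0: nominal 12pt text renders at 18pt
```

A tool that skips this bookkeeping will mis-estimate glyph sizes and spacing, and every downstream grouping decision inherits the error.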

I am not really an expert in the products you mentioned in your question, so the following conclusions should be taken with a grain of salt.

The tools that do not draw PDFs tend to have less expertise in the first two requirements. They have not had to deal with font details at a deeper level, and they might not be as well tested in maintaining the graphics state.

Any decent tool that translates PDFs to images will probably become aware of its shortcomings in text positioning sooner or later, and fixing those will help it excel at text extraction.
