How to extract data from a PDF file while keeping track of its structure?

后端 未结 6 528
醉话见心
醉话见心 2020-12-12 22:52

My objective is to extract the text and images from a PDF file while parsing its structure. The scope for parsing the structure is not exhaustive; I only need to be able to

6条回答
  •  时光说笑
    2020-12-12 23:12

    There is essentially not an easy cut-and-paste solution because PDF isn't really very interested in structure. There are many other answers on this site that will tell you things in much more detail, but this one should give you the main points:

    If identifying text structure in PDF documents is so difficult, how do PDF readers do it so well?

    If you want to do this in PDF itself (where you would have the majority of control over the process), you'll have to loop over all text on pages and identify headers by looking at their text properties (fonts used, size relative to the other text on the page, etc...).

    On top of that you'll also have to identify paragraphs by looking at the positioning of text fragments, white space on the page, closeness of certain letters, words and lines... PDF by itself doesn't even have a concept for a "word", let alone "lines" or "paragraphs".

    To complicate things even more, the way text is drawn on the page (and thus the order in which it appears in the PDF file itself) doesn't even have to be the proper reading order (or what us humans would consider to be proper reading order).

提交回复
热议问题