How to extract text from a PDF file?

孤城傲影 2020-11-22 14:05

I'm trying to extract the text included in this PDF file using Python.

I'm using the PyPDF2 module, and have the following script:

imp         


        
24 answers
  •  太阳男子
    2020-11-22 14:21

    How to extract text from a PDF file?

    The first thing to understand is the PDF format. It has a public specification written in English: see ISO 32000-2:2017, and read the more than 700 pages of the PDF 1.7 specification. At the very least, you need to read the Wikipedia page about PDF.

    Once you have understood the details of the PDF format, extracting text is more or less easy (but what about text appearing in figures or images, such as figure 1 of your file?). Don't expect to write a perfect text extractor on your own in a few weeks.

    On Linux, you might also use pdftotext, which you could run through popen (or the subprocess module) from your Python code.
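As a rough illustration of calling an external converter from Python, here is a minimal sketch. It assumes the tool meant above is Poppler's `pdftotext` and that it is installed and on the PATH. The function names `build_pdftotext_command` and `extract_text` are made up for this example, and `subprocess.run` is used in place of a raw popen call.

```python
import shutil
import subprocess
import sys


def build_pdftotext_command(pdf_path, keep_layout=False):
    """Build the argument list for pdftotext, sending the text to stdout."""
    command = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout.
        command.append("-layout")
    # A trailing "-" as the output file means "write to standard output".
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext on pdf_path and return the extracted text as a str."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if the tool fails
    )
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__" and len(sys.argv) > 1:
    print(extract_text(sys.argv[1]))
```

This only returns whatever text the tool can find in the file's text layer; text that exists only as pixels in an image will not show up.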

    In general, extracting text from a PDF file is an ill-defined problem. What a human reader sees as text may actually be drawn as a figure made of individual dots, or be part of a photo, and so on.

    The Google search engine is capable of extracting text from PDF files, but it is rumored to need more than half a billion lines of source code. Do you have the necessary resources (manpower, budget) to develop a competitor?

    One possibility is to print the PDF to some virtual printer (e.g. using Ghostscript or Firefox), then use OCR techniques to extract the text.

    Instead, I would recommend working on the data representation that generated the PDF file, for example the original LaTeX code (or Lout code) or the OOXML code.
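The render-then-OCR route could be sketched roughly as follows. This is only an assumption-laden outline: it supposes the Ghostscript `gs` binary and the Tesseract `tesseract` command-line tool are both installed, and the helper names (`build_render_command`, `build_ocr_command`, `ocr_pdf`) are hypothetical. Tesseract is one possible OCR engine, not one named in the answer.

```python
import subprocess
import tempfile
from pathlib import Path


def build_render_command(pdf_path, output_pattern, dpi=300):
    """Ghostscript arguments that rasterize each page to a PNG file."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m",                 # 24-bit colour PNG output
        f"-r{dpi}",                        # rendering resolution in dpi
        f"-sOutputFile={output_pattern}",  # e.g. page-%04d.png, one per page
        pdf_path,
    ]


def build_ocr_command(image_path, language="eng"):
    """Tesseract arguments that print the recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", language]


def ocr_pdf(pdf_path, dpi=300, language="eng"):
    """Render every page of pdf_path to an image, OCR it, and join the text."""
    pages = []
    with tempfile.TemporaryDirectory() as workdir:
        pattern = str(Path(workdir) / "page-%04d.png")
        subprocess.run(build_render_command(pdf_path, pattern, dpi), check=True)
        # Zero-padded page numbers keep the sorted order equal to page order.
        for image in sorted(Path(workdir).glob("page-*.png")):
            result = subprocess.run(
                build_ocr_command(image, language),
                capture_output=True,
                check=True,
            )
            pages.append(result.stdout.decode("utf-8", errors="replace"))
    return "\n".join(pages)
```

OCR output quality depends heavily on the rendering resolution and the fonts involved, so the result usually needs manual checking.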

    In all cases, you need to budget at least several person years of software development.
