Copy+pasting text from PDF results in garbage

无人及你 2021-02-20 00:37

I am writing a Master's thesis on an NLP system. One of its components is an extractor.

It extracts plain text from PDF files. There are a few PDF files that cannot be

7 answers
  •  情歌与酒
    2021-02-20 01:32

    Very often in such cases, where you can't select and copy'n'paste text from the Acrobat (Reader) window, there is another option that may work nevertheless:

    • Open 'File' menu,
    • select 'Save as...',
    • select 'Text (normal) (*.txt)',
    • browse to the target directory,
    • type the name you want to use for the text file.

    You'll get all text from all pages in one file and then need to locate the spot you originally wanted to copy'n'paste, so it is not as comfortable as direct copy'n'paste. But it works more reliably.

    It also works with acroread on Linux (but you have to choose 'Save as text...' from the file menu).

    Update

    You can use the pdffonts command line utility to get a quick-shot analysis of the fonts used by a PDF.

    Here is an example output that demonstrates where a problem for text extraction will very likely occur. It uses one of the hand-coded PDF files from a GitHub repository created to provide well-commented PDF sample files that can easily be opened in a text editor:

    $ pdffonts  textextract-bad2.pdf
      name                            type         encoding    emb sub uni object ID
      ------------------------------- ------------ ----------- --- --- --- ---------
      BAAAAA+Helvetica                TrueType     WinAnsi     yes yes yes     12  0
      CAAAAA+Helvetica-Bold           TrueType     WinAnsi     yes yes no      13  0
    

    How to interpret this table?

    • The above PDF file uses two subsetted fonts, Helvetica and Helvetica-Bold (as indicated by the BAAAAA+ and CAAAAA+ prefixes to their names, as well as by the yes entries in the sub column).
    • Both fonts are of type TrueType.
    • Both fonts use a WinAnsi encoding (a font encoding maps the character identifiers used in the PDF source code to the glyphs that should be drawn). However, a /ToUnicode table is available inside the PDF only for font /Helvetica (for /Helvetica-Bold there is none), as indicated by the yes/no in the uni column.
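For an extractor that has to process many files, the reading of the table described above could be automated. The following is only a minimal sketch: it assumes pdffonts is installed and on the PATH and that its output keeps the column layout shown in the example. The names `FontInfo`, `parse_pdffonts` and `fonts_without_to_unicode` are made up for illustration and are not part of any tool.

```python
import subprocess
from typing import List, NamedTuple


class FontInfo(NamedTuple):
    name: str
    font_type: str
    encoding: str
    embedded: bool
    subset: bool
    has_to_unicode: bool


def parse_pdffonts(output: str) -> List[FontInfo]:
    """Parse the table printed by pdffonts into FontInfo records."""
    lines = output.splitlines()
    # Data rows begin after the row of dashes under the column headers.
    start = next(
        (i + 1 for i, line in enumerate(lines) if line.strip().startswith("---")),
        len(lines),
    )
    fonts = []
    for line in lines[start:]:
        # Split from the right: emb, sub, uni, object number, generation.
        fields = line.rsplit(None, 5)
        if len(fields) != 6:
            continue
        left, emb, sub, uni, _obj, _gen = fields
        head = left.split()
        if len(head) < 3:
            continue
        # The type may contain spaces (e.g. "Type 1"), so the name is taken
        # from the front and the encoding from the back of what remains.
        name, encoding = head[0], head[-1]
        font_type = " ".join(head[1:-1])
        fonts.append(
            FontInfo(name, font_type, encoding, emb == "yes", sub == "yes", uni == "yes")
        )
    return fonts


def fonts_without_to_unicode(pdf_path: str) -> List[FontInfo]:
    """Run pdffonts on a file and return the fonts whose uni column is 'no'."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return [font for font in parse_pdffonts(result.stdout) if not font.has_to_unicode]
```

Called on the sample file from the example, `fonts_without_to_unicode("textextract-bad2.pdf")` would be expected to report only CAAAAA+Helvetica-Bold. Splitting each row from the right is a design choice to cope with long font names that push the columns out of alignment.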

    The /ToUnicode table is required to provide a reverse mapping from character identifiers/codes to characters.

    A missing /ToUnicode table for a specific font is almost always a sure indicator that text strings using this font cannot be extracted or copy'n'pasted from the PDF. (Even if a /ToUnicode table is there, text extraction may still pose a problem, because the table may be damaged, incorrect or incomplete, as seen in many real-world PDF files and as demonstrated by a few companion files in the GitHub repository mentioned above.)
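To illustrate why a missing table tends to produce garbage, here is a deliberately simplified model of the lookup. It is a sketch under stated assumptions, not how any particular extractor works: the character codes and the dictionary are invented, a real /ToUnicode table is stored as a CMap stream inside the PDF, and the fallback shown (treating each code as a code point) is only one possible behaviour.

```python
from typing import Dict, Optional, Sequence


def decode_codes(codes: Sequence[int], to_unicode: Optional[Dict[int, str]] = None) -> str:
    """Turn character codes from a text string into readable text."""
    if to_unicode is None:
        # Assumed naive fallback: interpret each code as a code point.
        return "".join(chr(code) for code in codes)
    # With a table, look up each code; mark unmapped codes with U+FFFD.
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


# Hypothetical subset font that numbers its glyphs 1..4 in order of first use.
shown_codes = [1, 2, 3, 3, 4]
to_unicode = {1: "H", 2: "e", 3: "l", 4: "o"}

print(decode_codes(shown_codes, to_unicode))  # Hello
print(repr(decode_codes(shown_codes)))        # control characters, i.e. garbage
```

The glyphs drawn on the page are the same in both cases. Only the reverse mapping decides whether the copied text is readable, and an incomplete table (a code with no entry) degrades the result in the same way for the affected characters.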
