Why can text in a PDF document with embedded fonts be copied but not searched in a PDF reader?

Submitted by 末鹿安然 on 2019-12-11 12:08:49

Question


I am writing PDF files with embedded subset fonts. As required, I am including the ToUnicode and CIDSet objects. To test, I created a simple PDF with two Hebrew characters. I can select the two characters, copy them to the clipboard, and paste them correctly into another application such as Word. But I cannot search for a word containing these two characters: Adobe Reader (or Acrobat) reports that the word was not found. In essence, I have created a PDF document whose text can be copied correctly but cannot be searched. Any idea what I might be missing when creating the document?

Additional information:

1. The file in question is a minimal file with just two characters. I have tested with many such files in many different languages, including English. None of the files are searchable.
2. Curiously, if I search for the letter 'e', Adobe Reader highlights an incorrect word, even though the letter 'e' does not exist in the file.
3. Adobe Acrobat cannot search within this file either. However, when I save the file to another disk file, the saved file is searchable. I confirmed that the major objects, such as the font file, the ToUnicode object, the CID object, and the font descriptor objects, are the same in the saved file. However, one of the font objects is moved closer to the top of the file.
4. Foxit is able to search these files properly.

Relevant PDF objects:

5 0 obj

<>

stream

    q 0.750000 0 0 0.750000 0.000000 792.000000 cm 

    q q q 0.160000 0.000000 0.000000 0.160000 0.000000 0.000000 cm 

    BT /F0 100.000000 Tf 0 g 750.000000 -690 Td[<02B0>] TJ 35.000000 0 Td[<02B9>] TJ ET Q

    Q 

    Q

    Q

endstream

endobj

10 0 obj

<>

endobj

11 0 obj

<> /FontDescriptor 10 0 R/Subtype/CIDFontType2/Type/Font>>

endobj

12 0 obj

<>

endobj

8 0 obj

<>

stream

    /CIDInit /ProcSet findresource begin

    12 dict begin

    begincmap

    /CIDSystemInfo

    << /Registry (Adobe)

    /Ordering (UCS) /Supplement 0 >> def

    /CMapName /Adobe-Identity-UCS def

    /CMapType 2 def

    1 begincodespacerange

    <0000> <FFFF>

    endcodespacerange

    3 beginbfchar

    <0000> <0000>

    <02B0> <05E0>

    <02B9> <05E9>

    endbfchar

    endcmap

    CMapName currentdict /CMap defineresource pop

    end

    end

endstream

endobj
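The ToUnicode mapping above can be cross-checked with a short script. This is only a minimal sketch, not a full CMap parser: it assumes 2-byte character codes (as the `<0000> <FFFF>` codespace range indicates) and handles only `bfchar` sections. The function names `parse_bfchar` and `decode_hex_string` are hypothetical helpers introduced for illustration.

```python
import re

# bfchar section copied from the ToUnicode stream (object 8) above.
TOUNICODE_CMAP = """
3 beginbfchar
<0000> <0000>
<02B0> <05E0>
<02B9> <05E9>
endbfchar
"""


def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} from the bfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination values are UTF-16BE code units.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping


def decode_hex_string(hex_string, mapping):
    """Decode a content-stream hex string, assuming 2-byte codes."""
    codes = [int(hex_string[i:i + 4], 16) for i in range(0, len(hex_string), 4)]
    # Unmapped codes become U+FFFD so gaps in the CMap are visible.
    return "".join(mapping.get(code, "\ufffd") for code in codes)


mapping = parse_bfchar(TOUNICODE_CMAP)
# The two TJ operands from the content stream (object 5).
shown = [decode_hex_string(h, mapping) for h in ("02B0", "02B9")]
print([f"U+{ord(ch):04X}" for ch in shown])
```

Both codes resolve to Hebrew letters (U+05E0 and U+05E9), which is consistent with copy and paste working: the text mapping itself appears sound, so the search failure has to come from somewhere else.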


Answer 1:


In short

The problem is due to identical PDF IDs used for different documents.

In detail

Adobe Reader / Acrobat seem to cache search information for documents identifying the document by its ID. Some of the OP's documents seem to have the same ID, at least the two sample files do:

/ID[<754DC77D28E62763C4916970D595A10F><754DC77D28E62763C4916970D595A10F>] 

Thus, search information from earlier viewed PDFs with that ID was used when the OP tried to search his test.pdf. Considering this description from one of his comments:

What happens if you search for the English letter 'e'. For me, the two Hebrew letters can selected. The same happens when I search for one of these English letters: d, i, n, o, p, r, t, y, I, N, R, T and Y.

the search information seems to have been cached for a document with Latin glyphs. Furthermore, considering this comment on test_en.pdf (a document that shares the same ID):

It has one English line: 'This is a test line'. When I search for "This', I find it. But I can not find the other words.

the text of the original document seems to have started with "This" but continued differently.
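To check for this condition and avoid it when writing files, something like the following could be used. It is a rough sketch under stated assumptions: the regular expression only handles an /ID array written as two hex strings in plain (uncompressed) trailer text, and the helper names `extract_ids`, `group_by_id` and `make_file_id` are hypothetical. The ID recipe (an MD5 digest over the current time, the file name and the file size) follows the general recommendation in the PDF specification, not anything Adobe documents about its search cache.

```python
import hashlib
import re
import time

# Matches /ID[<hex><hex>] with optional whitespace between the parts.
ID_PATTERN = re.compile(rb"/ID\s*\[\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*\]")


def extract_ids(pdf_bytes):
    """Return the (first, second) ID pair of the last /ID entry, or None."""
    matches = ID_PATTERN.findall(pdf_bytes)
    if not matches:
        return None
    first, second = matches[-1]
    return first.decode("ascii").upper(), second.decode("ascii").upper()


def group_by_id(documents):
    """Map each first ID to the names of the documents that carry it."""
    groups = {}
    for name, data in documents.items():
        ids = extract_ids(data)
        if ids:
            groups.setdefault(ids[0], []).append(name)
    return groups


def make_file_id(file_name, file_size, now=None):
    """Build a fresh 16-byte identifier as a 32-digit hex string."""
    now = time.time() if now is None else now
    digest = hashlib.md5()
    digest.update(repr(now).encode("ascii"))
    digest.update(file_name.encode("utf-8"))
    digest.update(str(file_size).encode("ascii"))
    return digest.hexdigest().upper()


# The /ID entry quoted above, shared by both sample files.
shared = b"/ID[<754DC77D28E62763C4916970D595A10F><754DC77D28E62763C4916970D595A10F>]"
groups = group_by_id({"test.pdf": shared, "test_en.pdf": shared})
clashes = {k: v for k, v in groups.items() if len(v) > 1}
print(clashes)
# 1234 is a placeholder file size used only for this illustration.
print(make_file_id("test.pdf", 1234))
```

Any ID that appears in `clashes` is shared by more than one file. Writing a newly generated value into the trailer of each new document should keep viewers from confusing one document with another.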



Source: https://stackoverflow.com/questions/28651280/why-a-pdf-document-with-embedded-fonts-can-be-copied-but-is-not-searchable-in-pd
