问题
I am writing a pdf files with embedded subset fonts. As required, I am including the ToUnicode and CIDSet objects. To test, I created a simple PDF with two Hebrew characters. I can select the two characters and copy to the clipboard, and paste it properly into another application such as Word. But I am not able to search for a word containing these two characters. Adobe Reader (or Acrobat) displays the message that the word was not found. So in essence, I have created a PDF document which can be copied properly, but is not searchable. Any idea what I might be missing when creating the document?
Additional information: 1. The file in question is a minimal file with just two characters. I have tested with many such files in many different languages including English. None of the files are searchable. 2. Curiously, if I search for the letter 'e', Adobe reader highlights an incorrect word, even if the letter 'e' does not exists in the file. 3. Adobe acrobat is also not able to search within this file, however when I save the file to another disk file, the saved file now is searchable. I confirmed that the major objects such as the font-file, ToUnicode object, CID object, and the font description objects are the same in the saved file. However, one of the font object is brought up closer to the top of the file. 4. FoxIt is able to search these files properly.
Relevant PDF objects:
5 0 obj
<>
stream
q 0.750000 0 0 0.750000 0.000000 792.000000 cm
q q q 0.160000 0.000000 0.000000 0.160000 0.000000 0.000000 cm
BT /F0 100.000000 Tf 0 g 750.000000 -690 Td[<02B0>] TJ 35.000000 0 Td[<02B9>] TJ ET Q
Q
Q
Q
endstream
endobj
10 0 obj
<>
endobj
11 0 obj
<> /FontDescriptor 10 0 R/Subtype/CIDFontType2/Type/Font>>
endobj
12 0 obj
<>
endobj
8 0 obj
<>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
3 beginbfchar
<0000> <0000>
<02B0> <05E0>
<02B9> <05E9>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream
endobj
回答1:
In short
The problem is due to identical PDF IDs used for different documents.
In detail
Adobe Reader / Acrobat seem to cache search information for documents identifying the document by its ID. Some of the OP's documents seem to have the same ID, at least the two sample files do:
/ID[<754DC77D28E62763C4916970D595A10F><754DC77D28E62763C4916970D595A10F>]
Thus, search information from earlier viewed PDFs with that ID was used when the OP tried to search his test.pdf. Considering this description from one of his comments:
What happens if you search for the English letter 'e'. For me, the two Hebrew letters can selected. The same happens when I search for one of these English letters: d, i, n, o, p, r, t, y, I, N, R, T and Y.
the search information seems to have been cached for a document with Latin glyphs, Furthermore, considering this comment on test_en.pdf (a document sharing the same ID, too):
It has one English line: 'This is a test line'. When I search for "This', I find it. But I can not find the other words.
the text of the original document seems to have started with "This" but continued differently.
来源:https://stackoverflow.com/questions/28651280/why-a-pdf-document-with-embedded-fonts-can-be-copied-but-is-not-searchable-in-pd