Extract text from pdf using zlib

野的像风 2021-01-26 16:37

I am using that function to find a piece of text in a PDF file and replace it with other text. The problem is that when I inflate a stream, change the text, and deflate it again, in the

3 Answers
  • 2021-01-26 17:14

    I think you need to read up on how text is stored inside a PDF file.

    Here is a link to the specification: http://www.adobe.com/devnet/pdf/pdf_reference.html

    Section 9, "Text", is the key to understanding this.

  • 2021-01-26 17:15

    Your assumption that text can be found literally in a content stream is wrong.

    Suppose that you have a PDF with content Hello World. Then you could have a stream that looks like this:

    q
    BT
    36 806 Td
    0 -18 Td
    /F1 12 Tf
    (Hello World!)Tj
    0 0 Td
    ET
    Q
    

    But it can also look like this:

    q
    BT
    /F1 12 Tf
    88.66 367 Td
    (ld) Tj
    -22 0 Td
    (Wor) Tj
    -15.33 0 Td
    (llo) Tj
    -15.33 0 Td
    (He) Tj
    ET
    Q
    

    Your code will detect the word "Hello" in the former stream, but will miss it in the latter one.

    A PDF viewer will render both streams in exactly the same way: you'll see "Hello World" at the exact same position.

    Strings are sometimes broken into smaller pieces, and you'll often find text arrays that introduce kerning, and so on. This is all standard practice in PDF.
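To make the difference concrete, here is a minimal Python sketch (not the asker's Objective-C code) that runs a literal search over the text parts of the two example streams. The helper `shown_strings` and its regular expression are hypothetical and deliberately simplified: they ignore escaped or nested parentheses, hex strings, and `TJ` text arrays.

```python
import re

# Matches a literal string operand followed by the Tj operator, e.g. "(He) Tj".
# Assumption: no escaped or nested parentheses, no hex strings, no TJ arrays.
SHOW_TEXT = re.compile(rb"\(([^()\\]*)\)\s*Tj")

def shown_strings(content: bytes) -> list:
    """Return the Tj string operands in the order they appear in the stream."""
    return SHOW_TEXT.findall(content)

# Text parts of the two example content streams from this answer.
one_piece = b"BT\n36 806 Td\n0 -18 Td\n/F1 12 Tf\n(Hello World!)Tj\n0 0 Td\nET\n"
fragments = (b"BT\n/F1 12 Tf\n88.66 367 Td\n(ld) Tj\n-22 0 Td\n(Wor) Tj\n"
             b"-15.33 0 Td\n(llo) Tj\n-15.33 0 Td\n(He) Tj\nET\n")

print(shown_strings(one_piece))
print(shown_strings(fragments))
print(b"Hello" in one_piece, b"Hello" in fragments)
```

The second stream never contains the bytes `Hello`, so a byte-level search and replace cannot find the word there. Reassembling words would mean tracking the text position set by each `Td`, which this sketch does not attempt.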

    PDF isn't a format that is suited for editing. I'm not saying it's impossible, but you're looking at a couple of weeks of extra programming if you want to meet your requirement of being able to replace one String with another one in a PDF stream.

  • 2021-01-26 17:24

    There are multiple issues in your code, the effects of which are visible in the sample newpdf.pdf you provided in a comment to Bruno's answer:

    1. After you write your re-compressed stream to the output file, you add "\nendstr" and advance by the size of this string, 7 characters, beyond the end of the source stream in the input buffer, most likely to avoid treating the "stream" in "endstream" as the start of the next stream:

      [self saveFile:(char *)"\nendstr" len:7 fileName:[destFile cStringUsingEncoding:NSUTF8StringEncoding]];
      [...]
      buffer += streamend + 7;
      

      The issue in adding that string is that you assume that the "endstream" in the input buffer is preceded by exactly one NEWLINE (0x0A) byte. This assumption is wrong because

      a. in PDF there are three valid end-of-line markers: a single LINE FEED (0x0A), a single CARRIAGE RETURN (0x0D), or a CARRIAGE RETURN and LINE FEED pair (0x0D 0x0A). Any one of them may precede the "endstream" in the input buffer. In the code further above, where you calculate the end of the compressed stream, you ignore the single CARRIAGE RETURN variety, and here you ignore the two-byte variety; and furthermore:

      b. the PDF specification does not even require, but merely recommends, an end-of-line marker between the end of the stream data and the "endstream" keyword, cf. section 7.3.8.1:

      There should be an end-of-line marker after the data and before endstream

      This already breaks the first stream in your sample file in which the source file does not have an end-of-line marker there and your result, therefore, replaces the original "endstream" with a "\nendstram". This actually happens fairly often in your sample.

    2. You completely ignore that a PDF stream in its dictionary contains an entry containing the length of the stream, cf. section 7.3.8.2 in the PDF specification:

      Every stream dictionary shall have a Length entry that indicates how many bytes of the PDF file are used for the stream’s data.

      Your manipulation, even if you only decompress and recompress, is likely to change the length of the compressed stream. Thus, you have to update that Length entry. This admittedly makes your task somewhat more difficult, as that dictionary comes before the stream. Furthermore, in cases like your source file, that entry might not even contain the value directly but instead reference an indirect object somewhere else in the file.

      This breaks the second stream in your file which claims it is 8150 bytes long but instead is some 200 bytes longer. Any PDF viewer may assume the content of that stream in your file is only 8150 bytes long and, thus, ignore the contents of those trailing 200 bytes. This may very well be the reason why you observed that

      some text or graphics are missing.

    3. You completely ignore that a PDF has a cross reference table or stream (or possibly even a chain of them), cf. section 7.5.4 in the PDF specification:

      The cross-reference table contains information that permits random access to indirect objects within the file so that the entire file need not be read to locate any particular object. The table shall contain a one-line entry for each indirect object, specifying the byte offset of that object within the body of the file. (Beginning with PDF 1.5, some or all of the cross-reference information may alternatively be contained in cross-reference streams; see 7.5.8, "Cross-Reference Streams.")

      Your manipulation, even if you only decompress and recompress, is likely to change the length of the compressed stream. Thus, you have to update the offsets of all following objects in the cross reference table.

      Since the size of the second stream in your result file already differs, only very few cross reference entries in that file are correct.

    4. You assume that every PDF stream is deflated. This assumption is wrong, cf. table 5 in the PDF specification.

      Your code essentially drops all streams it cannot inflate. This may also be a reason why you observed that

      some text or graphics are missing.

    5. You assume that the sequence "stream" in a PDF unambiguously indicates the start of a stream. This is wrong, that sequence may easily be used in other contexts, too.

    6. You assume that the first sequence "endstream" in a PDF after the start of a stream unambiguously indicates the end of that stream. This is wrong, that sequence may also be part of the stream content. You have to use the value of the Length entry in the stream dictionary.
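The points above can be sketched in Python under strong simplifying assumptions. The helpers `read_stream` and `replace_text` are hypothetical, not part of any library. They assume a flat stream dictionary with a direct integer Length, a single FlateDecode filter without decode parameters, and a known offset for the dictionary start (a real reader would take it from the cross reference table rather than search for keywords).

```python
import re
import zlib

def read_stream(pdf: bytes, dict_start: int):
    """Return (dictionary, data_start, data) for the stream whose dictionary
    begins at dict_start. Simplified: flat dictionary, direct integer /Length."""
    dict_end = pdf.index(b">>", dict_start) + 2
    dictionary = pdf[dict_start:dict_end]
    m = re.search(rb"/Length\s+(\d+)(\s+\d+\s+R)?", dictionary)
    if m is None or m.group(2):
        raise ValueError("missing or indirect /Length; resolve it via the xref first")
    data_start = pdf.index(b"stream", dict_end) + len(b"stream")
    # The stream keyword is followed by CR LF or LF (a lone CR is not allowed here).
    if pdf.startswith(b"\r\n", data_start):
        data_start += 2
    elif pdf.startswith(b"\n", data_start):
        data_start += 1
    # Take exactly /Length bytes instead of searching for "endstream".
    length = int(m.group(1))
    return dictionary, data_start, pdf[data_start:data_start + length]

def replace_text(pdf: bytes, dict_start: int, old: bytes, new: bytes) -> bytes:
    """Inflate, edit, deflate one stream and rewrite its /Length entry."""
    dictionary, data_start, data = read_stream(pdf, dict_start)
    if b"/FlateDecode" not in dictionary:
        return pdf  # other filter or none: copy unchanged, never drop the stream
    packed = zlib.compress(zlib.decompress(data).replace(old, new))
    new_dict = re.sub(rb"(/Length\s+)\d+",
                      lambda m: m.group(1) + str(len(packed)).encode(),
                      dictionary, count=1)
    dict_end = dict_start + len(dictionary)
    # Everything after the data, including any EOL and "endstream", is copied as-is.
    return (pdf[:dict_start] + new_dict + pdf[dict_end:data_start]
            + packed + pdf[data_start + len(data):])

# Hypothetical one-object sample with no end-of-line marker before "endstream".
packed_original = zlib.compress(b"BT /F1 12 Tf (Hello World!) Tj ET")
sample = (b"4 0 obj\n<< /Length " + str(len(packed_original)).encode()
          + b" /Filter /FlateDecode >>\nstream\r\n" + packed_original
          + b"endstream\nendobj\n")

result = replace_text(sample, sample.index(b"<<"), b"Hello", b"Goodbye")
print(len(result) - len(sample))  # every later object's xref offset moves by this
```

Copying the bytes after the stream data unchanged avoids guessing which end-of-line marker, if any, precedes "endstream". The printed difference is the amount by which all following cross reference offsets would have to be adjusted, which this sketch leaves undone.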

    Furthermore, you seem to assume that every stream you come across is still used in the resulting PDF. This does not need to be the case. Especially in the case of incremental updates (cf. section 7.5.6 in the PDF specification), there may be many objects in the file that are no longer in use. While this does not necessarily break the syntax of the result file, your changes (if they depend on each other) are semantically incorrect.
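As a rough illustration of the incremental update case, the following Python sketch scans a hypothetical toy file for object headers. The bytes in `toy` are made up for this example and omit the cross reference sections and trailers a real file would have; `object_offsets` is likewise a hypothetical helper.

```python
import re
from collections import defaultdict

# An object header such as "4 0 obj" at the start of a line.
# Assumption: in real files the same bytes could also occur inside stream data.
OBJ_HEADER = re.compile(rb"(?m)^(\d+) (\d+) obj\b")

def object_offsets(pdf: bytes) -> dict:
    """Map each object number to the byte offsets of all of its headers."""
    found = defaultdict(list)
    for m in OBJ_HEADER.finditer(pdf):
        found[int(m.group(1))].append(m.start())
    return dict(found)

# Toy file: the original body, then an appended update that redefines object 4.
toy = (b"%PDF-1.4\n"
       b"4 0 obj\n<< /Length 5 >>\nstream\nold!!\nendstream\nendobj\n"
       b"%%EOF\n"
       b"4 0 obj\n<< /Length 5 >>\nstream\nnew!!\nendstream\nendobj\n"
       b"%%EOF\n")

offsets = object_offsets(toy)
print(offsets)
print(len(offsets[4]))  # two definitions of object 4, only the newer one is in use
```

A front-to-back scan visits both definitions of object 4, while a PDF viewer follows the most recent cross reference section and only ever sees the second one.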
