Fastest way to check that a PDF is corrupted (Or just missing EOF) in Ruby?

前端 未结 1 990
鱼传尺愫
鱼传尺愫 2021-01-14 13:37

I am looking for a way to check if a PDF is missing an end of file character. So far I have found I can use the pdf-reader gem and catch the MalformedPDFError exception, or

相关标签:
1条回答
  • 2021-01-14 14:03

    TL;DR

    Looking for %%EOF, with or without related structures, is relatively speedy even if you scan the entirety of a reasonably-sized PDF file. However, you can gain a speed boost if you restrict your search to the last kilobyte, or the last 6 or 7 bytes if you simply want to validate that %%EOF\n is the only thing on the last line of a PDF file.

    Note that only a full parse of the PDF file can tell you if the file is corrupted, and only a full parse of the File Trailer can fully validate the trailer's conformance to standards. However, I provide two approximations below that are reasonably accurate and relatively fast in the general case.

    Check Last Kilobyte for File Trailer

    This option is fairly fast, since it only looks at the tail of the file, and uses a string comparison rather than a regular expression match. According to Adobe:

    Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.

    Therefore, the following will work by looking for the file trailer instruction within that range:

    def valid_file_trailer? filename
      File.open filename { |f| f.seek -1024, :END; f.read.include? '%%EOF' }
    end
    

    A Stricter Check of the File Trailer via Regex

    However, the ISO standard is both more complex and a lot more strict. It says, in part:

    The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets (<< … >>) (using LESS-THAN SIGNs (3Ch) and GREATER-THAN SIGNs (3Eh)).

    Without actually parsing the PDF, you won't be able to validate this with perfect accuracy using regular expressions, but you can get close. For example:

    def valid_file_trailer? filename
      pattern = /^startxref\n\d+\n%%EOF\n\z/m
      File.open(filename) { |f| !!(f.read.scrub =~ pattern) }
    end
    
    0 讨论(0)
提交回复
热议问题