Fastest way to check that a PDF is corrupted (Or just missing EOF) in Ruby?

Submitted by 点点圈 on 2019-12-01 11:16:52

TL;DR

Looking for %%EOF, with or without related structures, is relatively speedy even if you scan the entirety of a reasonably-sized PDF file. However, you can gain a speed boost if you restrict your search to the last kilobyte, or to the last 6 or 7 bytes (the 5-byte marker plus a one- or two-byte line ending) if you simply want to validate that the file ends with %%EOF followed by a newline.
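As a rough illustration of the last-few-bytes check, here is a minimal sketch. The method name `ends_with_eof_marker?` is hypothetical, and the sketch assumes that at most one end-of-line sequence (LF, CR, or CRLF) may follow the marker.

```ruby
# Hypothetical helper: true when the file ends with "%%EOF",
# optionally followed by a single end-of-line sequence.
def ends_with_eof_marker?(filename)
  File.open(filename, 'rb') do |f|
    # "%%EOF" is 5 bytes and CRLF adds at most 2, so 7 bytes cover every case.
    # Clamp the offset so files shorter than 7 bytes don't raise Errno::EINVAL.
    f.seek(-[7, f.size].min, IO::SEEK_END)
    f.read.match?(/%%EOF(?:\r\n|\r|\n)?\z/)
  end
end
```

Because it reads a fixed handful of bytes, the cost does not grow with file size, but it says nothing about the rest of the trailer.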

Note that only a full parse of the PDF file can tell you if the file is corrupted, and only a full parse of the File Trailer can fully validate the trailer's conformance to standards. However, I provide two approximations below that are reasonably accurate and relatively fast in the general case.

Check Last Kilobyte for File Trailer

This option is fairly fast, since it only looks at the tail of the file, and uses a string comparison rather than a regular expression match. According to Adobe:

Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.

Therefore, the following will work by looking for the %%EOF marker within that range:

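def valid_file_trailer?(filename)
  # Clamp the offset so files shorter than 1024 bytes don't raise Errno::EINVAL.
  tail = File.open(filename, 'rb') { |f| f.seek(-[1024, f.size].min, IO::SEEK_END); f.read }
  tail.include?('%%EOF')
end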

A Stricter Check of the File Trailer via Regex

However, the ISO standard is both more complex and a lot more strict. It says, in part:

The last line of the file shall contain only the end-of-file marker, %%EOF. The two preceding lines shall contain, one per line and in order, the keyword startxref and the byte offset in the decoded stream from the beginning of the file to the beginning of the xref keyword in the last cross-reference section. The startxref line shall be preceded by the trailer dictionary, consisting of the keyword trailer followed by a series of key-value pairs enclosed in double angle brackets (<< … >>) (using LESS-THAN SIGNs (3Ch) and GREATER-THAN SIGNs (3Eh)).

Without actually parsing the PDF, you won't be able to validate this with perfect accuracy using regular expressions, but you can get close. For example:

def valid_file_trailer?(filename)
  # Assumes LF line endings and a final newline after %%EOF.
  pattern = /^startxref\n\d+\n%%EOF\n\z/
  File.read(filename).scrub.match?(pattern)
end
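The two ideas can also be combined: apply the stricter pattern to the tail of the file only, and accept CR, LF, or CRLF line endings. The following is a sketch under those assumptions, not a conformance check. The names `TRAILER_TAIL` and `strict_file_trailer?` and the 1024-byte default window are hypothetical choices, and the trailer dictionary itself is not examined.

```ruby
# Hypothetical pattern: "startxref", a byte offset, and "%%EOF", each on its
# own line, with any PDF end-of-line sequence and an optional final one.
TRAILER_TAIL = /
  (?:\A|[\r\n]) startxref (?:\r\n|\r|\n)   # keyword alone on its line
  \d+ (?:\r\n|\r|\n)                       # offset of the last xref section
  %%EOF (?:\r\n|\r|\n)? \z                 # marker, optional EOL, end of file
/x

def strict_file_trailer?(filename, window = 1024)
  File.open(filename, 'rb') do |f|
    # Read at most the last `window` bytes; clamp for short files.
    f.seek(-[window, f.size].min, IO::SEEK_END)
    f.read.match?(TRAILER_TAIL)
  end
end
```

Reading in binary mode avoids the need for `scrub`, since every byte is valid in a binary string and the pattern contains only ASCII characters.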