问题
I'm trying to verify that the content generated from wkhtmltopdf is the same from run to run, however every time I run wkhtmltopdf I get a different hash / checksum value against the same page. We are talking something real basic like using an html page of:
<html>
<body>
<p> This is some text</p>
</body
</html>
I get a different md5 or sha256 hash every time I run wkhtmltopdf using an amazing line of:
./wkhtmltopdf example.html ~/Documents/a.pdf
And using a python hasher of:
def shasum(filename):
sha = hashlib.sha256()
with open(filename,'rb') as f:
for chunk in iter(lambda: f.read(128*sha.block_size), b''):
sha.update(chunk)
return sha.hexdigest()
or the md5 version which just swaps sha256 with md5
Why would wkhtmltopdf generate a different file enough to cause a different checksum, and is there any way to not do that? some command line that can be passed in to prevent this?
I've tried --default-header, --no-pdf-compression and --disable-smart-shrinking
This is on a MAC osx but I've generated these pdf's on other machines and downloaded them with the same result.
wkhtmltopdf version = 0.10.0 rc2
回答1:
I tried this and opened the resulting PDF in emacs. wkhtmltopdf is embedding a "/CreationDate" field in the PDF. It will be different for every run, and will screw up the hash values between runs.
I didn't see an option to disable the "/CreationDate" field, but it would be simple to strip it out of the file before computing the hash.
回答2:
I wrote a method to copy the creation date from the expected output to the current generated file. It's in Ruby and the arguments are any class that walk and quack like IO:
def copy_wkhtmltopdf_creation_date(to, from)
to_current_pos, from_current_pos = [to.pos, from.pos]
to.pos = from.pos = 74
to.write(from.read(14))
to.pos, from.pos = [to_current_pos, from_current_pos]
end
回答3:
I was inspired by Carlos to write a solution that doesn't use a hardcoded index, since in my documents the index differed from Carlos' 74.
Also, I don't have the files open already. And I handle the case of returning early when no CreationDate
is found.
def copy_wkhtmltopdf_creation_date(to, from)
index, date = File.foreach(from).reduce(0) do |acc, line|
if line.index("CreationDate")
break [acc + line.index(/\d{14}/), $~[0]]
else
acc + line.bytesize
end
end
if date # IE, yes this is a wkhtmltopdf document
File.open(to, "r+") do |to|
to.pos = index
to.write(date)
end
end
end
回答4:
We solved the problem by stripping the creation date with a simple regex.
preg_replace("/\\/CreationDate \\(D:.*\\)\\n/uim", "", $file_contents, 1);
After doing this we can get a consistent checksum every time.
来源:https://stackoverflow.com/questions/18723374/wkhtmltopdf-generates-a-different-checksum-on-every-run