wkhtmltopdf generates a different checksum on every run

Question

I'm trying to verify that the output of wkhtmltopdf is the same from run to run. However, every time I run wkhtmltopdf on the same page I get a different hash / checksum. We are talking about something really basic, like this HTML page:

<html>
<body>
<p> This is some text</p>
</body>
</html>

I get a different md5 or sha256 hash every time I run wkhtmltopdf with this simple command line:

./wkhtmltopdf example.html ~/Documents/a.pdf

And a Python hasher of:

import hashlib

def shasum(filename):
    sha = hashlib.sha256()
    with open(filename, 'rb') as f:
        # Read in chunks so large files are not loaded into memory at once
        for chunk in iter(lambda: f.read(128 * sha.block_size), b''):
            sha.update(chunk)
    return sha.hexdigest()

or the md5 version, which just swaps sha256 for md5.
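For reference, the two variants can share one function. This is only a minimal sketch: the name `file_digest` and its `algorithm` parameter are hypothetical and not part of the original post. It relies on `hashlib.new`, which looks up a hash by name.

```python
import hashlib

def file_digest(filename, algorithm="sha256"):
    # hashlib.new accepts algorithm names such as "sha256" or "md5"
    h = hashlib.new(algorithm)
    with open(filename, "rb") as f:
        # Read in chunks, as in the original shasum function
        for chunk in iter(lambda: f.read(128 * h.block_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Calling `file_digest(path)` should give the same result as `shasum(path)` above, and `file_digest(path, "md5")` gives the md5 version.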

Why would wkhtmltopdf generate a file that differs enough to change the checksum, and is there any way to prevent that? Is there some command-line option that can be passed in?

I've tried --default-header, --no-pdf-compression and --disable-smart-shrinking.

This is on Mac OS X, but I've also generated these PDFs on other machines and downloaded them, with the same result.

wkhtmltopdf version = 0.10.0 rc2

Answer 1:

I tried this and opened the resulting PDF in emacs. wkhtmltopdf is embedding a "/CreationDate" field in the PDF. It will be different for every run, and will screw up the hash values between runs.

I didn't see an option to disable the "/CreationDate" field, but it would be simple to strip it out of the file before computing the hash.
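A minimal Python sketch of that idea follows. The function name `normalized_sha256` is hypothetical, and the code assumes the date is stored uncompressed as `/CreationDate (D:...)` and is the only thing that varies between runs. Other PDF generators or versions may embed additional varying fields.

```python
import hashlib
import re

# Assumption: the field looks like "/CreationDate (D:20130910120000)"
CREATION_DATE = re.compile(rb"/CreationDate\s*\(D:[^)]*\)")

def normalized_sha256(filename):
    with open(filename, "rb") as f:
        data = f.read()
    # Remove the creation date in memory; the file on disk is not modified
    data = CREATION_DATE.sub(b"", data)
    return hashlib.sha256(data).hexdigest()
```

Two PDFs that differ only in their creation date should then produce the same digest.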

Answer 2:

I wrote a method to copy the creation date from the expected output to the newly generated file. It's in Ruby, and the arguments can be any objects that walk and quack like IO:

def copy_wkhtmltopdf_creation_date(to, from)
  to_current_pos, from_current_pos = [to.pos, from.pos]
  to.pos = from.pos = 74
  to.write(from.read(14))
  to.pos, from.pos = [to_current_pos, from_current_pos]
end

Answer 3:

I was inspired by Carlos to write a solution that doesn't use a hardcoded index, since in my documents the index differed from Carlos' 74.

Also, I don't have the files open already, so this version takes file paths. It also handles the case where no CreationDate is found by leaving the target file untouched.

def copy_wkhtmltopdf_creation_date(to, from)
  index, date = File.foreach(from).reduce(0) do |acc, line|
    if line.index("CreationDate")
      break [acc + line.index(/\d{14}/), $~[0]]
    else
      acc + line.bytesize
    end
  end

  if date # i.e. a CreationDate was found, so this is a wkhtmltopdf document
    File.open(to, "r+") do |to|
      to.pos = index
      to.write(date)
    end
  end
end

Answer 4:

We solved the problem by stripping the creation date with a simple regex.

preg_replace("/\\/CreationDate \\(D:.*\\)\\n/uim", "", $file_contents, 1);

After doing this we can get a consistent checksum every time.

Source: https://stackoverflow.com/questions/18723374/wkhtmltopdf-generates-a-different-checksum-on-every-run

Tags

python

python-2.7

md5

checksum

wkhtmltopdf