I have run into a strange problem with git and zip files. My build script takes a bunch of documentation HTML pages and zips them into a docs.zip. I then check this file into
Use the script below to create deterministic zip or jar files:
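#!/bin/bash
# Rebuild a zip/jar so that identical contents yield identical bytes.
# Assumes the argument is a bare file name in the current directory.

usage() {
    echo "Usage : ./createDeterministicArtifact.sh <zip/jar file name>"
    exit 1
}

info() {
    echo "$1"
}

strip_artifact() {
    if [ -z "${file}" ]; then
        usage
    fi
    if [ -f "${file}" ] && [ -s "${file}" ]; then
        mkdir -p "${file}.tmp"
        unzip -oq -d "${file}.tmp" "${file}"
        # Pin access and modification times to one fixed value.
        find "${file}.tmp" -follow -exec touch -a -m -t 201912010000.00 {} +
        # Clear platform-specific attributes.
        if [ "$UNAME" = "Linux" ]; then
            find "${file}.tmp" -follow -exec chattr -a {} +
        elif [[ "$UNAME" == CYGWIN* || "$UNAME" == MINGW* ]]; then
            find "${file}.tmp" -follow -exec attrib -A {} +
        fi
        # Remove any earlier output so zip creates a fresh archive
        # instead of updating the old one.
        rm -f "${file}.new"
        # Zip from inside the temp directory; the subshell leaves the
        # caller's working directory unchanged.
        (
            cd "${file}.tmp" || exit 1
            zip -rq -D -X -9 -A --compression-method deflate "../${file}.new" .
        )
        rm -rf "${file}.tmp"
        info "Recreated deterministic artifact: ${file}.new"
    else
        info "Input file is missing or empty. Please validate the file and try again"
    fi
}

# UNAME must be set, otherwise the platform branches above never run.
UNAME="$(uname -s)"
file=$1
strip_artifact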
I had success creating files with the same SHA1 using the -X (--no-extra) flag for zip.
I created a folder and a couple of files to test it and, as expected, got a different SHA1 hash every time I re-zipped:
$ mkdir stuff
$ echo "Stuff 1" > stuff/stuff1.txt
$ echo "Stuff 2" > stuff/stuff2.txt
$ zip -r stuff.zip stuff/
adding: stuff/ (stored 0%)
adding: stuff/stuff1.txt (stored 0%)
adding: stuff/stuff2.txt (stored 0%)
$ shasum stuff.zip
1c8be43ac859bb57603be1243da14022710d22bd stuff.zip
$ shasum stuff.zip
1c8be43ac859bb57603be1243da14022710d22bd stuff.zip
$ zip -r stuff.zip stuff/
updating: stuff/ (stored 0%)
updating: stuff/stuff1.txt (stored 0%)
updating: stuff/stuff2.txt (stored 0%)
$ shasum stuff.zip
73920362d0f7de74d87286502e03e7126fdc0a6a stuff.zip
However, using -X gets me the same hash after consecutive zipping:
$ zip -r -X stuff.zip stuff/
updating: stuff/ (stored 0%)
updating: stuff/stuff1.txt (stored 0%)
updating: stuff/stuff2.txt (stored 0%)
$ shasum stuff.zip
1ed228b16d1ee803f26a8b1419f2eb3bf7fcb9f5 stuff.zip
$ zip -r -X stuff.zip stuff/
updating: stuff/ (stored 0%)
updating: stuff/stuff1.txt (stored 0%)
updating: stuff/stuff2.txt (stored 0%)
$ shasum stuff.zip
1ed228b16d1ee803f26a8b1419f2eb3bf7fcb9f5 stuff.zip
I don't have the time to dig in and find out which extra info causes the difference to pop up in the first case, but maybe this will be helpful to someone trying to solve it. Also, I only tested this on macOS 10.12.6.
According to Wikipedia (http://en.wikipedia.org/wiki/Zip_(file_format)), zip files have headers for "File last modification time" and "File last modification date", so a zip file checked into git will appear to git to have changed whenever the zip is rebuilt from the same content, since the timestamps differ. And it seems that there is no flag to tell zip not to set those headers.
I am resorting to just using tar; it seems to produce the same bytes for the same input when run multiple times.
By default, gzip saves the file name and time stamp:
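Because each zip entry records its file's modification time, one workaround is to pin those times before zipping. The sketch below assumes Info-ZIP `zip` plus POSIX `find` and `touch`; `make_stable_zip` is a hypothetical helper name and the fixed timestamp is arbitrary. Note that it rewrites the modification times of the source files in place.

```shell
#!/bin/bash
# Sketch: build a zip whose bytes depend only on file names and contents.
# make_stable_zip is an illustrative name; the timestamp value is arbitrary.
make_stable_zip() {
    local src_dir="$1" out_zip="$2"
    local out_abs
    # Resolve the output path before changing directory.
    out_abs="$(cd "$(dirname "$out_zip")" && pwd)/$(basename "$out_zip")"
    # zip would otherwise update an existing archive instead of recreating it.
    rm -f "$out_abs"
    (
        cd "$src_dir" || exit 1
        # zip stores local time, so fix the time zone for touch and zip alike.
        export TZ=UTC
        # Pin every modification time to one fixed value.
        find . -exec touch -m -t 201912010000.00 {} +
        # A sorted file list gives a stable entry order; -X drops extra
        # fields, -D skips directory entries, -@ reads names from stdin.
        find . -type f | LC_ALL=C sort | zip -q -X -D "$out_abs" -@
    )
}
```

Calling `make_stable_zip docs docs.zip` twice on unchanged content should then give the same hash both times. Sorting the names is a precaution, since the order in which `zip -r` walks a directory is not guaranteed to be the same on every filesystem.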
%> gzip -help 2>&1 | grep -e '-n'
-N --name save or restore original file name and time stamp
-n --no-name don't save original file name or time stamp
%> gzip -V
Apple gzip 272
Using the -n option:
%> tar cv foo/ | gzip -n > foo.tgz; shasum foo.tgz # sha256sum on Ubuntu
you will consistently get the same hash.
Try the above without -n and you should see a different hash each time.
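The gzip header is only one source of variation: the tar stream itself records each file's modification time and owner, and the entry order follows the directory listing. If GNU tar is available, those can be pinned as well. A minimal sketch, assuming GNU tar 1.28 or later (for `--sort=name`); `make_stable_tgz` is a hypothetical helper name and the timestamp is arbitrary. The tar that ships with macOS is bsdtar and does not accept these flags.

```shell
#!/bin/bash
# Sketch: reproducible .tgz, assuming GNU tar >= 1.28.
# make_stable_tgz is an illustrative name, not an existing tool.
make_stable_tgz() {
    local src_dir="$1" out_tgz="$2"
    # --sort=name     : stable entry order
    # --mtime=@...    : fixed timestamp (1575158400 = 2019-12-01 00:00:00 UTC)
    # --owner/--group : do not record the local user and group
    # gzip -n         : omit the original name and time stamp from the header
    tar --sort=name \
        --mtime='@1575158400' \
        --owner=0 --group=0 --numeric-owner \
        -cf - -C "$(dirname "$src_dir")" "$(basename "$src_dir")" \
        | gzip -n > "$out_tgz"
}
```

With this, `make_stable_tgz foo foo.tgz; shasum foo.tgz` should print the same hash even after the files under foo/ have been touched, as long as their contents and permissions are unchanged.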