I am looking for opinions on how to handle large binary files on which my source code (a web application) depends. We are currently discussing several alternatives.
In my opinion, if you're likely to modify those large files often, or if you intend to run git clone or git checkout a lot, then you should seriously consider using another Git repository (or maybe another way to access those files).

But if you work like we do, and if your binary files are not modified often, then the first clone/checkout will be slow, but after that it should be as fast as you want (provided your users keep using the first repository they cloned).
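For example, one way to keep the binaries in a separate repository while still referencing them from the main one is a plain submodule. This is only a sketch; the URL and path below are placeholders:

```bash
# Keep the large binaries in their own repository and pull them in as a submodule
# (URL and path are placeholders).
git submodule add https://example.com/big-assets.git assets
git commit -m "Reference binary assets as a submodule"

# Colleagues who actually need the binaries then run:
git submodule update --init assets
```

People who never touch the binaries can simply skip the submodule update and avoid the download entirely.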
Have a look at git bup, a Git extension to smartly store large binaries in a Git repository.
You'd want to have it as a submodule, but you won't have to worry about the repository becoming hard to handle. One of their sample use cases is storing VM images in Git.
I haven't actually seen better compression rates, but my repositories don't have really large binaries in them.
Your mileage may vary.
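For a feel of the workflow, here is a minimal sketch of storing a directory of large binaries with bup in a side repository; the paths and names are placeholders, not something taken from the answer above:

```bash
# Minimal bup sketch (paths and names are placeholders).
export BUP_DIR=~/big-assets.bup       # bup keeps its data in a bare Git-format repository
bup init                              # initialise the bup repository
bup index ~/project/assets            # index the directory containing the large binaries
bup save -n assets ~/project/assets   # save a deduplicated snapshot under the name "assets"
bup ls assets/latest                  # list the contents of the most recent snapshot
```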
git clone --filter from Git 2.19, plus shallow clones

This new option might eventually become the final solution to the binary file problem, if the Git and GitHub devs make it user-friendly enough (which they arguably still haven't achieved for submodules, for example).

It allows you to actually fetch only the files and directories you want from the server, and it was introduced together with a remote protocol extension. With this, we could first do a shallow clone, and then automate which blobs to fetch with the build system for each type of build.

There is even already a --filter=blob:limit=<size> option, which limits the maximum blob size to fetch.

I have provided a minimal detailed example of what the feature looks like at: How do I clone a subdirectory only of a Git repository?
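To make that concrete, here is a rough sketch of combining a shallow clone with a blob-size filter. The URL, size limit and paths are placeholders, it requires a server that supports partial clone, and the exact behaviour depends on your Git version:

```bash
# Shallow + partial clone: blobs over 1 MB are not downloaded up front
# (URL, limit and paths are placeholders; the server must support partial clone).
git clone --depth 1 --filter=blob:limit=1m --no-checkout https://example.com/repo.git
cd repo

# Populate only the paths a given build actually needs; any missing blobs
# under those paths are fetched on demand at this point.
git checkout HEAD -- src/ config/
```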
You can also use git-fat. I like that it depends only on stock Python and rsync. It also supports the usual Git workflow, with the following self-explanatory commands:
git fat init
git fat push
git fat pull
In addition, you need to check a .gitfat file into your repository and modify your .gitattributes to specify the file extensions you want git fat to manage.

You add a binary using the normal git add, which in turn invokes git fat based on your gitattributes rules.

Finally, it has the advantage that the location where your binaries are actually stored can be shared across repositories and users, and it supports anything rsync does.
UPDATE: Do not use git-fat if you're using a Git-SVN bridge. It will end up removing the binary files from your Subversion repository. However, if you're using a pure Git repository, it works beautifully.
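If I recall git-fat's configuration format correctly, the setup looks roughly like the sketch below; the rsync host, path and file extensions are placeholders:

```bash
# Sketch of a git-fat setup; host, path and extensions are placeholders.
cat > .gitfat <<'EOF'
[rsync]
remote = storage.example.com:/srv/git-fat-store
EOF

# Tell Git which files git-fat should manage
echo '*.psd filter=fat -crlf' >> .gitattributes
echo '*.zip filter=fat -crlf' >> .gitattributes

git fat init                 # install the fat filter in this clone
git add .gitfat .gitattributes assets/big-texture.psd
git commit -m "Track large assets with git-fat"
git fat push                 # upload the actual binary contents via rsync
```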
The solution I'd like to propose is based on orphan branches and a slight abuse of the tag mechanism, henceforth referred to as Orphan Tags Binary Storage (OTABS).

TL;DR 12-01-2017: If you can use GitHub's LFS or some other third party, by all means you should. If you can't, then read on. Be warned, this solution is a hack and should be treated as such.

Desirable properties of OTABS

- git pull and git fetch, including git fetch --all, are still bandwidth efficient, i.e. not all large binaries are pulled from the remote by default.

Undesirable properties of OTABS

- It makes git clone potentially inefficient (but not necessarily, depending on your usage). If you deploy this solution you might have to advise your colleagues to use git clone -b master --single-branch <url> instead of git clone. This is because git clone by default literally clones the entire repository, including things you wouldn't normally want to waste your bandwidth on, like unreferenced commits. Taken from SO 4811434.
- It makes git fetch <remote> --tags bandwidth inefficient, but not necessarily storage inefficient. You can always advise your colleagues not to use it.
- You'll have to resort to the standard git gc trick (described below) to clean your repository of any files you don't want any more.

Adding the Binary Files
Before you start, make sure that you've committed all your changes, your working tree is up to date, and your index doesn't contain any uncommitted changes. It might be a good idea to push all your local branches to your remote (GitHub etc.) in case any disaster should happen.
1. Create a new orphan branch: git checkout --orphan binaryStuff will do the trick. This produces a branch that is entirely disconnected from any other branch, and the first commit you'll make in this branch will have no parent, which will make it a root commit.
2. Clean your index using git rm --cached * .gitignore.
3. Delete your working tree using rm -fr * .gitignore. The internal .git directory will stay untouched, because the * wildcard doesn't match it.
4. Copy in your VeryBigBinary.exe, then git add it and commit it on the orphan branch.
5. If you pushed this orphan branch to the remote, all your colleagues would download it the next time they invoke git fetch, clogging their connection. You can avoid this by pushing a tag instead of a branch. This can still impact your colleagues' bandwidth and filesystem storage if they have a habit of typing git fetch <remote> --tags, but read on for a workaround. Go ahead and git tag 1.0.0bin.
6. Push the tag with git push <remote> 1.0.0bin.
7. Now you can safely delete your local orphan branch with git branch -D binaryStuff. Your commit will not be marked for garbage collection, because the orphan tag 1.0.0bin pointing at it is enough to keep it alive.
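Put together, the whole sequence might look roughly like this sketch; the remote name (origin), the branch name master, and the path to the binary are assumptions for illustration, not something prescribed above:

```bash
# Sketch of the full "Adding the Binary Files" sequence described above.
# Assumptions: the remote is called origin and your main branch is master.
git checkout --orphan binaryStuff          # parentless branch (step 1)
git rm -r --cached .                       # empty the index (recursive variant of step 2)
rm -fr * .gitignore                        # empty the working tree; .git survives (step 3)
cp /some/path/VeryBigBinary.exe .          # placeholder path for your binary (step 4)
git add VeryBigBinary.exe
git commit -m "Add VeryBigBinary.exe"
git tag 1.0.0bin                           # orphan tag pointing at the root commit (step 5)
git push origin 1.0.0bin                   # publish the tag, not the branch (step 6)
git checkout master                        # back to normal work
git branch -D binaryStuff                  # the tag keeps the commit alive (step 7)
```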
Checking out the Binary File

1. To get VeryBigBinary.exe into your current working tree, run git checkout 1.0.0bin -- VeryBigBinary.exe.
2. This will fail if you don't have the orphan tag 1.0.0bin downloaded, in which case you'll have to git fetch <remote> 1.0.0bin beforehand.
3. You can add VeryBigBinary.exe to your master's .gitignore, so that no-one on your team will pollute the main history of the project with the binary by accident.
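For example, assuming the remote is called origin:

```bash
# Fetch just the orphan tag if you don't already have it (remote name assumed to be origin)
git fetch --no-tags origin tag 1.0.0bin
# Copy the binary into the current working tree without switching branches
git checkout 1.0.0bin -- VeryBigBinary.exe
```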
Completely Deleting the Binary File

If you decide to completely purge VeryBigBinary.exe from your local repository, your remote repository and your colleagues' repositories, you can:

1. Delete the orphan tag on the remote: git push <remote> :refs/tags/1.0.0bin
2. Delete the tag locally (this deletes all other unreferenced tags as well): git tag -l | xargs git tag -d && git fetch --tags. Taken from SO 1841341 with slight modification.
3. Use a git gc trick to delete your now unreferenced commit locally: git -c gc.reflogExpire=0 -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 -c gc.rerereunresolved=0 -c gc.pruneExpire=now gc "$@". It will also delete all other unreferenced commits. Taken from SO 1904860.
4. If possible, repeat the git gc trick on the remote as well, and advise your colleagues to use git clone -b master --single-branch <url> instead of git clone.
5. To publish a new version of the binary, repeat the steps from Adding the Binary Files with a new orphan tag, e.g. 2.0.0bin. If you're worried about your colleagues typing git fetch <remote> --tags, you can actually name it 1.0.0bin again. This will make sure that the next time they fetch all the tags, the old 1.0.0bin will be unreferenced and marked for subsequent garbage collection (using step 3). When you try to overwrite a tag on the remote you have to use -f, like this: git push -f <remote> <tagname>
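As a rough convenience sketch, steps 1-3 can be run in one go (origin is an assumed remote name):

```bash
# Steps 1-3 of "Completely Deleting the Binary File" as one sketch
# (assumes the remote is called origin).
git push origin :refs/tags/1.0.0bin                 # 1. delete the orphan tag on the remote
git tag -l | xargs git tag -d && git fetch --tags   # 2. drop all local tags, re-fetch surviving ones
git -c gc.reflogExpire=0 -c gc.reflogExpireUnreachable=0 \
    -c gc.rerereresolved=0 -c gc.rerereunresolved=0 \
    -c gc.pruneExpire=now gc                        # 3. garbage-collect unreferenced commits
```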
Afterword
OTABS doesn't touch your master or any other source code/development branches. The commit hashes, all of the history, and the small size of these branches are unaffected. If you've already bloated your source code history with binary files, you'll have to clean it up as a separate piece of work. This script might be useful.
Confirmed to work on Windows with git-bash.
It is a good idea to apply a set of standard tricks to make the storage of binary files more efficient. Running git gc frequently (without any additional arguments) makes Git optimise the underlying storage of your files by using binary deltas. However, if your files are unlikely to stay similar from commit to commit, you can switch off binary deltas altogether. Additionally, because it makes no sense to compress already compressed or encrypted files, like .zip, .jpg or .crypt, Git allows you to switch off compression of the underlying storage. Unfortunately it's an all-or-nothing setting that affects your source code as well.
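Concretely, that tuning could look like the following sketch; the file extensions are placeholders for whatever your binaries actually are:

```bash
# Re-pack and delta-compress the object store from time to time
git gc

# In .gitattributes: don't attempt delta compression for formats that are
# already compressed (extensions are placeholders)
echo '*.zip -delta' >> .gitattributes
echo '*.jpg -delta' >> .gitattributes

# All-or-nothing switch: disable zlib compression for the whole repository,
# which also affects your source code
git config core.compression 0
```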
You might want to script up parts of OTABS to allow for quicker usage. In particular, scripting steps 2-3 from Completely Deleting the Binary File into an update git hook could give compelling but perhaps dangerous semantics to git fetch ("fetch and delete everything that is out of date").
You might want to skip step 4 of Completely Deleting the Binary File to keep a full history of all binary changes on the remote, at the cost of central repository bloat. Local repositories will stay lean over time.
In the Java world it is possible to combine this solution with maven --offline to create a reproducible offline build stored entirely in your version control (it's easier with Maven than with Gradle). In the Go world it is feasible to build on this solution to manage your GOPATH instead of go get. In the Python world it is possible to combine this with virtualenv to produce a self-contained development environment without relying on PyPI servers for every build from scratch.
If your binary files change very often, like build artifacts, it might be a good idea to script a solution which stores the 5 most recent versions of the artifacts in orphan tags monday_bin, tuesday_bin, ..., friday_bin, plus an orphan tag for each release, 1.7.8bin, 2.0.0bin, etc. You can rotate the weekday_bin tags and delete old binaries daily; a sketch follows below. This way you get the best of both worlds: you keep the entire history of your source code but only the relevant history of your binary dependencies. It is also very easy to get the binary files for a given tag without getting the entire source code with all its history: git init && git remote add <name> <url> && git fetch <name> <tag> should do it for you.
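A daily rotation along those lines could be scripted roughly as follows; the remote name, artifact path, and branch naming are assumptions for illustration:

```bash
# Rough sketch of a daily artifact rotation using orphan tags
# (remote, paths and branch names are placeholders).
day=$(date +%A | tr '[:upper:]' '[:lower:]')       # e.g. "monday"
git checkout --orphan "binaries-${day}"
git rm -r --cached . && rm -fr * .gitignore        # empty index and working tree
cp /build/output/app.war .                         # placeholder for your build artifact
git add app.war && git commit -m "Artifacts for ${day}"
git tag -f "${day}_bin"                            # overwrite last week's tag of the same name
git push -f origin "${day}_bin"                    # force-push, as with re-using 1.0.0bin above
git checkout master && git branch -D "binaries-${day}"
```

Colleagues then garbage-collect the superseded binaries with the same steps 2-3 used for complete deletion.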
> I am looking for opinions of how to handle large binary files on which my source code (web application) is dependent. What are your experiences/thoughts regarding this?
I personally have run into synchronisation failures with Git on some of my cloud hosts once my web application's binary data notched above the 3 GB mark. I considered BFG Repo-Cleaner at the time, but it felt like a hack. Since then I've begun to just keep files outside of Git's purview, instead leveraging purpose-built tools such as Amazon S3 for managing files, versioning and backup.
> Does anybody have experience with multiple Git repositories and managing them in one project?
Yes. Hugo themes are primarily managed this way. It's a little kludgy, but it gets the job done.
My suggestion is to choose the right tool for the job. If it's for a company and you're managing your code on GitHub, pay the money and use Git LFS. Otherwise you could explore more creative options such as decentralized, encrypted file storage using a blockchain.
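For reference, the basic Git LFS setup is only a few commands; the tracked pattern and file name below are placeholders:

```bash
# Basic Git LFS setup; the tracked pattern and file are placeholders.
git lfs install                      # one-time setup per machine
git lfs track "*.psd"                # writes the pattern to .gitattributes
git add .gitattributes big-image.psd
git commit -m "Track large images with Git LFS"
git push origin master               # binaries go to the LFS store, pointers stay in Git
```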
Additional options to consider include Minio and s3cmd.