I am looking for opinions on how to handle large binary files on which my source code (a web application) is dependent. We are currently discussing several alternatives.
SVN seems to handle binary deltas more efficiently than Git.
I had to decide on a versioning system for documentation (JPEG files, PDF files, and .odt files). I just tested adding a JPEG file and rotating it 90 degrees four times (to check the effectiveness of binary deltas). Git's repository grew by 400%; SVN's repository grew by only 11%.
So it looks like SVN is much more efficient with binary files. My choice, then, is Git for source code and SVN for binary files like documentation.
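For anyone who wants to reproduce the Git side of that test, here is a rough sketch (photo.jpg is a placeholder image, and it assumes jpegtran from libjpeg is installed):

$ git init jpeg-test && cd jpeg-test
$ cp /path/to/photo.jpg . && git add photo.jpg && git commit -m 'initial image'
$ jpegtran -rotate 90 -outfile tmp.jpg photo.jpg && mv tmp.jpg photo.jpg
$ git commit -am 'rotate 90 degrees'  ## repeat the previous two lines three more times
$ git gc                              ## repack before measuring
$ du -sh .git                         ## compare with the size of the SVN repository after the same edits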
I would use submodules (as Pat Notz suggests) or two distinct repositories. If you modify your binary files too often, then I would try to minimize the impact of the huge repository by cleaning its history:
I had a very similar problem several months ago: ~21 GB of MP3 files, unclassified (bad names, bad ID3 tags, not knowing whether I even liked a given MP3 file...), and replicated on three computers.
I used an external hard disk drive with the main Git repository, and I cloned it onto each computer. Then, I started to classify them in the usual way (pushing, pulling, merging... deleting and renaming many times).
At the end, I had only ~6 GB of MP3 files and ~83 GB in the .git directory. I used git-write-tree and git-commit-tree to create a new commit without any commit ancestors, and started a new branch pointing to that commit. The "git log" for that branch only showed one commit.
Then I deleted the old branch, kept only the new branch, deleted the ref-logs, and ran "git prune": after that, my .git folders weighed only ~6 GB...
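The sequence was roughly the following (the branch names are placeholders, and it assumes a clean working tree whose index matches the last commit):

$ TREE=$(git write-tree)                 ## write the current index as a tree object
$ COMMIT=$(echo 'collapse history' | git commit-tree $TREE)  ## new commit with no parents
$ git branch fresh-start $COMMIT         ## new branch pointing at the root commit
$ git checkout fresh-start
$ git branch -D old-branch               ## delete the old branch (placeholder name)
$ git reflog expire --expire=now --all   ## delete the ref-logs
$ git prune                              ## remove unreachable loose objects (git gc --prune=now also repacks)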
You could "purge" the huge repository from time to time in the same way: your git clones will be faster.
If the program won't work without the files, it seems like splitting them into a separate repo is a bad idea. We have large test suites that we break into a separate repo, but those are truly "auxiliary" files.
However, you may be able to manage the files in a separate repo and then use git-submodule to pull them into your project in a sane way. So, you'd still have the full history of all your source but, as I understand it, you'd only have the one relevant revision of your images submodule. The git-submodule facility should help you keep the correct version of the code in line with the correct version of the images.
Here's a good introduction to submodules from the Git Book.
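A sketch of that approach (the repository URLs and the images directory name are placeholders):

$ git submodule add https://example.com/my-images.git images  ## placeholder URL and path
$ git commit -m 'add images submodule'   ## records .gitmodules and the pinned image revision
## later, in a fresh clone of the main project:
$ git clone https://example.com/my-project.git
$ cd my-project
$ git submodule update --init            ## fetch exactly the image revision recorded above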
I discovered git-annex recently, which I find awesome. It was designed for managing large files efficiently. I use it for my photo/music (etc.) collections. The development of git-annex is very active. The content of the files can be removed from the Git repository; only the tree hierarchy is tracked by Git (through symlinks). However, to get the content of a file, a second step is necessary after pulling/pushing, e.g.:
$ git annex add mybigfile
$ git commit -m 'add mybigfile'
$ git push myremote
$ git annex copy --to myremote mybigfile ## This command copies the actual content to myremote
$ git annex drop mybigfile ## Remove content from local repo
...
$ git annex get mybigfile ## Retrieve the content
## or to specify the remote from which to get:
$ git annex copy --from myremote mybigfile
There are many commands available, and there is great documentation on the website. A package is available in Debian.
Another solution, available since April 2015, is Git Large File Storage (LFS) (by GitHub).
It uses git-lfs (see git-lfs.github.com), tested with a server supporting it, lfs-test-server: you can store only the metadata in the Git repo, and the large files elsewhere.
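A minimal sketch of typical usage (the *.psd pattern, file, remote, and branch names are placeholders):

$ git lfs install                  ## set up the LFS filters once per machine
$ git lfs track '*.psd'            ## placeholder pattern for the large binaries
$ git add .gitattributes           ## the tracking rule lives in .gitattributes
$ git add design.psd               ## stored as a small LFS pointer in Git
$ git commit -m 'add design.psd'
$ git push origin main             ## pointers go to Git, the content goes to the LFS server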
Have a look at camlistore. It is not really Git-based, but I find it more appropriate for what you have to do.