According to this:
It is important to note that this is very different from most SCM systems that you may be familiar with. Subversion, CVS, Pe
No, commit objects in git don't contain diffs - instead, each commit object contains a hash of the tree, which recursively and completely defines the content of the source tree at that commit. There's a nice explanation in the git community book of what goes into blob objects, tree objects and commit objects.
All the diffs that are shown to you by git's tools are calculated on demand from the complete content of files.
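To make that concrete, here is a minimal Python sketch, not git's actual code. It computes a blob ID the way git does for SHA-1 repositories (SHA-1 over a `blob <size>\0` header plus the content) and then derives a diff from two complete blobs. The `store` dict, the file names and the `diff_on_demand` helper are made up for illustration.

```python
import difflib
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the ID git gives a blob with this content (SHA-1 repositories)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two complete versions of one file, each stored whole under its own ID.
old = b"hello\nworld\n"
new = b"hello\ngit\n"
store = {git_blob_id(old): old, git_blob_id(new): new}  # hypothetical object store


def diff_on_demand(old_id: str, new_id: str) -> str:
    """Compute a unified diff from two full blobs; no diff is ever stored."""
    a = store[old_id].decode().splitlines(keepends=True)
    b = store[new_id].decode().splitlines(keepends=True)
    return "".join(difflib.unified_diff(a, b, "a/file", "b/file"))


print(diff_on_demand(git_blob_id(old), git_blob_id(new)))
```

The diff only exists for as long as you look at it; what is kept is the two full blobs.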
What the statement means is that most other version control systems need a point of reference in the past to re-create the current commit.
For example, at some point in the past, a diff-based VCS (version control system) would have stored a full snapshot:
x = snapshot
+ = diff
History:
x-----+-----+-----+-----(+) Where we are now
So, in such a scenario, to re-create the state at (now), it would have to check out (x) and then apply the diff for each (+) until it gets to now. Note that it would be extremely inefficient to store only deltas forever, so every so often, delta-based VCSes store a full snapshot. Here's how it's done for Subversion.
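The replay described above could be sketched roughly like this in Python. This is a toy model, not how Subversion or any real VCS stores data: a tree is a dict of path to content, and a delta is a dict of changed paths (with `None` meaning "deleted"). All names here are hypothetical.

```python
from typing import Dict, List, Optional

Tree = Dict[str, str]             # path -> file content
Delta = Dict[str, Optional[str]]  # path -> new content, or None if deleted


def apply_delta(tree: Tree, delta: Delta) -> Tree:
    """Return a new tree with one delta applied."""
    result = dict(tree)
    for path, content in delta.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result


def checkout(snapshot: Tree, deltas: List[Delta], revision: int) -> Tree:
    """Rebuild revision N by replaying the first N deltas on top of the snapshot."""
    tree = snapshot
    for delta in deltas[:revision]:
        tree = apply_delta(tree, delta)
    return tree


snapshot = {"README": "v0\n"}          # the (x) in the diagram
deltas = [                             # each entry is one (+)
    {"README": "v1\n"},
    {"main.py": "print('hi')\n"},
    {"README": None},
]
print(checkout(snapshot, deltas, 3))   # needs the snapshot and all three deltas
```

The cost of `checkout` grows with the number of deltas since the last snapshot, which is why such systems store a fresh full snapshot every so often.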
Now, git is different. Git stores references to complete blobs and this means that with git, only one commit is sufficient to recreate the codebase at that point in time. Git does not need to look up information from past revisions to create a snapshot.
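For contrast, here is a rough Python sketch of the snapshot model. It is a toy: the JSON encoding of trees and commits and the plain SHA-1 of the bytes are assumptions for illustration and do not match git's real object formats. What it does show is that a commit points at a tree, the tree points at complete blobs, and checking out needs nothing from the parent.

```python
import hashlib
import json
from typing import Dict, Optional

objects: Dict[str, bytes] = {}  # hypothetical object store: ID -> bytes


def put(data: bytes) -> str:
    """Store bytes under the hash of their content and return that ID."""
    oid = hashlib.sha1(data).hexdigest()
    objects[oid] = data
    return oid


def commit(files: Dict[str, str], parent: Optional[str]) -> str:
    """Store every file as a full blob, then a tree of references, then a commit."""
    tree = {path: put(text.encode()) for path, text in files.items()}
    tree_id = put(json.dumps(tree, sort_keys=True).encode())
    return put(json.dumps({"tree": tree_id, "parent": parent}).encode())


def checkout(commit_id: str) -> Dict[str, str]:
    """Rebuild the files from this one commit; the parent is never consulted."""
    tree_id = json.loads(objects[commit_id])["tree"]
    tree = json.loads(objects[tree_id])
    return {path: objects[oid].decode() for path, oid in tree.items()}


c1 = commit({"README": "v1\n", "main.py": "print('hi')\n"}, None)
c2 = commit({"README": "v2\n", "main.py": "print('hi')\n"}, c1)
print(checkout(c2))
```

Note that the unchanged `main.py` is stored only once, because both trees reference the same blob ID. That is deduplication of identical content, which is separate from the delta compression discussed next.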
So if that is the case, then where does the delta compression that git uses come in?
Well, it is nothing but a storage optimization: there is no point storing the same information twice if only a tiny amount has changed. So git can store an object as a delta against a similar object, while the object keeps its ID, so that the commit it belongs to, which is in effect a tree of references, can still be re-created without looking at past commits. The thing is, though, that Git does not do this immediately after every commit, but rather on a garbage collection run. So, if git has not run its garbage collection yet, you can see loose objects in your object database (under .git/objects) with very similar content.
However, when Git runs its garbage collection (or when you call git gc manually), similar objects are delta-compressed against each other and a read-only pack file is created. You don't have to worry about running garbage collection manually - git contains heuristics which tell it when to do so.
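As a very rough illustration of the idea (not git's pack format, and not its heuristics for picking a base), the Python sketch below "packs" two similar objects: one is kept in full and the other is stored as copy/insert instructions against it. Lookups still go by the original object ID and return the full content. The line-based delta and all function names are assumptions made for this example.

```python
import difflib
import hashlib
from typing import Dict, List


def oid(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def make_delta(base: bytes, target: bytes) -> List[tuple]:
    """Describe target as 'copy lines from base' and 'insert new lines' steps."""
    a = base.splitlines(keepends=True)
    b = target.splitlines(keepends=True)
    ops: List[tuple] = []
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:  # "replace" or "insert": carry the new lines literally
            ops.append(("insert", b[j1:j2]))
    return ops


def pack(loose: Dict[str, bytes], base_id: str) -> Dict[str, tuple]:
    """Keep the base in full; store every other object as a delta against it."""
    packed: Dict[str, tuple] = {base_id: ("full", loose[base_id])}
    for object_id, data in loose.items():
        if object_id != base_id:
            packed[object_id] = ("delta", base_id, make_delta(loose[base_id], data))
    return packed


def read_object(packed: Dict[str, tuple], object_id: str) -> bytes:
    """Return the full content for an ID, resolving a delta if necessary."""
    entry = packed[object_id]
    if entry[0] == "full":
        return entry[1]
    _, base_id, ops = entry
    base_lines = read_object(packed, base_id).splitlines(keepends=True)
    out: List[bytes] = []
    for op in ops:
        if op[0] == "copy":
            out.extend(base_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return b"".join(out)


v1 = b"".join(b"line %d\n" % i for i in range(100))
v2 = v1.replace(b"line 50\n", b"line fifty\n")
loose = {oid(v1): v1, oid(v2): v2}       # before gc: two nearly identical objects
packed = pack(loose, oid(v1))            # after gc: one full object, one delta
assert read_object(packed, oid(v2)) == v2
print(packed[oid(v2)][2])
```

The delta here is an implementation detail of storage: whoever asks for the object by its ID still gets the complete content, so commits remain self-sufficient snapshots.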