Could somebody explain how git knows internally that files X, Y and Z have changed? What is the process behind the scenes that recognizes when a file has not yet been added or h
You can find your answer in the free book Pro Git, in the chapter Git Internals.
This chapter explains how Git works under the hood.
As Leo stated, Git checks the SHA-1 of a file to see whether it has changed. You can check it like this (taken from Git Internals):
$ echo 'version 1' > test.txt
$ git hash-object -w test.txt
83baae61804e65cc73a7201a7252750c76066a30
Then, write some new content to the file, and save it again:
$ echo 'version 2' > test.txt
$ git hash-object -w test.txt
1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
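To make the hashing step concrete, here is a minimal Python sketch of how a blob ID is derived. It assumes the object format described in Git Internals (a `blob <size>` header, a NUL byte, then the raw content). The function name `git_blob_hash` is made up for illustration and is not part of Git.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 that Git would assign to a blob with this content."""
    # Git hashes a short header ("blob <size>" plus a NUL byte)
    # followed by the raw file bytes.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# echo appends a newline, so the file contents are "version 1\n" and "version 2\n".
print(git_blob_hash(b"version 1\n"))
print(git_blob_hash(b"version 2\n"))
```

The two printed values should match the output of the two `git hash-object` calls above. Changing a single byte of the content gives a completely different ID, which is what lets Git tell the two versions apart.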
https://codewords.recurse.com/issues/two/git-from-the-inside-out
Git is built on a graph. Almost every Git command manipulates this graph. To understand Git deeply, focus on the properties of this graph, not workflows or commands.
The user sets the content of data/number.txt to 2. This updates the working copy, but leaves the index and HEAD commit as they are.
The user adds the file to Git. This adds a blob containing 2 to the objects directory. It points the index entry for data/number.txt at the new blob.
If the answer in the possible duplicate doesn't suffice, you might want to take a look at this: http://www.geekgumbo.com/2011/07/19/git-basics-how-git-saves-your-work/
To make a long story short, Git uses the SHA-1 of the file contents to keep track of changes. Git keeps track of four types of objects: a blob, a tree, a commit, and a tag.
To answer your question on how it keeps track of changes, here's a quote from that link:
The tree object is how Git keeps track of file names and directories. There is a tree object for each directory. The tree object points to the SHA-1 blobs, the files, in that directory, and other trees, sub-directories at the time of the commit. Each tree object is encrypted into, you guessed it, a SHA-1 hash of its contents, and stored in .git/objects. The name of the trees, since they are SHA-1 hashes, allow Git to quickly see if there's been any changes to any files or directories by comparing the name to the previous name. Pretty slick.
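The idea in the quote can be sketched in Python. This is a simplified illustration, not Git's implementation: it assumes every file is a regular non-executable file (mode 100644) with an ASCII name, and the helper names `object_hash` and `tree_hash` are invented for this example.

```python
import hashlib


def object_hash(kind: bytes, body: bytes) -> bytes:
    """SHA-1 over "<kind> <size>", a NUL byte, and the body; returns raw bytes."""
    header = kind + b" %d\x00" % len(body)
    return hashlib.sha1(header + body).digest()


def tree_hash(directory: dict) -> bytes:
    """Hash a nested dict: name -> bytes (file content) or dict (sub-directory)."""
    entries = []
    for name, node in directory.items():
        if isinstance(node, dict):
            # Sub-directories are sorted as if their name ended with "/".
            entries.append((name + "/", b"40000", name, tree_hash(node)))
        else:
            entries.append((name, b"100644", name, object_hash(b"blob", node)))
    # Each entry is "<mode> <name>", a NUL byte, then the child's raw 20-byte hash.
    body = b"".join(
        mode + b" " + name.encode() + b"\x00" + digest
        for _, mode, name, digest in sorted(entries)
    )
    return object_hash(b"tree", body)


before = {"README": b"hello\n", "data": {"number.txt": b"1\n"}}
after = {"README": b"hello\n", "data": {"number.txt": b"2\n"}}

# Editing one file changes its blob hash, its directory's tree hash, and the root.
print(tree_hash(before).hex() != tree_hash(after).hex())
print(tree_hash(before["data"]).hex() != tree_hash(after["data"]).hex())
```

Because each tree's name depends on the hashes of everything beneath it, two trees with the same hash can be treated as identical without looking inside them. Only sub-trees whose hashes differ need to be descended into.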
I found the following explanation helpful. It comes from a course I recently followed at Lynda.com, Git Essential Training by Kevin Skoglund.
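Git generates a hash key composed of 40 hexadecimal characters by running the SHA-1 algorithm on the content it stores. For instance, if the same content is stored on different occasions, it gets the same hash key. (A commit's own hash also covers its metadata, such as the parent, author, and timestamp, so committing the same changes at different times produces different commit hashes.)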
Additionally, it keeps track of previous changes by storing meta information in each commit.
Each subsequent commit refers to a parent commit, while the first commit has no parent (or a null/nil value). The following diagram is helpful in this regard.
Image credit : Git Essential Training by Kevin Skoglund at Lynda.com
The mechanism by which one determines the status of a file is fairly straightforward. To know which files have been staged, one simply diffs the HEAD tree with the index. Any items that appear only in the index have been staged for addition, any items that appear only in HEAD have been staged for removal, and any items that are different have had changes staged.
Similarly, one would detect unstaged changes by diff'ing the index with the working directory.
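The two comparisons described here can be sketched as a diff between path-to-hash mappings. This is only an illustration under simplifying assumptions: the HEAD tree, the index, and the working directory are each flattened into a plain dict, and the function name `diff_snapshots` and the placeholder hashes are hypothetical.

```python
def diff_snapshots(old: dict, new: dict) -> dict:
    """Compare two {path: hash} mappings and classify every path that differs."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(
            path for path in old.keys() & new.keys() if old[path] != new[path]
        ),
    }


# Placeholder hashes stand in for real SHA-1 values.
head = {"a.txt": "h1", "b.txt": "h2", "c.txt": "h3"}
index = {"a.txt": "h1", "b.txt": "h2-edited", "d.txt": "h4"}
worktree = {"a.txt": "h1-edited", "b.txt": "h2-edited", "d.txt": "h4"}

staged = diff_snapshots(head, index)        # HEAD tree vs. index
unstaged = diff_snapshots(index, worktree)  # index vs. working directory

print(staged)
print(unstaged)
```

In this example d.txt is staged for addition, c.txt is staged for removal, b.txt has staged changes, and a.txt has a change that has not been staged yet.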
Your question in particular asks how this can be so fast (after all, computing the SHA-1 hash of a file is not exactly speedy). This is where the index, also known as the cache, comes into play again. The index also has fields for the file size and file modification time. Thus one can simply stat(2) a file on disk and compare against the index's file size and file modification time to know whether to hash the file or not.
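A rough Python sketch of that shortcut follows. It assumes a cache entry holds only the size, the modification time, and the blob hash; the real index records more stat fields than these, and the names `make_entry` and `is_modified` are invented for this example.

```python
import hashlib
import os
import tempfile


def blob_hash(path: str) -> str:
    """Hash a file the way Git hashes a blob ("blob <size>", NUL, content)."""
    with open(path, "rb") as f:
        content = f.read()
    return hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()


def make_entry(path: str) -> dict:
    """Record what the cache remembers about a file at staging time."""
    st = os.stat(path)
    return {"size": st.st_size, "mtime_ns": st.st_mtime_ns, "sha1": blob_hash(path)}


def is_modified(path: str, entry: dict) -> bool:
    st = os.stat(path)
    if st.st_size == entry["size"] and st.st_mtime_ns == entry["mtime_ns"]:
        return False  # stat data matches the cache, so the file is not re-hashed
    # Stat data differs: fall back to the expensive comparison.
    return blob_hash(path) != entry["sha1"]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.txt")
    with open(path, "wb") as f:
        f.write(b"version 1\n")
    entry = make_entry(path)
    print(is_modified(path, entry))  # untouched file, answered by stat alone

    with open(path, "wb") as f:
        f.write(b"version 22\n")  # different size, so the stat check fails
    print(is_modified(path, entry))
```

For an unchanged file the check costs a single `os.stat` call, which is why checking the status of a large working directory does not require reading every file.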