How does git assure that commit SHA keys for identical operations/data are still unique?

Happy的楠姐 2021-01-17 22:22

If I create a file foo with touch foo and then run shasum foo it will print out

da39a3ee5e6b4b0d3255bfef95601890afd80709

4 answers
  • 2021-01-17 22:54

    It doesn't. To reproduce a commit hash, though, you have to construct the commit manually so that the timestamps line up. You can even build a whole valid repository identical to another by writing the files under .git/objects by hand, but because newer commits contain the hashes of older commits, every object in the history has to be exactly identical.

  • 2021-01-17 22:56

    This Gist from Carl Mäsak explains it better than I ever could:

    https://gist.github.com/masak/2415865

    alex@yuzu:~/foo$ git show HEAD
    commit 7438e0a18888854650e6a53a9a5d823d6382de45
    Author: Foo Bar <foo@example.com>
    Date:   Wed Jul 30 19:35:32 2014 -0700
    
        foo
    
    diff --git README README
    new file mode 100644
    index 0000000..e69de29
    

    That commit ID is the SHA-1 checksum of "commit " followed by the byte count (length) of the commit object, a NUL byte ("\0"), and then the output of git cat-file commit HEAD.

    alex@yuzu:~/foo$ git cat-file commit HEAD
    tree 543b9bebdc6bd5c4b22136034a95dd097a57d3dd
    author Foo Bar <foo@example.com> 1406774132 -0700
    committer Alex Balhatchet <kaoru@slackwise.net> 1406774132 -0700
    
    foo
    

    Put it all together and...

    alex@yuzu:~/foo$ (printf "commit %s\0" $(git cat-file commit HEAD | wc -c); git cat-file commit HEAD) | sha1sum
    7438e0a18888854650e6a53a9a5d823d6382de45  -
    

    The sha1sum output matches the commit SHA-1!
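
    The same calculation can be sketched outside the shell. The helper name git_object_id below is made up for illustration, and the commit body is copied from the cat-file output above on the assumption that the object ends with a single newline after the message.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 of '<type> <size>' + NUL byte + body, which is how git names objects."""
    header = f"{obj_type} {len(body)}\0".encode("ascii")
    return hashlib.sha1(header + body).hexdigest()


# Commit body copied from the `git cat-file commit HEAD` output above.
# Assumption: exactly one trailing newline follows the message.
commit_body = (
    b"tree 543b9bebdc6bd5c4b22136034a95dd097a57d3dd\n"
    b"author Foo Bar <foo@example.com> 1406774132 -0700\n"
    b"committer Alex Balhatchet <kaoru@slackwise.net> 1406774132 -0700\n"
    b"\n"
    b"foo\n"
)

# Compare this value with the ID reported by `git show HEAD`.
print(git_object_id("commit", commit_body))

# The same rule covers blobs: an empty file gets the ID below, whose
# prefix e69de29 appears in the "index" line of the diff above.
print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

    If the transcript was copied faithfully, the first printed value should equal the commit ID from git show; the second is the ID of the empty blob.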

  • 2021-01-17 23:03

    The only data hashed to produce a commit's ID is the commit object itself, which you can see with git show --format=raw <commit> (or git cat-file commit <commit>):

    commit e6e53f5256c47b039ed19e95a073484dbb97cbf7
    tree 543b9bebdc6bd5c4b22136034a95dd097a57d3dd
    author Alex Balhatchet <kaoru@slackwise.net> 1406774132 -0700
    committer Alex Balhatchet <kaoru@slackwise.net> 1406774132 -0700
    
        foo
    

    That is:

    1. The tree's ID
    2. The author name and email
    3. The author timestamp
    4. The committer name and email
    5. The committer timestamp
    6. The commit message (and the parent commit IDs, when the commit has parents)

    The examples with --date from other answers haven't worked because --date sets only the author timestamp; you need to override both the committer timestamp and the author timestamp.

    For example the following is completely repeatable:

    alex@yuzu:~$ ( mkdir foo ; cd foo ; git init ; export GIT_AUTHOR_DATE='Wed Jul 30 19:35:32 2014 -0700'; export GIT_COMMITTER_DATE=$GIT_AUTHOR_DATE; touch README; git add README; git commit README --message 'foo' --author 'Foo Bar <foo@example.com>'; git show HEAD --format=raw ; cd .. ; rm -rf foo ) 2>&1 | grep '^commit '
    commit 7438e0a18888854650e6a53a9a5d823d6382de45
    

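    If you run it on your machine with the same committer name and email configured (the example pins only the author and the two dates), you should get exactly the same output.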

    Update

    If you get different output, it should at least be repeatable on your machine. For example, I get different output for different versions of git: 1.7.10.4 reports a new empty README file as 0 files changed, whereas 1.9.1 reports it as 1 file changed, 0 insertions(+), 0 deletions(-). That summary is printed by git commit rather than stored in the commit object, so a differing commit ID is more likely to come from a different committer identity.
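
    To illustrate why both timestamps matter, here is a rough sketch that hashes a parentless commit by hand. The function commit_id is hypothetical (it is not a git API), and it assumes the plain tree/author/committer/message layout shown above with no extra headers.

```python
import hashlib


def commit_id(tree: str, author: str, committer: str, message: str) -> str:
    """Hash a parentless commit: SHA-1 over 'commit <size>', NUL, then the body."""
    body = (
        f"tree {tree}\n"
        f"author {author}\n"
        f"committer {committer}\n"
        f"\n{message}\n"
    ).encode("utf-8")
    return hashlib.sha1(b"commit %d\0" % len(body) + body).hexdigest()


TREE = "543b9bebdc6bd5c4b22136034a95dd097a57d3dd"  # tree ID from the example above
AUTHOR = "Foo Bar <foo@example.com> 1406774132 -0700"

# Identical fields, hashed twice.
same_a = commit_id(TREE, AUTHOR, "Foo Bar <foo@example.com> 1406774132 -0700", "foo")
same_b = commit_id(TREE, AUTHOR, "Foo Bar <foo@example.com> 1406774132 -0700", "foo")

# Same author line, but the committer timestamp is one second later.
later = commit_id(TREE, AUTHOR, "Foo Bar <foo@example.com> 1406774133 -0700", "foo")

print(same_a == same_b)  # True: identical fields give an identical ID
print(same_a == later)   # False: one second of committer time changes the ID
```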

  • 2021-01-17 23:07

    What I would like to know though is, what exactly does git do to assure commit hashes are always unique even if they do the exact same operations with exactly the same contents.

    Nothing. If you create the same contents, you get the same SHA-1.

    First, however, you need to realize that "same contents" of a commit means that (provided you don't get an accidental SHA-1 collision [1] or find a way to break SHA-1) you must create the same complete repository history leading up to and including the commit itself, including all the same trees, author names, timestamps, and so on.

    This is because the contents of a commit are what you see if you run git cat-file -p <sha-1> on a commit (plus the tag-and-size field that says "this object is of type commit", so that there are no trivial ways to break things by creating a blob with the same contents as a previous commit). Here's one as an example:

    $ git cat-file -p 996b0fdbb4ff63bfd880b3901f054139c95611cf
    tree e760f781f2c997fd1d26f2779ac00d42ca93f534
    parent 6da748a7cebe3911448fabf9426f81c9df9ec54f
    parent 740c281d21ef5b27f6f1b942a4f2fc20f51e8c7e
    author Junio C Hamano <gitster@pobox.com> 1406140600 -0700
    committer Junio C Hamano <gitster@pobox.com> 1406140600 -0700
    
    Sync with v2.0.3
    
    * maint:
      Git 2.0.3
      .mailmap: combine Stefan Beller's emails
      git.1: switch homepage for stats
    

    Note that this string includes the tree and its SHA-1, both of this commit's parent SHA-1s, the author and timestamp, the committer and timestamp, and the message. If you change even a single bit of this—such as by trying to change the underlying tree, or using some different parent commit(s)—you will get a new, different SHA-1, rather than 996b0fdbb4ff63bfd880b3901f054139c95611cf.
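
    A small sketch of how the parent lines tie a commit to its whole history. Everything here is illustrative: commit_id is a made-up helper, the identity string is invented, and the only real value is the ID of git's empty tree.

```python
import hashlib


def commit_id(tree, parents, author, committer, message):
    """Build a commit body in git's layout and return its SHA-1 object ID."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = ("\n".join(lines) + "\n").encode("utf-8")
    return hashlib.sha1(b"commit %d\0" % len(body) + body).hexdigest()


EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # ID of git's empty tree
WHO = "Example Person <person@example.com> 1406140600 -0700"  # hypothetical identity

# Two root commits that differ by one character in the message.
root_a = commit_id(EMPTY_TREE, [], WHO, WHO, "first")
root_b = commit_id(EMPTY_TREE, [], WHO, WHO, "first!")

# Children whose own fields are identical; only the parent ID differs.
child_a = commit_id(EMPTY_TREE, [root_a], WHO, WHO, "second")
child_b = commit_id(EMPTY_TREE, [root_b], WHO, WHO, "second")

print(child_a == child_b)  # False: the difference in the roots propagates
```

    Because each child embeds its parent's ID, matching a commit ID requires matching every ancestor as well.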

    So the answer to this:

    So in theory if me and you do exactly the same steps at exactly the same time with exactly the same configured author, email etc, we would actually get the same commit SHA key?

    is "yes". However ... you must start with the same staging area (this is what will become the tree), and the same parent commits. If you then configure your author, email, etc., exactly the same as the other guy, and both of you create a new commit at the same second (or using git's environment variables [2] to force the time stamps), you both get the same new commit.

    Which is precisely what we want. It doesn't matter if you create it, when you're named "me", or I create it, when I'm named "me", if all the rest of the contents are the same. Because whoever creates it, the other "me" can clone it, and then we both have the same thing that way too.

    (If I want to be sure that the "me" that creates something is not confused with the real me, I need to add something unique, that I know and the other me doesn't. Of course, if I publish this thing somewhere, the other me now knows it. But this is what signed, annotated tags are for. They can contain a GPG signature.)


    [1] The chances of an accidental hash collision (for any pair of objects; chances rise with more objects) are 1 out of 2^160, which is ... very small. :-) The rise is actually very rapid, so that by the time you have a million objects, it's about 1 out of 2^121. The formula I use here is:

    1 - exp(-(n * (n-1)) / (2 * r))

    where r = 2^160 and n is the number of objects. Without the subtraction from 1, the equation calculates the "safety margin", as it were: the chance that we won't have an accidental hash collision. If we want to keep the collision probability in the same range as the chance that a disk drive reads back the wrong contents for a file (or at least, the chance that disk-makers claim), we need to keep it around 10^-18, which means we need to avoid putting more than about 1.7 quadrillion (1.7E15) objects in our git databases.
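
    The numbers in this footnote can be checked with a few lines of Python. This is only a sketch: the function names are made up, and math.expm1 is used because the naive 1 - exp(-x) rounds to zero in floating point when x is this small.

```python
import math

R = 2 ** 160  # number of possible SHA-1 values


def collision_probability(n: int, r: int = R) -> float:
    """Birthday approximation: 1 - exp(-n(n-1) / (2r)), computed via expm1."""
    x = n * (n - 1) / (2 * r)
    return -math.expm1(-x)


def max_objects(p: float, r: int = R) -> float:
    """Invert the small-p approximation p ~ n^2 / (2r) to get n."""
    return math.sqrt(2 * r * p)


print(math.log2(collision_probability(2)))        # about -160
print(math.log2(collision_probability(10 ** 6)))  # about -121.1
print(f"{max_objects(1e-18):.2e}")                # about 1.71e+15
```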

    [2] There are many git environment variables that you can set to override various defaults. The ones for the author and committer, including date and email, are:

    • GIT_AUTHOR_NAME
    • GIT_AUTHOR_EMAIL
    • GIT_AUTHOR_DATE
    • GIT_COMMITTER_NAME
    • GIT_COMMITTER_EMAIL
    • GIT_COMMITTER_DATE
    • EMAIL

    as described in the git commit-tree documentation.
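
    As a rough sketch of how these variables might be pinned from a script: both helper functions below are hypothetical, the identity and date are the ones used in an earlier answer, and actually running the commit step assumes git is installed and a repository with staged changes exists.

```python
import os
import subprocess


def pinned_identity_env(name, email, date, base=None):
    """Copy an environment and pin author and committer name, email, and date."""
    env = dict(os.environ if base is None else base)
    for role in ("AUTHOR", "COMMITTER"):
        env[f"GIT_{role}_NAME"] = name
        env[f"GIT_{role}_EMAIL"] = email
        env[f"GIT_{role}_DATE"] = date
    return env


def commit_with_pinned_identity(repo_dir, message, env):
    """Run `git commit` with the pinned environment (needs git and staged changes)."""
    subprocess.run(
        ["git", "commit", "--message", message],
        cwd=repo_dir, env=env, check=True,
    )


# base={} keeps the printout short; pass base=None to inherit PATH and the
# rest of the current environment when you actually run git.
env = pinned_identity_env(
    "Foo Bar", "foo@example.com", "Wed Jul 30 19:35:32 2014 -0700", base={}
)
print(sorted(env))
```

    With the same tree, parents, and message, two people using an environment like this should end up with the same commit ID.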
