How does git assure that commit SHA keys for identical operations/data are still unique?


If I create a file foo with touch foo and then run shasum foo it will print out

da39a3ee5e6b4b0d3255bfef95601890afd80709


                      
4 answers

[愿得一人] 2021-01-17 22:54

It doesn't. To get an identical commit, though, you have to construct it manually so that the timestamps line up. You can even construct a whole valid repository identical to another by editing the files under .git/objects, but because newer commits contain the hashes of older commits, the entire history has to be exactly identical.
                                                                        
                                                        
            
            
              
                

离开以前 2021-01-17 22:56

This Gist from Carl Mäsak explains it better than I ever could:

https://gist.github.com/masak/2415865

alex@yuzu:~/foo$ git show HEAD
commit 7438e0a18888854650e6a53a9a5d823d6382de45
Author: Foo Bar <foo@example.com>
Date:   Wed Jul 30 19:35:32 2014 -0700

    foo

diff --git README README
new file mode 100644
index 0000000..e69de29


That commit id is the SHA-1 checksum of the header "commit <length>\0" (the word commit, a space, the byte length of the commit body, and a NUL byte) followed by the output of git cat-file commit HEAD.

alex@yuzu:~/foo$ git cat-file commit HEAD
tree 543b9bebdc6bd5c4b22136034a95dd097a57d3dd
author Foo Bar <foo@example.com> 1406774132 -0700
committer Alex Balhatchet <kaoru@slackwise.net> 1406774132 -0700

foo


Put it all together and...

alex@yuzu:~/foo$ (printf "commit %s\0" $(git cat-file commit HEAD | wc -c); git cat-file commit HEAD) | sha1sum
7438e0a18888854650e6a53a9a5d823d6382de45  -


The sha1sum output matches the commit SHA-1!
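
The same calculation can be sketched in Python. This is a minimal illustration rather than git's own code: the function name git_object_id is made up for this example, and it assumes the default SHA-1 object format. It also shows why git's id for an empty file (the e69de29 in the diff above) differs from a plain SHA-1 of the empty content: the header is hashed too.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash "<type> <byte length>\\0" followed by the body, like the shell pipeline above."""
    header = f"{obj_type} {len(body)}\0".encode("ascii")
    return hashlib.sha1(header + body).hexdigest()


# Plain SHA-1 of empty content, which is what shasum prints for an empty file.
plain_empty = hashlib.sha1(b"").hexdigest()

# Git's id for the same empty content stored as a blob.
empty_blob = git_object_id("blob", b"")

print(plain_empty)
print(empty_blob)
```

Passing the bytes printed by git cat-file commit HEAD with obj_type set to "commit" performs the same computation as the sha1sum pipeline.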
                                                                        
                                                        
            
            
              
                

悲哀的现实 2021-01-17 23:03

The only things that are SHA-1'd to give the commit object its id are the fields shown by git show --format=raw <commit>:

commit e6e53f5256c47b039ed19e95a073484dbb97cbf7
tree 543b9bebdc6bd5c4b22136034a95dd097a57d3dd
author Alex Balhatchet <kaoru@slackwise.net> 1406774132 -0700
committer Alex Balhatchet <kaoru@slackwise.net> 1406774132 -0700

    foo


That is:


The tree's id
The parent commit ids, if any (this first commit has none)
The author name, email
The author timestamp
The committer name, email
The committer timestamp
The commit message


The examples with --date from other answers haven't worked because --date only sets the author timestamp; you need to override both the committer timestamp and the author timestamp.

For example the following is completely repeatable:

alex@yuzu:~$ ( mkdir foo ; cd foo ; git init ; export GIT_AUTHOR_DATE='Wed Jul 30 19:35:32 2014 -0700'; export GIT_COMMITTER_DATE=$GIT_AUTHOR_DATE; touch README; git add README; git commit README --message 'foo' --author 'Foo Bar <foo@example.com>'; git show HEAD --format=raw ; cd .. ; rm -rf foo ) 2>&1 | grep '^commit '
commit 7438e0a18888854650e6a53a9a5d823d6382de45


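If you run it on your machine you should get exactly the same output, provided your committer name and email match mine: the command overrides the author with --author, but the committer identity still comes from your git configuration (or from GIT_COMMITTER_NAME and GIT_COMMITTER_EMAIL).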

Update

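If you get different output it should at least be repeatable. For example, I get different output from different versions of git: 1.7.10.4 reports a new empty README file as 0 files changed, whereas 1.9.1 reports it as 1 file changed, 0 insertions(+), 0 deletions(-). That summary line is printed by git commit and is not stored in the commit object, so it cannot change the commit id by itself; a differing id more likely comes from a different committer name or email, which the command above does not pin.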
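
To make the dependence on those fields concrete, here is a rough Python sketch that assembles a commit object's text and hashes it. The helper commit_id is hypothetical, it assumes the plain layout shown by git cat-file above (no encoding or signature headers), and the "Someone Else" committer is invented for the comparison.

```python
import hashlib


def commit_id(tree, author, committer, message, parents=()):
    """Build the text of a commit object and hash it with a "commit <length>\\0" header."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode("utf-8")
    header = f"commit {len(body)}\0".encode("ascii")
    return hashlib.sha1(header + body).hexdigest()


tree = "543b9bebdc6bd5c4b22136034a95dd097a57d3dd"
when = "1406774132 -0700"
author = f"Foo Bar <foo@example.com> {when}"
committer = f"Alex Balhatchet <kaoru@slackwise.net> {when}"

first = commit_id(tree, author, committer, "foo\n")
second = commit_id(tree, author, committer, "foo\n")
# Hypothetical second committer: same tree, author, dates and message.
other = commit_id(tree, author, f"Someone Else <someone@example.com> {when}", "foo\n")

print(first == second)  # True: identical fields give an identical id
print(first == other)   # False: only the committer identity differs
```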
                                                                        
                                                        
            
            
              
                

我寻月下人不归 2021-01-17 23:07


  What I would like to know though is, what exactly does git do to assure commit hashes are always unique even if they do the exact same operations with exactly the same contents.


Nothing.  If you create the same contents, you get the same SHA-1.

First, however, you need to realize that "same contents" of a commit means that—provided you don't get an accidental SHA-1 collision¹ or find a way to break SHA-1—you must create the same complete repository history leading up to and including the commit itself, including all the same trees, author-names, time-stamps, and so on.

This is because the contents of a commit are what you see if you run git cat-file -p <sha-1> on a commit (plus the type-and-size header that says "this object is of type commit", so that there are no trivial ways to break things by creating a blob with the same contents as a previous commit).  Here's one as an example:

$ git cat-file -p 996b0fdbb4ff63bfd880b3901f054139c95611cf
tree e760f781f2c997fd1d26f2779ac00d42ca93f534
parent 6da748a7cebe3911448fabf9426f81c9df9ec54f
parent 740c281d21ef5b27f6f1b942a4f2fc20f51e8c7e
author Junio C Hamano <gitster@pobox.com> 1406140600 -0700
committer Junio C Hamano <gitster@pobox.com> 1406140600 -0700

Sync with v2.0.3

* maint:
  Git 2.0.3
  .mailmap: combine Stefan Beller's emails
  git.1: switch homepage for stats


Note that this string includes the tree and its SHA-1, both of this commit's parent SHA-1s, the author and timestamp, the committer and timestamp, and the message.  If you change even a single bit of this—such as by trying to change the underlying tree, or using some different parent commit(s)—you will get a new, different SHA-1, rather than 996b0fdbb4ff63bfd880b3901f054139c95611cf.

So the answer to this:


  So in theory if me and you do exactly the same steps at exactly the same time with exactly the same configured author, email etc, we would actually get the same commit SHA key?


is "yes".  However ... you must start with the same staging area (this is what will become the tree), and the same parent commits.  If you then configure your author, email, etc., exactly the same as the other guy, and both of you create a new commit at the same second (or using git's environment variables² to force the time stamps), you both get the same new commit.

Which is precisely what we want.  It doesn't matter if you create it, when you're named "me", or I create it, when I'm named "me", if all the rest of the contents are the same.  Because whoever creates it, the other "me" can clone it, and then we both have the same thing that way too.

(If I want to be sure that the "me" that creates something is not confused with the real me, I need to add something unique, that I know and the other me doesn't.  Of course, if I publish this thing somewhere, the other me now knows it.  But this is what signed, annotated tags are for.  They can contain a GPG signature.)



¹The chances of an accidental hash collision (for any pair of objects; chances rise with more objects) are 1 out of 2¹⁶⁰, which is ... very small. :-)  The rise is actually very rapid, so that by the time you have a million objects, it's about 1 out of 2¹²¹.  The formula I use here is:

1 - exp(-(n * (n-1)) / (2 * r))

where r = 2¹⁶⁰ and n is the number of objects.  Without the subtraction from 1, the equation calculates the "safety margin", as it were: the chance that we won't have an accidental hash collision.  If we want to keep the collision probability in the same range as the chance that a disk drive reads back the wrong contents for a file (or at least, the chance that disk-makers claim), we need to keep it around 10^-18, which means we need to avoid putting more than about 1.7 quadrillion (1.7E15) objects in our git databases.
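
The figures in this footnote can be checked with a few lines of Python. This is only a sketch of the approximation above; the name collision_probability is made up here, and math.expm1 is used instead of a literal 1 - exp(...) because the probabilities are far too small for that subtraction to survive in floating point.

```python
import math

R = 2 ** 160  # number of possible SHA-1 values


def collision_probability(n: int, r: int = R) -> float:
    """Approximate chance that at least two of n random hashes collide."""
    x = n * (n - 1) / (2 * r)
    # -expm1(-x) equals 1 - exp(-x) but keeps precision when x is tiny.
    return -math.expm1(-x)


p_million = collision_probability(10 ** 6)
p_limit = collision_probability(1_700_000_000_000_000)  # about 1.7E15 objects

print(math.log2(p_million))  # close to -121
print(p_limit)               # close to 1e-18
```

Evaluating 1 - math.exp(-x) directly for these inputs returns 0.0, since exp(-x) rounds to exactly 1.0.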

²There are many git environment variables that you can set to override various defaults.  The ones for the author and committer, including date and email, are:


GIT_AUTHOR_NAME
GIT_AUTHOR_EMAIL
GIT_AUTHOR_DATE
GIT_COMMITTER_NAME
GIT_COMMITTER_EMAIL
GIT_COMMITTER_DATE
EMAIL


as described in the git commit-tree documentation.
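
As a rough sketch of how these variables can be used to force identical commits, the Python below pins the six GIT_AUTHOR_* and GIT_COMMITTER_* values and creates an empty commit in a throwaway repository. The helper names pinned_identity_env and empty_commit_id are invented for this example. It assumes git is installed and that nothing in the local configuration (commit signing or hooks, for instance) adds extra data to the commit.

```python
import os
import subprocess
import tempfile


def pinned_identity_env(name: str, email: str, date: str) -> dict:
    """Copy the current environment and pin author and committer to the same values."""
    env = dict(os.environ)
    for role in ("AUTHOR", "COMMITTER"):
        env[f"GIT_{role}_NAME"] = name
        env[f"GIT_{role}_EMAIL"] = email
        env[f"GIT_{role}_DATE"] = date
    return env


def empty_commit_id(env: dict, message: str = "foo") -> str:
    """Create one empty commit in a throwaway repository and return its id."""
    with tempfile.TemporaryDirectory() as repo:
        def git(*args: str) -> str:
            result = subprocess.run(
                ["git", *args], cwd=repo, env=env,
                check=True, capture_output=True, text=True,
            )
            return result.stdout.strip()

        git("init", "--quiet")
        git("commit", "--quiet", "--allow-empty", "-m", message)
        return git("rev-parse", "HEAD")


# Same name, email and timestamp for both roles (values borrowed from the earlier answer).
env = pinned_identity_env("Foo Bar", "foo@example.com", "1406774132 -0700")
```

Under those assumptions, calling empty_commit_id(env) twice should return the same id, because each call builds a separate repository from identical inputs.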
                                                                        
                                                        
            
            
              
                