Git hash duplicates

梦想的初衷 提交于 2020-12-09 16:52:28

问题


Git allows to retrieve the hash of the commit with commands like:

git rev-parse HEAD

which gives 33b316c or

git rev-parse --short HEAD

which gives 33b316cbeeab3d69e79b9fb659414af4e7829a32 I know that long hashes in practice will never collide.

In practice, the short hashes are used much more often. I'd like to know what's the probability for the short ones to collide? Does git take any measures to overcome possible collisions (when for example using git checkout)?


回答1:


I give a formula in my book—see pp. 78-79—but if you're looking for a simple one, the point at which the probability of some hash collision reaches about 50% in an n-bit hash is when you hash roughly 2n/2 keys. The SHA-1 hash itself is 160 bits, represented as 40 hexadecimal digits, each representing 4 of the 160 bits. Truncating that to 7 hexadecimal digits leaves 28 bits, so you will reach 50%-chance-of-collision at about 214 keys, or 16384 objects. If you constrain the objects to be only commits, that's a pretty decent number of commits, but Git places all objects—commits, trees, annotated tag objects, and blobs—in a single hash-indexed key-value store.

The probability of the hashes of any given pair of keys colliding is just 1 in 2n, i.e., 1 in 228 or 1 out of 268 million. The reason it increases so fast to 50%, as the number of keys grows, is known as the Birthday Paradox or birthday problem. 50% is of course far too scary; with 28 bits, if we want the overall probability to be below 0.1%, we should keep the number of objects below about 1230. By going to 32 bits (8 character abbrevations) we double this to about 2460, but that's still not very many objects.

By the time you have 16k objects in your store, you probably should use at least 10 hexadecimal digits, giving 240 possible hash values and a p-bar value of about .99987794... (about .019% chance of collisions). Nine hex digits gives only 236 hash values, producing a p-bar of .99804890... or 0.19% chance of collision, which I think is too high.

If you can restrict your ambiguous-matching code to only commits—or only commit-ish, which in Git means commits or annotated tags—the built in defaults work pretty well. (Git will in fact do this in a lot of cases.) But Git's internal code for computing the "right" abbreviation length is, at least in my opinion, far too care-free, too "loosey-goosey", as it uses the 50%-collision-probability square-root trick in contexts where the resulting hash might be used to identify any object.

(As noted in comments, internally Git always uses the full hashes. It's only at the not-Git / Git interface, e.g., git log <hash> or git show <hash> user-facing commands, that you can type in an abbreviated hash, or ask for an abbreviated output hash. Here Git will default to using the 50%-collision-probability number to compute how many characters to show, starting with an estimate of the number of objects in the database. If you're supplying the hash, you choose how much to supply. If you're asking Git to provide it, you can still choose how much, with --abbrev=number. Note that there's an absolute minimum of 4: git log abc won't treat abc as a hash ID, but git log abcd will treat abcd as an abbreviated a hash ID. There's also a very old default of 7 characters, from the Git 1.7-ish days.)




回答2:


The chances of having duplicates in the list of a repo's short hashes increase very fast with the number of commits.

However, it doesn't cause too many problems since these are not at all meant to be unique, they're only a comfort feature, providing an alternative way to designate commits in some commands.

When you choose to feed a command a short hash and it actually is ambiguous in your repo (i. e. can't be resolved to a unique commit), git prompts you with a list of candidates and you get to choose which you intended to call.

Example output :

$ git ch 2334
error: short SHA1 2334 is ambiguous
hint: The candidates are:
hint:   233475fec8 commit 2018-02-14 - Merge pull request #175 in someRepo from someBranch to someOtherBranch
hint:   2334790f05 commit 2018-06-14 - Merge pull request #917 in someRepo from someBranch to yetAnotherBranch
hint:   233415cfa2 tree
hint:   233438d772 tree
hint:   23348a014a tree
error: pathspec '2334' did not match any file(s) known to git

This is for git checkout <shortHash> typed in the CLI for example, but I'm unsure as to how it behaves in the context of a script. Either it errors with an exit code 1 or it automatically takes one of the candidates, I'll have to check that.



来源:https://stackoverflow.com/questions/56012233/git-hash-duplicates

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!