In my repo, how long must the longest hash prefix be to prevent any overlap?

女生的网名这么多〃 submitted on 2019-11-30 07:27:29

Question


The --abbrev-commit flag can be used in conjunction with git log and git rev-list in order to show partial prefixes instead of the full 40-character SHA-1 hashes of commit objects. According to the Pro Git book,

it defaults to using seven characters but makes them longer if necessary to keep the SHA-1 unambiguous [...]

Additionally, short SHAs are at least 4 characters long. Still according to the Pro Git book,

Generally, eight to ten characters are more than enough to be unique within a project.

As an example, the Linux kernel, which is a pretty large project with over 450k commits and 3.6 million objects, has no two objects whose SHA-1s overlap more than the first 11 characters.

Since the length of the longest prefix required to prevent any overlap among all prefix hashes of commit objects (11, in the case of the Linux kernel) is a crude indicator of a repo's size, I'd like to programmatically determine the corresponding quantity in my own local repository. How can I do that?


Answer 1:


The following shell script, when run in a local repo, prints the length of the longest prefix required to prevent any overlap among all prefix hashes of commit objects of that repository.

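MAX_LENGTH=4  # Git never abbreviates below 4 characters

# List every reachable commit, abbreviated as far as uniqueness allows,
# and keep track of the longest abbreviation seen.
git rev-list --abbrev=4 --abbrev-commit --all |
  ( while read -r line; do
      if [ "${#line}" -gt "$MAX_LENGTH" ]; then
        MAX_LENGTH=${#line}
      fi
    done && printf '%s\n' "$MAX_LENGTH"
  )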

The last time I edited this answer, the script printed

  • "9" when run in a clone of the Git-project repo,
  • "9" when run in a clone of the OpenStack repo,
  • "11" when run in a clone of the Linux-kernel repo.
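
For repeated use, the same measurement can be wrapped in a small function. The following is only a sketch, not the original script: max_abbrev_len is a hypothetical name, and awk stands in for the shell loop, so no subshell is needed to keep the running maximum in scope. It keeps the floor of 4 used by the script above.

```shell
# Hypothetical helper (not part of Git): read abbreviated hashes, one per
# line, on stdin and print the length of the longest one (never below 4).
max_abbrev_len() {
  awk 'BEGIN { max = 4 }
       length($0) > max { max = length($0) }
       END { print max }'
}

# Usage, from inside a repository:
#   git rev-list --abbrev=4 --abbrev-commit --all | max_abbrev_len
```

Because the function only reads standard input, it can be checked against a hand-written list of hashes before being pointed at a real repository.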



Answer 2:


Jubob's script is great, upvoted.

If you want to get an idea of the distribution of minimum-commit-hash-length, you can run this one-liner:

git rev-list --abbrev=4 --abbrev-commit --all | ( while read -r line; do echo ${#line}; done; ) | sort -n | uniq -c

For the git project itself today (git-on-git), this yields something like:

 1788 4
35086 5
 7881 6
  533 7
   39 8
    4 9

That is, 1788 commits can be identified uniquely with a 4-character prefix (possibly fewer, but 4 is Git's minimum abbreviation length), and 4 commits require 9 of the 40 characters of the hash in order to be selected uniquely.
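
The one-liner above can also be written as a single awk pass, which may be noticeably faster than a shell read loop on large histories. This is a sketch under assumptions: abbrev_len_histogram is a hypothetical name, and the output is "count length" separated by a single space, without the padding that uniq -c adds.

```shell
# Hypothetical helper (not part of Git): read abbreviated hashes on stdin
# and print "count length" pairs, sorted by increasing length.
abbrev_len_histogram() {
  awk '{ count[length($0)]++ }
       END { for (len in count) print count[len], len }' |
    sort -k2,2n
}

# Usage, from inside a repository:
#   git rev-list --abbrev=4 --abbrev-commit --all | abbrev_len_histogram
```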

By comparison, a much larger project such as the Linux kernel has this distribution today:

  6179 5
446463 6
139247 7
 10018 8
   655 9
    41 10
     3 11

So with a database of nearly 5 million objects and 600k commits, there are 3 commits currently requiring 11 of 40 hexadecimal digits to distinguish them from all other commits.
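
To see which commits sit at the top of such a distribution, the abbreviated hashes themselves can be filtered by length. A minimal sketch, assuming the same rev-list output as input; longest_abbrevs is a hypothetical name, and the whole list is held in memory, which should be acceptable for a few hundred thousand short lines.

```shell
# Hypothetical helper (not part of Git): read abbreviated hashes on stdin
# and print only those whose length equals the maximum length seen.
longest_abbrevs() {
  awk 'BEGIN { max = 0 }
       { line[NR] = $0; if (length($0) > max) max = length($0) }
       END { for (i = 1; i <= NR; i++) if (length(line[i]) == max) print line[i] }'
}

# Usage, from inside a repository:
#   git rev-list --abbrev=4 --abbrev-commit --all | longest_abbrevs
```

Each printed prefix can then be passed to git show to inspect the corresponding commit.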



Source: https://stackoverflow.com/questions/32405922/in-my-repo-how-long-must-the-longest-hash-prefix-be-to-prevent-any-overlap
