A Regex to match a SHA1

后端 未结 6 1981
旧巷少年郎
旧巷少年郎 2020-12-02 18:37

I\'m trying to match SHA1\'s in generic text with a regular expression.

Ideally I want to avoid matching words.

It\'s safe to say that full SHA1\'s have a d

相关标签:
6条回答
  • 2020-12-02 18:56

    What exactly are you trying to do? You shouldn't need to parse anything git outputs with heuristics -- you can always request exactly the data you need.

    If you want to match a full hex representation of an SHA1 sum, try:

    /\b([a-f0-9]{40})\b/
    

    That is, a word consisting of 40 characters which are either digits or the letters a through f.

    If you only have a few characters and don't know where they are, then you are pretty much out of luck. Is "e78fd98" an abbreviated commit ID? Maybe, but what about "1234567"? Is that a commit ID? A problem ticket number? A number that makes a test fail?

    Without context, you can't really know what the data means.

    To answer your direct question, there is no property of SHA1 that would make the first three characters (in hex form) digits. You are just lucky, or perhaps unlucky, depending on how you look at it.

    0 讨论(0)
  • 2020-12-02 18:58

    I use this in ruby. It allows for a short version of the sha (6 - 8 in case of clashes) and for the full sha at 40 chars long.

    \A(([0-9a-f]{40})|([0-9a-f]{6,8}))\z
    
    0 讨论(0)
  • 2020-12-02 19:00

    I'm going to assume you want to match against hexadecimal printed representation of a SHA1, and not against the equivalent 20 raw bytes. Furthermore, I'm going to assume that the SHA1's in question use only lower-case letters to represent hex digits. You'll have to adjust the regular expression if your requirements differ.

    grep -o -E -e "[0-9a-f]{40}"
    

    Will match such a SHA1. You'll need to translate the above regular expression from egrep's dialect to whatever tool you happen to be using. Since the match must be exactly 40 characters long I don't think you're in danger of accidentally matching words. I don't know of any 40-character words that consist only of the letters a through f.

    edit:

    Better yet: use A Regex to match a SHA1 as his solution includes checking for word boundaries at both ends. I overlooked that above.

    0 讨论(0)
  • 2020-12-02 19:05

    For this type of hash : 43:A4:02:B7:B6:1D:89:86:C5:CE:AD:52:96:D9:2E:7B:64:98:45:6A :

    /^[0-9A-F]{2}(:[0-9A-F]{2}){19}$/
    
    0 讨论(0)
  • 2020-12-02 19:09

    You can consider the SHA1 hashes to be completely random, so this reduces to a matter of probabilities. The probability that a given digit is not a number is 6/16, or 0.375. The probability that three SHA1 digits are all not numbers is 0.375 ** 3, or 0.0527 (5% ish). At six digits, this reduces again to 0.00278 (0.2%). At five digits, the probability of all letters drops below 1% (you said you wanted to match 99% of the time).

    It's easy to craft a regular expression that always matches SHA1 values:

    \b[0-9a-f]{5,40}\b
    

    However, this may also match perfectly good five letter words, like "added" or "faded". In my /usr/share/dict/words file, there are several six letter words that would match: "accede", "beaded", "bedded", "decade", "deface", "efface", and "facade" are the most likely. At seven letters, there is only "deedeed" which is unlikely to appear in prose. It all depends on how many false positives you can tolerate, and what the likely words you will encounter actually are.

    0 讨论(0)
  • 2020-12-02 19:09

    If you have access to the repo, you can use git cat-file -e to check for sure that it represents an object in the repo. This is very fast, too. If you further want to restrict this to just commits and tags, you can use git cat-file -t to find out the type of the object.

    This could be used, for example, to search human-generated text for mentions of git commits and generate hyperlinks to a git web interface.

    0 讨论(0)
提交回复
热议问题