RegEx for matching UK Postcodes

广开言路 2020-11-22 01:38

I'm after a regex that will validate only a full, complex UK postcode within an input string. All of the uncommon postcode forms must be covered as well as the usual ones. For in

30 Answers
  • 2020-11-22 01:45

    According to this Wikipedia table


    This pattern covers all the cases:

    (?:[A-Za-z]\d ?\d[A-Za-z]{2})|(?:[A-Za-z][A-Za-z\d]\d ?\d[A-Za-z]{2})|(?:[A-Za-z]{2}\d{2} ?\d[A-Za-z]{2})|(?:[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]{2})|(?:[A-Za-z]{2}\d[A-Za-z] ?\d[A-Za-z]{2})
    

    When using it on Android/Java, write \\d, since the backslash has to be escaped inside a Java string literal.
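
    As a quick check, the pattern can be tried in Python. This is only a sketch: the names SHAPE_PATTERN and has_postcode_shape are my own, and fullmatch is used so that the whole input, not just a prefix, has to fit one of the five shapes.

```python
import re

# The pattern from this answer, split over several lines but otherwise unchanged.
# It only checks the letter/digit shape, not which letters are allowed where.
SHAPE_PATTERN = re.compile(
    r"(?:[A-Za-z]\d ?\d[A-Za-z]{2})"
    r"|(?:[A-Za-z][A-Za-z\d]\d ?\d[A-Za-z]{2})"
    r"|(?:[A-Za-z]{2}\d{2} ?\d[A-Za-z]{2})"
    r"|(?:[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]{2})"
    r"|(?:[A-Za-z]{2}\d[A-Za-z] ?\d[A-Za-z]{2})"
)


def has_postcode_shape(text: str) -> bool:
    """Return True if the whole string fits one of the five shapes."""
    return SHAPE_PATTERN.fullmatch(text) is not None


for candidate in ["M1 1AE", "B33 8TH", "DN55 1PT", "W1A 0AX", "EC1A 1BB", "M1  1AE", "QQ1 1QQ"]:
    print(candidate, has_postcode_shape(candidate))
```

    Note that "QQ1 1QQ" is accepted, because this pattern places no restriction on which letters may appear in each position, while "M1  1AE" (two spaces) is rejected.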

  • 2020-11-22 01:46

    We were given a spec:

    UK postcodes must be in one of the following forms (with one exception, see below): 
        § A9 9AA 
        § A99 9AA
        § AA9 9AA
        § AA99 9AA
        § A9A 9AA
        § AA9A 9AA
    where A represents an alphabetic character and 9 represents a numeric character.
    Additional rules apply to alphabetic characters, as follows:
        § The character in position 1 may not be Q, V or X
        § The character in position 2 may not be I, J or Z
        § The character in position 3 may not be I, L, M, N, O, P, Q, R, V, X, Y or Z
        § The character in position 4 may not be C, D, F, G, I, J, K, L, O, Q, S, T, U or Z
        § The characters in the rightmost two positions may not be C, I, K, M, O or V
    The one exception that does not follow these general rules is the postcode "GIR 0AA", which is a special valid postcode.

    We came up with this:

    /^([A-PR-UWYZ][A-HK-Y0-9](?:[A-HJKS-UW0-9][ABEHMNPRV-Y0-9]?)?\s*[0-9][ABD-HJLNP-UW-Z]{2}|GIR\s*0AA)$/i
    

    But note - this allows any number of spaces in between groups.
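
    To see the effect of \s* in practice, here is a small Python sketch. SPEC_PATTERN is the expression above with the /i flag expressed as re.IGNORECASE; STRICT_PATTERN is a hypothetical variant of my own (not part of the spec we were given) that allows at most one literal space.

```python
import re

# The expression from this answer; the trailing /i becomes re.IGNORECASE.
SPEC_PATTERN = re.compile(
    r"^([A-PR-UWYZ][A-HK-Y0-9](?:[A-HJKS-UW0-9][ABEHMNPRV-Y0-9]?)?"
    r"\s*[0-9][ABD-HJLNP-UW-Z]{2}|GIR\s*0AA)$",
    re.IGNORECASE,
)

# Assumption: a stricter variant that swaps each "\s*" for " ?",
# i.e. zero or one plain space between the two halves.
STRICT_PATTERN = re.compile(
    SPEC_PATTERN.pattern.replace(r"\s*", " ?"),
    re.IGNORECASE,
)

for candidate in ["EC1A 1BB", "ec1a1bb", "EC1A    1BB", "GIR 0AA", "QA1 1AA"]:
    print(
        repr(candidate),
        bool(SPEC_PATTERN.match(candidate)),
        bool(STRICT_PATTERN.match(candidate)),
    )
```

    With these inputs the two patterns only disagree on "EC1A    1BB", which the original accepts and the stricter variant rejects.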

  • 2020-11-22 01:46

    Through empirical testing and observation, as well as confirming with https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation, here is my version of a Python regex that correctly parses and validates a UK postcode:

    UK_POSTCODE_REGEX = r'(?P<postcode_area>[A-Z]{1,2})(?P<district>(?:[0-9]{1,2})|(?:[0-9][A-Z]))(?P<sector>[0-9])(?P<postcode>[A-Z]{2})'

    This regex is simple and has capture groups. It does not include all of the validations of legal UK postcodes, but only takes into account the letter vs number positions.

    Here is how I would use it in code:

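    import re
    from dataclasses import dataclass
    
    
    @dataclass
    class UKPostcode:
        postcode_area: str
        district: str
        sector: str  # kept as a string: the regex group and the tests below both use strings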
        postcode: str
    
        # https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation
        # Original author of this regex: @jontsai
        # NOTE TO FUTURE DEVELOPER:
        # Verified through empirical testing and observation, as well as confirming with the Wiki article
        # If this regex fails to capture all valid UK postcodes, then I apologize, for I am only human.
        UK_POSTCODE_REGEX = r'(?P<postcode_area>[A-Z]{1,2})(?P<district>(?:[0-9]{1,2})|(?:[0-9][A-Z]))(?P<sector>[0-9])(?P<postcode>[A-Z]{2})'
    
        @classmethod
        def from_postcode(cls, postcode):
            """Parses a string into a UKPostcode
    
            Returns a UKPostcode or None
            """
            # fullmatch, so that trailing characters after the postcode are rejected
            m = re.fullmatch(cls.UK_POSTCODE_REGEX, postcode.replace(' ', ''))
    
            if m:
                uk_postcode = UKPostcode(
                    postcode_area=m.group('postcode_area'),
                    district=m.group('district'),
                    sector=m.group('sector'),
                    postcode=m.group('postcode')
                )
            else:
                uk_postcode = None
    
            return uk_postcode
    
    
    def parse_uk_postcode(postcode):
        """Wrapper for UKPostcode.from_postcode
        """
        uk_postcode = UKPostcode.from_postcode(postcode)
        return uk_postcode
    

    Here are unit tests:

    import pytest
    
    
    @pytest.mark.parametrize(
        'postcode, expected', [
            # https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation
            (
                'EC1A1BB',
                UKPostcode(
                    postcode_area='EC',
                    district='1A',
                    sector='1',
                    postcode='BB'
                ),
            ),
            (
                'W1A0AX',
                UKPostcode(
                    postcode_area='W',
                    district='1A',
                    sector='0',
                    postcode='AX'
                ),
            ),
            (
                'M11AE',
                UKPostcode(
                    postcode_area='M',
                    district='1',
                    sector='1',
                    postcode='AE'
                ),
            ),
            (
                'B338TH',
                UKPostcode(
                    postcode_area='B',
                    district='33',
                    sector='8',
                    postcode='TH'
                )
            ),
            (
                'CR26XH',
                UKPostcode(
                    postcode_area='CR',
                    district='2',
                    sector='6',
                    postcode='XH'
                )
            ),
            (
                'DN551PT',
                UKPostcode(
                    postcode_area='DN',
                    district='55',
                    sector='1',
                    postcode='PT'
                )
            )
        ]
    )
    def test_parse_uk_postcode(postcode, expected):
        uk_postcode = parse_uk_postcode(postcode)
        assert uk_postcode == expected
    
  • 2020-11-22 01:47

    It looks like we're going to be using ^(GIR ?0AA|[A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]([0-9ABEHMNPRV-Y])?)|[0-9][A-HJKPS-UW]) ?[0-9][ABD-HJLNP-UW-Z]{2})$, which is a slightly modified version of that suggested by Minglis above.

    However, we're going to have to investigate exactly what the rules are, as the various solutions listed above appear to apply different rules as to which letters are allowed.

    After some research, we've found some more information. Apparently a page on 'govtalk.gov.uk' points you to a postcode specification, govtalk-postcodes. This in turn points to an XML schema, which provides a 'pseudo regex' statement of the postcode rules.

    We've taken that and worked on it a little to give us the following expression:

    ^((GIR ?0AA)|((([A-PR-UWYZ][A-HK-Y]?[0-9][0-9]?)|(([A-PR-UWYZ][0-9][A-HJKSTUW])|([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRV-Y]))) ?[0-9][ABD-HJLNP-UW-Z]{2}))$
    

    This makes the space optional, but limits you to one space (replace the '?' with '{0,}' for unlimited spaces). It assumes all text is upper-case.

    If you want to allow lower case, with any number of spaces, use:

    ^(([gG][iI][rR] {0,}0[aA]{2})|((([a-pr-uwyzA-PR-UWYZ][a-hk-yA-HK-Y]?[0-9][0-9]?)|(([a-pr-uwyzA-PR-UWYZ][0-9][a-hjkstuwA-HJKSTUW])|([a-pr-uwyzA-PR-UWYZ][a-hk-yA-HK-Y][0-9][abehmnprv-yABEHMNPRV-Y]))) {0,}[0-9][abd-hjlnp-uw-zABD-HJLNP-UW-Z]{2}))$
    

    This doesn't cover overseas territories and only enforces the format, NOT the existence of different areas. It is based on the following rules:

    Can accept the following formats:

    • “GIR 0AA”
    • A9 9ZZ
    • A99 9ZZ
    • AB9 9ZZ
    • AB99 9ZZ
    • A9C 9ZZ
    • AD9E 9ZZ

    Where:

    • 9 can be any single digit number.
    • A can be any letter except for Q, V or X.
    • B can be any letter except for I, J or Z.
    • C can be any letter except for I, L, M, N, O, P, Q, R, V, X, Y or Z.
    • D can be any letter except for I, J or Z.
    • E can be any of A, B, E, H, M, N, P, R, V, W, X or Y.
    • Z can be any letter except for C, I, K, M, O or V.
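
    The rules above can be turned into a pattern mechanically, which makes it easier to see where each character class comes from. The following Python sketch is an illustration only: the variable and function names are my own, and using re.IGNORECASE with at most one optional space is my choice rather than something the schema requires.

```python
import re

# One character class per letter used in the format list above.
A = "[A-PR-UWYZ]"       # any letter except Q, V, X
B = "[A-HK-Y]"          # any letter except I, J, Z
C = "[A-HJKSTUW]"       # any letter except I, L, M, N, O, P, Q, R, V, X, Y, Z
D = "[A-HK-Y]"          # any letter except I, J, Z
E = "[ABEHMNPRVWXY]"    # only these twelve letters
Z = "[ABD-HJLNP-UW-Z]"  # any letter except C, I, K, M, O, V
N = "[0-9]"             # the "9" in the format list

# The six outward-code formats: A9, A99, AB9, AB99, A9C, AD9E.
OUTWARD_FORMS = [
    A + N,
    A + N + N,
    A + B + N,
    A + B + N + N,
    A + N + C,
    A + D + N + E,
]
OUTWARD = "(?:" + "|".join(OUTWARD_FORMS) + ")"
INWARD = N + Z + "{2}"  # the "9ZZ" part

FORMAT_PATTERN = re.compile(
    "(?:GIR ?0AA|" + OUTWARD + " ?" + INWARD + ")",
    re.IGNORECASE,
)


def has_valid_format(postcode: str) -> bool:
    """Check the format rules only; this says nothing about existence."""
    return FORMAT_PATTERN.fullmatch(postcode.strip()) is not None


for candidate in ["EC1A 1BB", "gir 0aa", "W1A 0AX", "QA1 1AA", "M1 1CC"]:
    print(candidate, has_valid_format(candidate))
```

    Here "QA1 1AA" fails on the first letter and "M1 1CC" fails on the last two letters; the other three pass.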

    Best wishes

    Colin

  • 2020-11-22 01:49

    I had a look into some of the answers above and I'd recommend against using the pattern from @Dan's answer (c. Dec 15 '10), since it incorrectly flags almost 0.4% of valid postcodes as invalid, while the others do not.

    Ordnance Survey provide a service called Code Point Open, which:

    contains a list of all the current postcode units in Great Britain

    I ran each of the regexes above against the full list of postcodes (Jul 6 '13) from this data using grep:

    cat CSV/*.csv |
        # Strip leading quotes
        sed -e 's/^"//g' |
        # Strip trailing quote and everything after it
        sed -e 's/".*//g' |
        # Strip any spaces
        sed -E -e 's/ +//g' |
        # Find any lines that do not match the expression
        grep --invert-match --perl-regexp "$pattern"
    

    There are 1,686,202 postcodes total.
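
    For anyone without grep --perl-regexp to hand, roughly the same check can be sketched in Python. This assumes, as the sed commands imply, that the postcode is the first quoted column of each CSV row; the function name count_rejected and the three sample rows (including their dummy numeric columns) are made up for illustration and are not taken from the real data. Python's re module is not PCRE, but for the patterns tested here I'd expect the same results.

```python
import csv
import io
import re


def count_rejected(csv_lines, pattern):
    """Return (total, rejected) for the postcodes in the first CSV column."""
    regex = re.compile(pattern)
    total = 0
    rejected = 0
    for row in csv.reader(csv_lines):
        if not row:
            continue
        postcode = row[0].replace(" ", "")  # strip any spaces, as the sed step does
        total += 1
        # grep matches anywhere in the line, so use search() rather than match()
        if regex.search(postcode) is None:
            rejected += 1
    return total, rejected


# Illustrative rows only; the real files would be opened and passed in instead.
sample = io.StringIO(
    '"EC1A 1BB",10,0\n'
    '"M1 1AE",10,0\n'
    '"NOT A POSTCODE",10,0\n'
)
pattern = (
    r"^(GIR ?0AA|[A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]([0-9ABEHMNPRV-Y])?)"
    r"|[0-9][A-HJKPS-UW]) ?[0-9][ABD-HJLNP-UW-Z]{2})$"
)
print(count_rejected(sample, pattern))
```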

    The following are the numbers of valid postcodes that do not match each $pattern:

    '^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]?[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$'
    # => 6016 (0.36%)
    
    '^(GIR ?0AA|[A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]([0-9ABEHMNPRV-Y])?)|[0-9][A-HJKPS-UW]) ?[0-9][ABD-HJLNP-UW-Z]{2})$'
    # => 0
    
    '^GIR[ ]?0AA|((AB|AL|B|BA|BB|BD|BH|BL|BN|BR|BS|BT|BX|CA|CB|CF|CH|CM|CO|CR|CT|CV|CW|DA|DD|DE|DG|DH|DL|DN|DT|DY|E|EC|EH|EN|EX|FK|FY|G|GL|GY|GU|HA|HD|HG|HP|HR|HS|HU|HX|IG|IM|IP|IV|JE|KA|KT|KW|KY|L|LA|LD|LE|LL|LN|LS|LU|M|ME|MK|ML|N|NE|NG|NN|NP|NR|NW|OL|OX|PA|PE|PH|PL|PO|PR|RG|RH|RM|S|SA|SE|SG|SK|SL|SM|SN|SO|SP|SR|SS|ST|SW|SY|TA|TD|TF|TN|TQ|TR|TS|TW|UB|W|WA|WC|WD|WF|WN|WR|WS|WV|YO|ZE)(\d[\dA-Z]?[ ]?\d[ABD-HJLN-UW-Z]{2}))|BFPO[ ]?\d{1,4}$'
    # => 0
    

    Of course, these results only deal with valid postcodes that are incorrectly flagged as invalid. So:

    '^.*$'
    # => 0
    

    I'm saying nothing about which pattern is the best regarding filtering out invalid postcodes.

  • 2020-11-22 01:49

    Postcodes are subject to change, and the only true way of validating a postcode is to have the complete list of postcodes and see if it's there.
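
    A minimal sketch of that list-based check in Python, assuming you have the postcodes available as one string per entry (how you obtain and refresh the list is out of scope here, and the function names and the tiny sample list are purely illustrative):

```python
def normalise(postcode: str) -> str:
    """Remove all whitespace and upper-case, so 'ec1a 1bb' equals 'EC1A1BB'."""
    return "".join(postcode.split()).upper()


def load_known_postcodes(lines):
    """Build a set of normalised postcodes from an iterable of strings."""
    return {normalise(line) for line in lines if line.strip()}


def is_known_postcode(postcode: str, known) -> bool:
    """Exact membership test against the loaded list."""
    return normalise(postcode) in known


# Illustrative list only; a real one would be loaded from a full data set.
known = load_known_postcodes(["EC1A 1BB", "W1A 0AX", "M1 1AE", ""])
print(is_known_postcode("ec1a 1bb", known))
print(is_known_postcode("ZZ99 9ZZ", known))
```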

    But regular expressions are useful because they:

    • are easy to use and implement
    • are short
    • are quick to run
    • are quite easy to maintain (compared to a full list of postcodes)
    • still catch most input errors

    But regular expressions also tend to be difficult to maintain, especially for someone who didn't write them in the first place. So a pattern must be:

    • as easy to understand as possible
    • relatively future proof

    That means that most of the regular expressions in this answer aren't good enough. E.g. I can see that [A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRV-Y] is going to match a postcode area of the form AA1A — but it's going to be a pain in the neck if and when a new postcode area gets added, because it's difficult to understand which postcode areas it matches.

    I also want my regular expression to match the first and second half of the postcode as parenthesised matches.

    So I've come up with this:

    (GIR(?=\s*0AA)|(?:[BEGLMNSW]|[A-Z]{2})[0-9](?:[0-9]|(?<=N1|E1|SE1|SW1|W1|NW1|EC[0-9]|WC[0-9])[A-HJ-NP-Z])?)\s*([0-9][ABD-HJLNP-UW-Z]{2})
    

    In PCRE format it can be written as follows:

    /^
      ( GIR(?=\s*0AA) # Match the special postcode "GIR 0AA"
        |
        (?:
          [BEGLMNSW] | # There are 8 single-letter postcode areas
          [A-Z]{2}     # All other postcode areas have two letters
          )
        [0-9] # There is always at least one number after the postcode area
        (?:
          [0-9] # And an optional extra number
          |
          # Only certain postcode areas can have an extra letter after the number
          (?<=N1|E1|SE1|SW1|W1|NW1|EC[0-9]|WC[0-9])
          [A-HJ-NP-Z] # Possible letters here may change, but [IO] will never be used
          )?
        )
      \s*
      ([0-9][ABD-HJLNP-UW-Z]{2}) # The last two letters cannot be [CIKMOV]
    $/x
    

    For me this is the right balance between validating as much as possible, while at the same time future-proofing and allowing for easy maintenance.
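
    One caveat if you want to use this from Python: the standard re module only accepts fixed-width lookbehinds, so the single lookbehind with alternatives of different lengths is rejected when the pattern is compiled. A possible workaround, which is my adaptation and not part of the pattern as written above, is to give each prefix its own lookbehind and put the alternation outside:

```python
import re

UK_POSTCODE = re.compile(
    r"""
    ^
    ( GIR(?=\s*0AA)                # the special postcode "GIR 0AA"
      |
      (?: [BEGLMNSW] | [A-Z]{2} )  # one- or two-letter postcode area
      [0-9]                        # at least one digit
      (?:
        [0-9]                      # optional second digit
        |
        # One fixed-width lookbehind per prefix, instead of a single
        # lookbehind containing alternatives of different lengths.
        (?: (?<=N1) | (?<=E1) | (?<=SE1) | (?<=SW1) | (?<=W1)
          | (?<=NW1) | (?<=EC[0-9]) | (?<=WC[0-9]) )
        [A-HJ-NP-Z]
      )?
    )
    \s*
    ( [0-9][ABD-HJLNP-UW-Z]{2} )   # inward code; last two letters exclude CIKMOV
    $
    """,
    re.VERBOSE,
)

for candidate in ["EC1A 1BB", "W1A 0AX", "GIR 0AA", "M1 1AE", "M1A 1AE"]:
    m = UK_POSTCODE.match(candidate)
    print(candidate, m.groups() if m else None)
```

    The two halves come back as groups 1 and 2, and "M1A 1AE" is rejected because M1 is not one of the prefixes allowed to take a trailing letter.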
