I\'m after a regex that will validate a full complex UK postcode only within an input string. All of the uncommon postcode forms must be covered as well as the usual. For in
According to this Wikipedia table
This pattern cover all the cases
(?:[A-Za-z]\d ?\d[A-Za-z]{2})|(?:[A-Za-z][A-Za-z\d]\d ?\d[A-Za-z]{2})|(?:[A-Za-z]{2}\d{2} ?\d[A-Za-z]{2})|(?:[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]{2})|(?:[A-Za-z]{2}\d[A-Za-z] ?\d[A-Za-z]{2})
When using it on Android\Java use \\d
We were given a spec:
UK postcodes must be in one of the following forms (with one exception, see below): § A9 9AA § A99 9AA § AA9 9AA § AA99 9AA § A9A 9AA § AA9A 9AA where A represents an alphabetic character and 9 represents a numeric character. Additional rules apply to alphabetic characters, as follows: § The character in position 1 may not be Q, V or X § The character in position 2 may not be I, J or Z § The character in position 3 may not be I, L, M, N, O, P, Q, R, V, X, Y or Z § The character in position 4 may not be C, D, F, G, I, J, K, L, O, Q, S, T, U or Z § The characters in the rightmost two positions may not be C, I, K, M, O or V The one exception that does not follow these general rules is the postcode "GIR 0AA", which is a special valid postcode.
We came up with this:
/^([A-PR-UWYZ][A-HK-Y0-9](?:[A-HJKS-UW0-9][ABEHMNPRV-Y0-9]?)?\s*[0-9][ABD-HJLNP-UW-Z]{2}|GIR\s*0AA)$/i
But note - this allows any number of spaces in between groups.
Through empirical testing and observation, as well as confirming with https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation, here is my version of a Python regex that correctly parses and validates a UK postcode:
UK_POSTCODE_REGEX = r'(?P<postcode_area>[A-Z]{1,2})(?P<district>(?:[0-9]{1,2})|(?:[0-9][A-Z]))(?P<sector>[0-9])(?P<postcode>[A-Z]{2})'
This regex is simple and has capture groups. It does not include all of the validations of legal UK postcodes, but only takes into account the letter vs number positions.
Here is how I would use it in code:
@dataclass
class UKPostcode:
postcode_area: str
district: str
sector: int
postcode: str
# https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation
# Original author of this regex: @jontsai
# NOTE TO FUTURE DEVELOPER:
# Verified through empirical testing and observation, as well as confirming with the Wiki article
# If this regex fails to capture all valid UK postcodes, then I apologize, for I am only human.
UK_POSTCODE_REGEX = r'(?P<postcode_area>[A-Z]{1,2})(?P<district>(?:[0-9]{1,2})|(?:[0-9][A-Z]))(?P<sector>[0-9])(?P<postcode>[A-Z]{2})'
@classmethod
def from_postcode(cls, postcode):
"""Parses a string into a UKPostcode
Returns a UKPostcode or None
"""
m = re.match(cls.UK_POSTCODE_REGEX, postcode.replace(' ', ''))
if m:
uk_postcode = UKPostcode(
postcode_area=m.group('postcode_area'),
district=m.group('district'),
sector=m.group('sector'),
postcode=m.group('postcode')
)
else:
uk_postcode = None
return uk_postcode
def parse_uk_postcode(postcode):
"""Wrapper for UKPostcode.from_postcode
"""
uk_postcode = UKPostcode.from_postcode(postcode)
return uk_postcode
Here are unit tests:
@pytest.mark.parametrize(
'postcode, expected', [
# https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom#Validation
(
'EC1A1BB',
UKPostcode(
postcode_area='EC',
district='1A',
sector='1',
postcode='BB'
),
),
(
'W1A0AX',
UKPostcode(
postcode_area='W',
district='1A',
sector='0',
postcode='AX'
),
),
(
'M11AE',
UKPostcode(
postcode_area='M',
district='1',
sector='1',
postcode='AE'
),
),
(
'B338TH',
UKPostcode(
postcode_area='B',
district='33',
sector='8',
postcode='TH'
)
),
(
'CR26XH',
UKPostcode(
postcode_area='CR',
district='2',
sector='6',
postcode='XH'
)
),
(
'DN551PT',
UKPostcode(
postcode_area='DN',
district='55',
sector='1',
postcode='PT'
)
)
]
)
def test_parse_uk_postcode(postcode, expected):
uk_postcode = parse_uk_postcode(postcode)
assert(uk_postcode == expected)
It looks like we're going to be using ^(GIR ?0AA|[A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]([0-9ABEHMNPRV-Y])?)|[0-9][A-HJKPS-UW]) ?[0-9][ABD-HJLNP-UW-Z]{2})$
, which is a slightly modified version of that sugested by Minglis above.
However, we're going to have to investigate exactly what the rules are, as the various solutions listed above appear to apply different rules as to which letters are allowed.
After some research, we've found some more information. Apparently a page on 'govtalk.gov.uk' points you to a postcode specification govtalk-postcodes. This points to an XML schema at XML Schema which provides a 'pseudo regex' statement of the postcode rules.
We've taken that and worked on it a little to give us the following expression:
^((GIR &0AA)|((([A-PR-UWYZ][A-HK-Y]?[0-9][0-9]?)|(([A-PR-UWYZ][0-9][A-HJKSTUW])|([A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRV-Y]))) &[0-9][ABD-HJLNP-UW-Z]{2}))$
This makes spaces optional, but does limit you to one space (replace the '&' with '{0,} for unlimited spaces). It assumes all text must be upper-case.
If you want to allow lower case, with any number of spaces, use:
^(([gG][iI][rR] {0,}0[aA]{2})|((([a-pr-uwyzA-PR-UWYZ][a-hk-yA-HK-Y]?[0-9][0-9]?)|(([a-pr-uwyzA-PR-UWYZ][0-9][a-hjkstuwA-HJKSTUW])|([a-pr-uwyzA-PR-UWYZ][a-hk-yA-HK-Y][0-9][abehmnprv-yABEHMNPRV-Y]))) {0,}[0-9][abd-hjlnp-uw-zABD-HJLNP-UW-Z]{2}))$
This doesn't cover overseas territories and only enforces the format, NOT the existence of different areas. It is based on the following rules:
Can accept the following formats:
Where:
Best wishes
Colin
I had a look into some of the answers above and I'd recommend against using the pattern from @Dan's answer (c. Dec 15 '10), since it incorrectly flags almost 0.4% of valid postcodes as invalid, while the others do not.
Ordnance Survey provide service called Code Point Open which:
contains a list of all the current postcode units in Great Britain
I ran each of the regexs above against the full list of postcodes (Jul 6 '13) from this data using grep
:
cat CSV/*.csv |
# Strip leading quotes
sed -e 's/^"//g' |
# Strip trailing quote and everything after it
sed -e 's/".*//g' |
# Strip any spaces
sed -E -e 's/ +//g' |
# Find any lines that do not match the expression
grep --invert-match --perl-regexp "$pattern"
There are 1,686,202 postcodes total.
The following are the numbers of valid postcodes that do not match each $pattern
:
'^([A-PR-UWYZ0-9][A-HK-Y0-9][AEHMNPRTVXY0-9]?[ABEHMNPRVWXY0-9]?[0-9][ABD-HJLN-UW-Z]{2}|GIR 0AA)$'
# => 6016 (0.36%)
'^(GIR ?0AA|[A-PR-UWYZ]([0-9]{1,2}|([A-HK-Y][0-9]([0-9ABEHMNPRV-Y])?)|[0-9][A-HJKPS-UW]) ?[0-9][ABD-HJLNP-UW-Z]{2})$'
# => 0
'^GIR[ ]?0AA|((AB|AL|B|BA|BB|BD|BH|BL|BN|BR|BS|BT|BX|CA|CB|CF|CH|CM|CO|CR|CT|CV|CW|DA|DD|DE|DG|DH|DL|DN|DT|DY|E|EC|EH|EN|EX|FK|FY|G|GL|GY|GU|HA|HD|HG|HP|HR|HS|HU|HX|IG|IM|IP|IV|JE|KA|KT|KW|KY|L|LA|LD|LE|LL|LN|LS|LU|M|ME|MK|ML|N|NE|NG|NN|NP|NR|NW|OL|OX|PA|PE|PH|PL|PO|PR|RG|RH|RM|S|SA|SE|SG|SK|SL|SM|SN|SO|SP|SR|SS|ST|SW|SY|TA|TD|TF|TN|TQ|TR|TS|TW|UB|W|WA|WC|WD|WF|WN|WR|WS|WV|YO|ZE)(\d[\dA-Z]?[ ]?\d[ABD-HJLN-UW-Z]{2}))|BFPO[ ]?\d{1,4}$'
# => 0
Of course, these results only deal with valid postcodes that are incorrectly flagged as invalid. So:
'^.*$'
# => 0
I'm saying nothing about which pattern is the best regarding filtering out invalid postcodes.
Postcodes are subject to change, and the only true way of validating a postcode is to have the complete list of postcodes and see if it's there.
But regular expressions are useful because they:
But regular expressions tend to be difficult to maintain, especially for someone who didn't come up with it in the first place. So it must be:
That means that most of the regular expressions in this answer aren't good enough. E.g. I can see that [A-PR-UWYZ][A-HK-Y][0-9][ABEHMNPRV-Y]
is going to match a postcode area of the form AA1A — but it's going to be a pain in the neck if and when a new postcode area gets added, because it's difficult to understand which postcode areas it matches.
I also want my regular expression to match the first and second half of the postcode as parenthesised matches.
So I've come up with this:
(GIR(?=\s*0AA)|(?:[BEGLMNSW]|[A-Z]{2})[0-9](?:[0-9]|(?<=N1|E1|SE1|SW1|W1|NW1|EC[0-9]|WC[0-9])[A-HJ-NP-Z])?)\s*([0-9][ABD-HJLNP-UW-Z]{2})
In PCRE format it can be written as follows:
/^
( GIR(?=\s*0AA) # Match the special postcode "GIR 0AA"
|
(?:
[BEGLMNSW] | # There are 8 single-letter postcode areas
[A-Z]{2} # All other postcode areas have two letters
)
[0-9] # There is always at least one number after the postcode area
(?:
[0-9] # And an optional extra number
|
# Only certain postcode areas can have an extra letter after the number
(?<=N1|E1|SE1|SW1|W1|NW1|EC[0-9]|WC[0-9])
[A-HJ-NP-Z] # Possible letters here may change, but [IO] will never be used
)?
)
\s*
([0-9][ABD-HJLNP-UW-Z]{2}) # The last two letters cannot be [CIKMOV]
$/x
For me this is the right balance between validating as much as possible, while at the same time future-proofing and allowing for easy maintenance.