I\'m after a regex that will validate a full complex UK postcode only within an input string. All of the uncommon postcode forms must be covered as well as the usual. For in
I recently posted an answer to this question on UK postcodes for the R language. I discovered that the UK Government's regex pattern is incorrect and fails to properly validate some postcodes. Unfortunately, many of the answers here are based on this incorrect pattern.
I'll outline some of these issues below and provide a revised regular expression that actually works.
My answer (and regular expressions in general):
If you don't care about the bad regex and just want to skip to the answer, scroll down to the Answer section.
The regular expressions in this section should not be used.
This is the failing regex that the UK government has provided developers (not sure how long this link will be up, but you can see it in their Bulk Data Transfer documentation):
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))[0-9][A-Za-z]{2})$
See regex in use here.
As many developers likely do, they copy/paste code (especially regular expressions) and paste them expecting them to work. While this is great in theory, it fails in this particular case because copy/pasting from this document actually changes one of the characters (a space) into a newline character as shown below:
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))
[0-9][A-Za-z]{2})$
The first thing most developers will do is just erase the newline without thinking twice. Now the regex won't match postcodes with spaces in them (other than the GIR 0AA
postcode).
To fix this issue, the newline character should be replaced with the space character:
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
^
See regex in use here.
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
^^ ^ ^ ^^
The postcode regex improperly anchors the regex. Anyone using this regex to validate postcodes might be surprised if a value like fooA11 1AA
gets through. That's because they've anchored the start of the first option and the end of the second option (independently of one another), as pointed out in the regex above.
What this means is that ^
(asserts position at start of the line) only works on the first option ([Gg][Ii][Rr] 0[Aa]{2})
, so the second option will validate any strings that end in a postcode (regardless of what comes before).
Similarly, the first option isn't anchored to the end of the line $
, so GIR 0AAfoo
is also accepted.
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))[0-9][A-Za-z]{2})$
To fix this issue, both options should be wrapped in another group (or non-capturing group) and the anchors placed around that:
^(([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))$
^^ ^^
See regex in use here.
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
^^
The regex is missing a -
here to indicate a range of characters. As it stands, if a postcode is in the format ANA NAA
(where A
represents a letter and N
represents a number), and it begins with anything other than A
or Z
, it will fail.
That means it will match A1A 1AA
and Z1A 1AA
, but not B1A 1AA
.
To fix this issue, the character -
should be placed between the A
and Z
in the respective character set:
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
^
See regex in use here.
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
^
I swear they didn't even test this thing before publicizing it on the web. They made the wrong character set optional. They made [0-9]
option in the fourth sub-option of option 2 (group 9). This allows the regex to match incorrectly formatted postcodes like AAA 1AA
.
To fix this issue, make the next character class optional instead (and subsequently make the set [0-9]
match exactly once):
^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?)))) [0-9][A-Za-z]{2})$
^
Performance on this regex is extremely poor. First off, they placed the least likely pattern option to match GIR 0AA
at the beginning. How many users will likely have this postcode versus any other postcode; probably never? This means every time the regex is used, it must exhaust this option first before proceeding to the next option. To see how performance is impacted check the number of steps the original regex took (35) against the same regex after having flipped the options (22).
The second issue with performance is due to the way the entire regex is structured. There's no point backtracking over each option if one fails. The way the current regex is structured can greatly be simplified. I provide a fix for this in the Answer section.
See regex in use here
This may not be considered a problem, per se, but it does raise concern for most developers. The spaces in the regex are not optional, which means the users inputting their postcodes must place a space in the postcode. This is an easy fix by simply adding ?
after the spaces to render them optional. See the Answer section for a fix.
Fixing all the issues outlined in the Problems section and simplifying the pattern yields the following, shorter, more concise pattern. We can also remove most of the groups since we're validating the postcode as a whole (not individual parts):
See regex in use here
^([A-Za-z][A-Ha-hJ-Yj-y]?[0-9][A-Za-z0-9]? ?[0-9][A-Za-z]{2}|[Gg][Ii][Rr] ?0[Aa]{2})$
This can further be shortened by removing all of the ranges from one of the cases (upper or lower case) and using a case-insensitive flag. Note: Some languages don't have one, so use the longer one above. Each language implements the case-insensitivity flag differently.
See regex in use here.
^([A-Z][A-HJ-Y]?[0-9][A-Z0-9]? ?[0-9][A-Z]{2}|GIR ?0A{2})$
Shorter again replacing [0-9]
with \d
(if your regex engine supports it):
See regex in use here.
^([A-Z][A-HJ-Y]?\d[A-Z\d]? ?\d[A-Z]{2}|GIR ?0A{2})$
Without ensuring specific alphabetic characters, the following can be used (keep in mind the simplifications from 1. Fixing the UK Government's Regex have also been applied here):
See regex in use here.
^([A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}|GIR ?0A{2})$
And even further if you don't care about the special case GIR 0AA
:
^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$
I would not suggest over-verification of a postcode as new Areas, Districts and Sub-districts may appear at any point in time. What I will suggest potentially doing, is added support for edge-cases. Some special cases exist and are outlined in this Wikipedia article.
Here are complex regexes that include the subsections of 3. (3.1, 3.2, 3.3).
In relation to the patterns in 1. Fixing the UK Government's Regex:
See regex in use here
^(([A-Z][A-HJ-Y]?\d[A-Z\d]?|ASCN|STHL|TDCU|BBND|[BFS]IQQ|PCRN|TKCA) ?\d[A-Z]{2}|BFPO ?\d{1,4}|(KY\d|MSR|VG|AI)[ -]?\d{4}|[A-Z]{2} ?\d{2}|GE ?CX|GIR ?0A{2}|SAN ?TA1)$
And in relation to 2. Simplified Patterns:
See regex in use here
^(([A-Z]{1,2}\d[A-Z\d]?|ASCN|STHL|TDCU|BBND|[BFS]IQQ|PCRN|TKCA) ?\d[A-Z]{2}|BFPO ?\d{1,4}|(KY\d|MSR|VG|AI)[ -]?\d{4}|[A-Z]{2} ?\d{2}|GE ?CX|GIR ?0A{2}|SAN ?TA1)$
The Wikipedia article currently states (some formats slightly simplified):
AI-1111
: AnguilaASCN 1ZZ
: Ascension IslandSTHL 1ZZ
: Saint HelenaTDCU 1ZZ
: Tristan da CunhaBBND 1ZZ
: British Indian Ocean TerritoryBIQQ 1ZZ
: British Antarctic TerritoryFIQQ 1ZZ
: Falkland IslandsGX11 1ZZ
: GibraltarPCRN 1ZZ
: Pitcairn IslandsSIQQ 1ZZ
: South Georgia and the South Sandwich IslandsTKCA 1ZZ
: Turks and Caicos IslandsBFPO 11
: Akrotiri and DhekeliaZZ 11
& GE CX
: Bermuda (according to this document)KY1-1111
: Cayman Islands (according to this document)VG1111
: British Virgin Islands (according to this document)MSR 1111
: Montserrat (according to this document)An all-encompassing regex to match only the British Overseas Territories might look like this:
See regex in use here.
^((ASCN|STHL|TDCU|BBND|[BFS]IQQ|GX\d{2}|PCRN|TKCA) ?\d[A-Z]{2}|(KY\d|MSR|VG|AI)[ -]?\d{4}|(BFPO|[A-Z]{2}) ?\d{2}|GE ?CX)$
Although they've been recently changed it to better align with the British postcode system to BF#
(where #
represents a number), they're considered optional alternative postcodes. These postcodes follow(ed) the format of BFPO
, followed by 1-4 digits:
See regex in use here
^BFPO ?\d{1,4}$
There's another special case with Santa (as mentioned in other answers): SAN TA1
is a valid postcode. A regex for this is very simply:
^SAN ?TA1$