Good evening,
I'm trying to split a German address string into its parts in Java. Does anyone know a regex or a library to do this? To split it like t
I’d start from the back since, as far as I know, a city name cannot contain digits, although it can contain spaces (the first example I’ve found: “Weil der Stadt”). The five-digit number before the city must then be the zip code.
The number before that (possibly followed by a single letter) is the house number. Note that this can also be a range, such as “23-25”. Anything before that is the street name.
Anyway, here we go:
^((?:\p{L}| |\d|\.|-)+?) (\d+(?: ?- ?\d+)? *[a-zA-Z]?) (\d{5}) ((?:\p{L}| |-)+)(?: *\(([^\)]+)\))?$
This correctly parses even arcane addresses such as “Straße des 17. Juni 23-25 a 12345 Berlin-Mitte”.
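To make that claim easy to check, here is a minimal sketch that runs the expression above through java.util.regex and reads the parts back by group number. The class and method names (AddressSplit, split) are made up for illustration, and the only input I have traced through it is the example address.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressSplit {
    // The expression from above, with every backslash doubled for a Java string literal.
    private static final Pattern ADDRESS = Pattern.compile(
        "^((?:\\p{L}| |\\d|\\.|-)+?) (\\d+(?: ?- ?\\d+)? *[a-zA-Z]?) (\\d{5}) "
        + "((?:\\p{L}| |-)+)(?: *\\(([^\\)]+)\\))?$");

    // Returns {street, house number, zip code, city, parenthesised suffix or null},
    // or null if the input does not match at all.
    static String[] split(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return null;
        }
        return new String[] {
            m.group(1), m.group(2), m.group(3), m.group(4), m.group(5)
        };
    }

    public static void main(String[] args) {
        // The sharp s is written as a Unicode escape so the file compiles
        // regardless of the source encoding.
        String[] parts = split("Stra\u00dfe des 17. Juni 23-25 a 12345 Berlin-Mitte");
        System.out.println(Arrays.toString(parts));
    }
}
```

The lazy quantifier on the first group is what keeps “17.” in the street name: the engine only stops extending the street once everything after it fits the number, zip code and city parts.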
Note that this doesn’t work with address extensions (such as “Gartenhaus” or “c/o …”). I have no clue how to handle those. I rather doubt that there’s a viable regular expression to express all this.
As you can see, this is quite a complex regular expression with lots of capture groups. If I were to use such an expression in code, I would use named captures (Java 7 supports them) and break the expression up into smaller morsels using the x flag. Java does have that flag (Pattern.COMMENTS, or (?x) inline), but Java 7 has neither raw nor multi-line string literals, so the pattern still has to be glued together from escaped string fragments. This s*cks because it makes complex regular expressions much harder to read than they need to be.
Still, here’s a somewhat more legible regular expression:
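^
# group names (street, number, zip, city, suffix) are placeholders
(?<street>(?:\p{L}|\ |\d|\.|-)+?)
\ (?<number>\d+(?:\ ?-\ ?\d+)?\ *[a-zA-Z]?)
\ (?<zip>\d{5})
\ (?<city>(?:\p{L}|\ |-)+)
(?:\ *\((?<suffix>[^\)]+)\))?
$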
In Java 7, the closest we can achieve is this (untested; may contain typos):
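// Group names (street, number, zip, city, suffix) are placeholders.
String pattern =
"^" +
"(?<street>(?:\\p{L}| |\\d|\\.|-)+?) " +
"(?<number>\\d+(?: ?- ?\\d+)? *[a-zA-Z]?) " +
"(?<zip>\\d{5}) " +
"(?<city>(?:\\p{L}| |-)+)" +
"(?: *\\((?<suffix>[^\\)]+)\\))?" +
"$";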