Regex for splitting a German address into its parts

孤街浪徒 2021-02-09 18:09

Good evening,

I'm trying to split a German address string into its parts in Java. Does anyone know a regex or a library to do this? To split it like t

6 answers
  • 2021-02-09 18:25

    I ran into a similar problem, tweaked the solutions provided here a little, and arrived at the following pattern. It also works but is (imo) a bit simpler to understand and to extend:

    /^([a-zäöüß\s\d.,-]+?)\s*([\d\s]+(?:\s?[-|+/]\s?\d+)?\s*[a-z]?)?\s*(\d{5})\s*(.+)?$/i
    

    Here are some example matches.

    It can also handle missing street numbers and is easily extensible by adding special characters to the character classes.

    ([a-zäöüß\s\d.,-]+?)                        # Street name (lazy)
    ([\d\s]+(?:\s?[-|+/]\s?\d+)?\s*[a-z]?)?     # Street number (optional)
    

    After that comes the zip code, which is the only mandatory part because it is the only part with a fixed format (five digits). Everything after the zip code is treated as the city name.
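
    Since the question asks for Java, here is a minimal sketch of how the pattern above could be used there. The class and method names are made up for illustration, and the sample input is only an assumption about what the addresses look like. The `i` flag is expressed as `CASE_INSENSITIVE | UNICODE_CASE` so that the umlauts also match case-insensitively.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class LazyStreetAddressDemo {
    // The pattern from this answer as a Java string. The umlauts and the sharp s
    // are written as Unicode escapes so the file compiles with any source encoding.
    private static final Pattern ADDRESS = Pattern.compile(
        "^([a-z\u00e4\u00f6\u00fc\u00df\\s\\d.,-]+?)\\s*"      // street name (lazy)
        + "([\\d\\s]+(?:\\s?[-|+/]\\s?\\d+)?\\s*[a-z]?)?\\s*"  // street number (optional)
        + "(\\d{5})\\s*"                                        // zip code
        + "(.+)?$",                                             // city
        Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);

    /** Returns {street, number, zip, city}, or null if the input does not match. */
    static String[] split(String address) {
        Matcher m = ADDRESS.matcher(address.trim());
        if (!m.matches()) {
            return null;
        }
        String[] parts = new String[4];
        for (int i = 0; i < 4; i++) {
            String g = m.group(i + 1);
            // Optional groups that did not participate are null; the number
            // group can carry a trailing space, so every part is trimmed.
            parts[i] = (g == null) ? "" : g.trim();
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String part : split("Name der Stra\u00dfe 25a 88489 Teststadt")) {
            System.out.println(part);
        }
    }
}
```

    A missing street number comes back as an empty string here rather than `null`, which is just one possible convention.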

  • 2021-02-09 18:31

    Try this:

    ^[^\d]+[\d\w]+(\s)\d+(\s).*$
    

    It captures the whitespace characters that delimit the sections of the address.

    Or this one, which gives you a group for each of the four address parts:

    ^([^\d]+)([\d\w]+)\s(\d+)\s(.*)$
    

    I don't know Java, so I'm not sure of the exact code for working with the captured groups.
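
    For the Java side, a rough sketch of reading the four groups of the second pattern could look like this. The class name, the method name and the sample input are assumptions for illustration. Note that the first group swallows the space before the house number, so it is trimmed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class FourGroupAddressDemo {
    // The second pattern from this answer, with backslashes doubled for Java.
    private static final Pattern ADDRESS =
        Pattern.compile("^([^\\d]+)([\\d\\w]+)\\s(\\d+)\\s(.*)$");

    /** Returns {street, number, zip, city}, or null if the input does not match. */
    static String[] split(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return null;
        }
        // Group 1 is greedy and ends with the space before the house number.
        return new String[] { m.group(1).trim(), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", split("Name der Strase 25a 88489 Teststadt")));
    }
}
```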

  • 2021-02-09 18:33
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public class AddressSplit {
        public static void main(String[] args) {
            String data = "Name der Strase 25a 88489 Teststadt";
            // Groups: street name, house number, zip code, city (ASCII letters only)
            String regexp = "([ a-zA-Z]+) ([\\w]+) (\\d+) ([a-zA-Z]+)";
    
            Pattern pattern = Pattern.compile(regexp);
            Matcher matcher = pattern.matcher(data);
    
            if (matcher.find()) {
                // Group 0 is the whole match; groups 1 to 4 are the address parts
                for (int i = 0; i <= matcher.groupCount(); i++) {
                    System.out.println(matcher.group(i));
                }
            } else {
                System.out.println("nothing found");
            }
        }
    }
    

    I expect it won't handle German umlauts, but you can fix that yourself. Anyway, it's a good starting point.

    I recommend visiting this; it's a great site about regular expressions. Good luck!

  • 2021-02-09 18:38

    At first glance it looks like a simple split on whitespace would do it. Looking closer, however, the address always has 4 parts, and the first part can contain whitespace.

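    What I would do is something like this:

    import java.util.Arrays;

    String[] split = addressString.trim().split("\\s+");
    int last = split.length - 1;                 // assumes at least 4 tokens
    String[] address = new String[4];
    address[3] = split[last];                    // city
    address[2] = split[last - 1];                // zip code
    address[1] = split[last - 2];                // house number
    // Everything before the house number is the street name.
    address[0] = String.join(" ", Arrays.copyOfRange(split, 0, last - 2));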
    

    However, this will only handle one form of address. If addresses are written in multiple ways, it could get much trickier.

  • 2021-02-09 18:41

    Here is my suggestion, which could be fine-tuned further, e.g. to allow missing parts.

    Regex Pattern:

    ^([^0-9]+) ([0-9]+.*?) ([0-9]{5}) (.*)$
    
    • Group 1: Street
    • Group 2: House no.
    • Group 3: ZIP
    • Group 4: City
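
    A small sketch of how the four groups listed above could be read in Java. The class name, the method name and the sample input are assumptions for illustration, not part of the suggestion itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class SimpleAddressDemo {
    // Group 1: street, group 2: house no., group 3: ZIP, group 4: city.
    private static final Pattern ADDRESS =
        Pattern.compile("^([^0-9]+) ([0-9]+.*?) ([0-9]{5}) (.*)$");

    /** Returns {street, houseNo, zip, city}, or null if the input does not match. */
    static String[] split(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", split("Name der Strase 25a 88489 Teststadt")));
    }
}
```

    Because group 1 excludes digits, a street name that itself contains a number is cut at the first digit: for "Straße des 17. Juni 23-25 a 12345 Berlin-Mitte" the street becomes "Straße des" and the house number becomes "17. Juni 23-25 a".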
  • 2021-02-09 18:46

    I’d start from the back since, as far as I know, a city name cannot contain numbers (but it can contain spaces; the first example I’ve found is “Weil der Stadt”). Then the five-digit number before that must be the zip code.

    The number (possibly followed by a single letter) before that is the street number. Note that this can also be a range. Anything before that is the street name.

    Anyway, here we go:

    ^((?:\p{L}| |\d|\.|-)+?) (\d+(?: ?- ?\d+)? *[a-zA-Z]?) (\d{5}) ((?:\p{L}| |-)+)(?: *\(([^\)]+)\))?$
    

    This correctly parses even arcane addresses such as “Straße des 17. Juni 23-25 a 12345 Berlin-Mitte”.

    Note that this doesn’t work with address extensions (such as “Gartenhaus” or “c/o …”). I have no clue how to handle those. I rather doubt that there’s a viable regular expression to express all this.

    As you can see, this is quite a complex regular expression with lots of capture groups. If I were to use such an expression in code, I would use named captures (Java 7 supports them) and break the expression up into smaller morsels using the x flag (Pattern.COMMENTS in Java). Unfortunately, Java string literals make this awkward: every backslash has to be doubled and the pattern has to be concatenated line by line. This s*cks because it effectively renders complex regular expressions unreadable.

    Still, here’s a somewhat more legible regular expression:

    ^
    (?<street>(?:\p{L}|\ |\d|\.|-)+?)\ 
    (?<number>\d+(?:\ ?-\ ?\d+)?\ *[a-zA-Z]?)\ 
    (?<zip>\d{5})\ 
    (?<city>(?:\p{L}|\ |-)+)
    (?:\ *\((?<suffix>[^\)]+)\))?
    $
    

    In Java 7, the closest we can achieve is this (untested; may contain typos):

    String pattern =
        "^" +
        "(?<street>(?:\\p{L}| |\\d|\\.|-)+?) " +
        "(?<number>\\d+(?: ?- ?\\d+)? *[a-zA-Z]?) " +
        "(?<zip>\\d{5}) " +
        "(?<city>(?:\\p{L}| |-)+)" +
        "(?: *\\((?<suffix>[^\\)]+)\\))?" +
        "$";
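
    To show how the named groups would be read back, here is a minimal sketch that compiles the same pattern and collects the groups by name. The class and method names are made up for illustration. `Matcher.group(String)` is available from Java 7 on.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class NamedGroupAddressDemo {
    private static final Pattern ADDRESS = Pattern.compile(
        "^"
        + "(?<street>(?:\\p{L}| |\\d|\\.|-)+?) "
        + "(?<number>\\d+(?: ?- ?\\d+)? *[a-zA-Z]?) "
        + "(?<zip>\\d{5}) "
        + "(?<city>(?:\\p{L}| |-)+)"
        + "(?: *\\((?<suffix>[^\\)]+)\\))?"
        + "$");

    /** Returns the address parts keyed by group name, or null if there is no match. */
    static Map<String, String> parse(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> parts = new LinkedHashMap<>();
        for (String name : new String[] {"street", "number", "zip", "city", "suffix"}) {
            // The optional suffix group yields null when it is absent.
            parts.put(name, m.group(name));
        }
        return parts;
    }

    public static void main(String[] args) {
        // Sharp s written as a Unicode escape to stay independent of source encoding.
        System.out.println(parse("Stra\u00dfe des 17. Juni 23-25 a 12345 Berlin-Mitte"));
    }
}
```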
    