How do I find duplicate addresses in a database, or better stop people already when filling in the form ? I guess the earlier the better?
Is there any good way of abstra
The earlier you can stop people, the easier it'll be in the long run!
Not being too familiar with your db schema or data entry form, I'd suggest a route something like the following:
have distinct fields in your db for each address "part", e.g. street, city, postal code, Länder, etc.
have your data entry form broken down similarly, e.g. street, city, etc
The reasoning behind the above is that each part will likely have it's own particular "rules" for checking slightly-changed addressed, ("Quellenstrasse"->"Quellenstr.", "66/11"->"66a-11" above) so your validation code can check if the values as presented for each field exist in their respective db field. If not, you can have a class that applies the transformation rules for each given field (e.g. "strasse" stemmed to "str") and checks again for duplicates.
Obviously the above method has it's drawbacks:
it can be slow, depending on your data set, leaving the user waiting
users may try to get around it by putting address "Parts" in the wrong fields (appending post code to city, etc). but from experience we've found that introducing even simple checking like the above will prevent a large percentage of users from entering pre-existing addresses.
Once you've the basic checking in place, you can look at optimising the db accesses required, refining the rules, etc to meet your particular schema. You might also take a look at MySQL's match() function for working out similar text.
Before you start searching for duplicate addresses in your database, you should first make sure you store the addresses in a standard format.
Most countries have a standard way of formatting addresses, in the US it's the USPS CASS system: http://www.usps.com/ncsc/addressservices/certprograms/cass.htm
But most other countries have a similar service/standard. Try this site for more international formats: http://bitboost.com/ref/international-address-formats.html
This not only helps in finding duplicates, but also saves you money when mailing you customers (the postal service charges less if the address is in a standard format).
Depending on your application, in some cases you might want to store a "vanity" address record as well as the standard address record. This keeps your VIP customers happy. A "vanity" address might be something like:
62 West Ninety First Street
Apartment 4D
Manhattan, New York, NY 10001
While the standard address might look like this:
62 W 91ST ST APT 4D
NEW YORK NY 10024-1414
Another possible solution (assuming you actually need reliable address data and you're not just using addresses as a way to prevent duplicate accounts) is to use a third-party web service to standardize the addresses provided by your users.
It works this way -- your system accepts a user's address via an online form. Your form hands off the user's address to the third-party address standardization web service. The web service gives you back the same address but now with the data standardized into discrete address fields, and with the standard abbreviations and formats applied. Your application displays this standardized address to your user for their confirmation before attempting to save the data in your DB.
If all the user addresses go through a standardization step and only standardized addresses are saved to your DB, then finding duplicate records should be greatly simplified since you are now comparing apples to apples.
One such third-party service is Global Address's Interactive Service which includes Germany in the list of supported countries, and also has an online demo that demonstrates how their service works (demo link can be found on that web page).
There's a cost disadvantage to this approach, obviously. However, on the plus side:
Disclaimer: I don't work for Global Address and have not tried using their service. I'm merely mentioning them as an example since they have an online demo that you can actually play with.
Machine learning and AI has algorithms to find string similarities and duplicate measures.
Record linkage or the task of matching equivalent records that differ syntactically—was first explored in the late 1950s and 1960s.
You can represent every pair of records using a vector of features that describe the similarity between individual record fields.
For example, Adaptive Duplicate Detection Using Learnable String Similarity Measures. for example, read this doc
You can use generic or manually tuned distance metrics for estimating the similarity of potential duplicates.
You can use adaptive name matching algorithms, like Jaro metric, which is based on the number and order of common characters between two strings.
Token-based and hybrid distance. In such cases, we can convert the strings s and t to token multisets (where each token is a word) and consider similarity metrics on these multisets.
Johannes:
@PConroy: This was my initial thougt also. the interesting part on this is to find good transformation rules for the different parts of the address! Any good suggestions?
When we were working on this type of project before, our approach was to take our existing corpus of addresses (150k or so), then apply the most common transformations for our domain (Ireland, so "Dr"->"Drive", "Rd"->"Road", etc). I'm afraid there was no comprehensive online resource for such things at the time, so we ended up basically coming up with a list ourselves, checking things like the phone book (pressed for space there, addresses are abbreviated in all manner of ways!). As I mentioned earlier, you'd be amazed how many "duplicates" you'll detect with the addition of only a few common rules!
I've recently stumbled across a page with a fairly comprehensive list of address abbreviations, although it's american english, so I'm not sure how useful it'd be in Germany! A quick google turned up a couple of sites, but they seemed like spammy newsletter sign-up traps. Although that was me googling in english, so you may have more look with "german address abbreviations" in german :)
One thing you might want to look at are Soundex searches, which are quite useful for misspellings and contractions.
This however is not an in-database validation so it may or may not be what you're looking for.