How do I find duplicate addresses in a database, or better, stop people from entering them while they fill in the form? I guess the earlier the better.
Is there any good way of abstra
I realize that the original post is specific to German addresses, but this is a good question for addresses in general.
In the United States, there is a part of an address called a delivery point barcode. It's a unique 12-digit number that identifies a single point of delivery and can serve as the unique identifier of an address. To get this value you'll want to use an address verification or address standardization web service API, which can cost about $20/mo depending upon the volume of requests you make to it.
In the interest of full disclosure, I'm the founder of SmartyStreets. We offer just such an address validation web service API called LiveAddress. You're more than welcome to contact me personally with any questions you have.
You could use the Google Geocoding API.
It does in fact return results for both of your examples; I just tried it. That way you get structured results that you can save in your database. If the lookup fails, ask the user to write the address in another way.
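A minimal sketch of that lookup, assuming the public Geocoding JSON endpoint and its `status` / `results[].formatted_address` response fields. The function names and the API key are placeholders, and the actual HTTP call is left out so the snippet runs offline:

```python
from typing import Optional
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address: str, api_key: str) -> str:
    """Build the request URL for one free-text address lookup."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def canonical_address(response: dict) -> Optional[str]:
    """Return the formatted address of the first result, or None if the lookup failed."""
    if response.get("status") != "OK" or not response.get("results"):
        return None
    return response["results"][0]["formatted_address"]
```

If `canonical_address` returns `None`, that is the point where the form would ask the user to write the address differently. Otherwise the returned string, rather than the raw input, is what gets stored and compared.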
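You would normally use constraints in the database to ensure that data is "unique", but only in the database's sense of the word, i.e. exactly equal values.
Regarding "isomorphisms" (different spellings of the same address), I think you are on your own, i.e. you have to write the code yourself. If you do it in the database, you could use a trigger.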
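As a rough illustration of combining the two ideas, the sketch below stores a normalized form of each address in a column with a UNIQUE constraint. SQLite is used only because it ships with Python. The table layout and the `normalize` function are assumptions, and the normalization (whitespace and case only) is deliberately simplistic:

```python
import sqlite3


def normalize(address: str) -> str:
    """Hypothetical normalization: collapse whitespace and ignore case."""
    return " ".join(address.split()).casefold()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE addresses ("
    " id INTEGER PRIMARY KEY,"
    " raw TEXT NOT NULL,"
    " normalized TEXT NOT NULL UNIQUE)"
)


def add_address(raw: str) -> bool:
    """Insert an address; return False if its normalized form is already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO addresses (raw, normalized) VALUES (?, ?)",
                (raw, normalize(raw)),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

The constraint does the rejecting. How well this works depends entirely on how much of the "same address, different spelling" problem `normalize` manages to capture.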
I'm looking for an answer addressing United States addresses
The issue in question is preventing users from entering duplicates like
Quellenstrasse 66/11
and Quellenstr. 66a-11
This happens when you let users enter the complete address in a single input box.
There are some methods you can use to prevent this.
From the Google Developers guide:
The term geocoding generally refers to translating a human-readable address into a location on a map. The process of doing the opposite, translating a location on the map into a human-readable address, is known as reverse geocoding.
And finally:
This stays efficient even when the number of test cases is high, because the number of entries you test against will be very small, so the check takes very little time.
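The individual steps are not preserved above, so the following is only one possible reading of the efficiency remark: compare a new address only against stored entries that share its postal code, and use a fuzzy string comparison within that small group. The index structure, the function names, the `0.75` threshold, and the postal code `12345` in the usage note are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical in-memory index: postal code -> street lines already stored.
by_postcode = defaultdict(list)


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def likely_duplicates(postcode: str, street: str, threshold: float = 0.75) -> list:
    """Return stored street lines in the same postal code that look like `street`."""
    candidates = by_postcode.get(postcode, [])
    return [s for s in candidates if similarity(street, s) >= threshold]
```

After `by_postcode["12345"].append("Quellenstrasse 66/11")`, calling `likely_duplicates("12345", "Quellenstr. 66a-11")` returns the stored entry, while the same street under another postal code is never compared at all. The threshold would need tuning against real data.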
In the USA, you can use the USPS Address Standardization Web Tool. It verifies and normalizes addresses for you, so you can normalize an address before checking whether it already exists in the database. If all the addresses in the database are already normalized, you'll be able to spot duplicates easily.
Sample URL:
https://production.shippingapis.com/ShippingAPI.dll?API=Verify&XML=insert_request_XML_here
Sample request:
<AddressValidateRequest USERID="XXXXX">
  <IncludeOptionalElements>true</IncludeOptionalElements>
  <ReturnCarrierRoute>true</ReturnCarrierRoute>
  <Address ID="0">
    <FirmName />
    <Address1 />
    <Address2>205 bagwell ave</Address2>
    <City>nutter fort</City>
    <State>wv</State>
    <Zip5></Zip5>
    <Zip4></Zip4>
  </Address>
</AddressValidateRequest>
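For illustration, a request like the sample can be assembled and URL-encoded as follows. The element names and the endpoint are taken from the sample. The function name and its parameters are hypothetical, and no HTTP call is made:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

USPS_ENDPOINT = "https://production.shippingapis.com/ShippingAPI.dll"


def build_verify_url(user_id: str, street: str, city: str, state: str, zip5: str = "") -> str:
    """Build a Verify request URL that mirrors the sample request."""
    root = ET.Element("AddressValidateRequest", USERID=user_id)
    ET.SubElement(root, "IncludeOptionalElements").text = "true"
    ET.SubElement(root, "ReturnCarrierRoute").text = "true"
    address = ET.SubElement(root, "Address", ID="0")
    # Same element order as in the sample; unused fields stay empty.
    fields = (
        ("FirmName", ""),
        ("Address1", ""),
        ("Address2", street),
        ("City", city),
        ("State", state),
        ("Zip5", zip5),
        ("Zip4", ""),
    )
    for tag, text in fields:
        ET.SubElement(address, tag).text = text
    xml = ET.tostring(root, encoding="unicode")
    # urlencode percent-escapes the XML so it is safe inside a query string.
    return USPS_ENDPOINT + "?" + urlencode({"API": "Verify", "XML": xml})
```

Calling `build_verify_url("XXXXX", "205 bagwell ave", "nutter fort", "wv")` yields a URL of the same form as the sample URL, with the XML placed in the `XML` parameter.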
Sample response:
<AddressValidateResponse>
  <Address ID="0">
    <Address2>205 BAGWELL AVE</Address2>
    <City>NUTTER FORT</City>
    <State>WV</State>
    <Zip5>26301</Zip5>
    <Zip4>4322</Zip4>
    <DeliveryPoint>05</DeliveryPoint>
    <CarrierRoute>C025</CarrierRoute>
  </Address>
</AddressValidateResponse>
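One way to turn such a response into a value you can compare or put under a unique index is to concatenate `Zip5`, `Zip4`, and `DeliveryPoint` into an 11-digit key. This is a sketch. The function name is hypothetical, and treating these three fields as a sufficient duplicate key is an assumption. The 12-digit delivery point barcode mentioned in an earlier answer would, as far as I know, add a check digit to these 11 digits, which is not computed here.

```python
import xml.etree.ElementTree as ET
from typing import Optional

SAMPLE_RESPONSE = """
<AddressValidateResponse>
  <Address ID="0">
    <Address2>205 BAGWELL AVE</Address2>
    <City>NUTTER FORT</City>
    <State>WV</State>
    <Zip5>26301</Zip5>
    <Zip4>4322</Zip4>
    <DeliveryPoint>05</DeliveryPoint>
    <CarrierRoute>C025</CarrierRoute>
  </Address>
</AddressValidateResponse>
"""


def delivery_point_key(response_xml: str) -> Optional[str]:
    """Return Zip5 + Zip4 + DeliveryPoint as 11 digits, or None if any part is missing."""
    address = ET.fromstring(response_xml.strip()).find("Address")
    if address is None:
        return None
    parts = [address.findtext(tag, default="") or "" for tag in ("Zip5", "Zip4", "DeliveryPoint")]
    digits = "".join(parts)
    if len(digits) != 11 or not digits.isdigit():
        return None
    return digits


print(delivery_point_key(SAMPLE_RESPONSE))  # 26301432205
```

Two differently typed inputs that the service resolves to the same delivery point produce the same key, so a plain equality check or a unique index on the key is enough to catch them.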
Other countries might have their own APIs. Other people have mentioned third-party APIs that support multiple countries, which might be useful in some cases.
To add an answer to my own question:
A different way of doing it is to ask users for their mobile phone number and send them a text message for verification. This stops most people from messing around with duplicate addresses.
I'm speaking from personal experience. (Thanks, pigsback!) They introduced confirmation through mobile phone. That stopped me from having two accounts! :-)