We do business largely in the United States and are trying to improve user experience by combining all the address fields into a single text area. But there are a few proble
UPDATE: Geocode.xyz now works worldwide. For examples see https://geocode.xyz
For USA, Mexico and Canada, see geocoder.ca.
For example:
Input: something going on near the intersection of main and arthur kill rd new york
Output:
<geodata> <latt>40.5123510000</latt> <longt>-74.2500500000</longt> <AreaCode>347,718</AreaCode> <TimeZone>America/New_York</TimeZone> <standard> <street1>main</street1> <street2>arthur kill</street2> <stnumber/> <staddress/> <city>STATEN ISLAND</city> <prov>NY</prov> <postal>11385</postal> <confidence>0.9</confidence> </standard> </geodata>
You may also check the results in the web interface or get output as Json or Jsonp. eg. I'm looking for restaurants around 123 Main Street, New York
libpostal: an open-source library to parse addresses, training with data from OpenStreetMap, OpenAddresses and OpenCage.
https://github.com/openvenues/libpostal (more info about it)
Other tools/services:
http://www.gisgraphy.com Free, open source, and ready to use geocoder and geolocalisation webservices, integrating OpenStreetMap, GeoNames and Quattroshapes.
https://github.com/kodapan/osm-common Library for accessing OpenStreetMap services, parsing and processing data.
http://wiki.openstreetmap.org/wiki/Nominatim
http://address-parser.net/
http://geoservices.tamu.edu/Services/AddressNormalization/
I'm late to the party, here is an Excel VBA script I wrote years ago for Australia. It can be easily modified to support other Countries. I've made a GitHub repository of the C# code here. I've hosted it on my site and you can download it here: http://jeremythompson.net/rocks/ParseAddress.xlsm
For any country with a PostCode that's numeric or can be matched with a RegEx my strategy works very well:
First we detect the First and Surname which are assumed to be the top line. Its easy to skip the name and start with the address by unticking the checkbox (called 'Name is top row' as shown below).
Next its safe to expect the Address consisting of the Street and Number come before the Suburb and the St, Pde, Ave, Av, Rd, Cres, loop, etc is a separator.
Detecting the Suburb vs the State and even Country can trick the most sophisticated parsers as there can be conflicts. To overcome this I use a PostCode look up based on the fact that after stripping Street and Apartment/Unit numbers as well as the PoBox,Ph,Fax,Mobile etc, only the PostCode number will remain. This is easy to match with a regEx to then look up the suburb(s) and country.
Your National Post Office Service will provide a list of post codes with Suburbs and States free of charge that you can store in an excel sheet, db table, text/json/xml file, etc.
DISCLAIMER, I know this code is not perfect, or even written well however its very easy to convert to any programming language and run in any type of application.The strategy is the answer depending on your country and rules, take this code as an example:
Option Explicit
Private Const TopRow As Integer = 0
Public Sub ParseAddress()
Dim strArr() As String
Dim sigRow() As String
Dim i As Integer
Dim j As Integer
Dim k As Integer
Dim Stat As String
Dim SpaceInName As Integer
Dim Temp As String
Dim PhExt As String
On Error Resume Next
Temp = ActiveSheet.Range("Address")
'Split info into array
strArr = Split(Temp, vbLf)
'Trim the array
For i = 0 To UBound(strArr)
strArr(i) = VBA.Trim(strArr(i))
Next i
'Remove empty items/rows
ReDim sigRow(LBound(strArr) To UBound(strArr))
For i = LBound(strArr) To UBound(strArr)
If Trim(strArr(i)) <> "" Then
sigRow(j) = strArr(i)
j = j + 1
End If
Next i
ReDim Preserve sigRow(LBound(strArr) To j)
'Find the name (MUST BE ON THE FIRST ROW UNLESS CHECKBOX UNTICKED)
i = TopRow
If ActiveSheet.Shapes("chkFirst").ControlFormat.Value = 1 Then
SpaceInName = InStr(1, sigRow(i), " ", vbTextCompare) - 1
If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
ActiveSheet.Range("FirstName") = VBA.Left(sigRow(i), SpaceInName)
Else
If MsgBox("First Name: " & VBA.Mid$(sigRow(i), 1, SpaceInName), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("FirstName") = VBA.Left(sigRow(i), SpaceInName)
End If
If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
ActiveSheet.Range("Surname") = VBA.Mid(sigRow(i), SpaceInName + 2)
Else
If MsgBox("Surame: " & VBA.Mid(sigRow(i), SpaceInName + 2), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Surname") = VBA.Mid(sigRow(i), SpaceInName + 2)
End If
sigRow(i) = ""
End If
'Find the Street by looking for a "St, Pde, Ave, Av, Rd, Cres, loop, etc"
For i = 1 To UBound(sigRow)
If Len(sigRow(i)) > 0 Then
For j = 0 To 8
If InStr(1, VBA.UCase(sigRow(i)), Street(j), vbTextCompare) > 0 Then
'Find the position of the street in order to get the suburb
SpaceInName = InStr(1, VBA.UCase(sigRow(i)), Street(j), vbTextCompare) + Len(Street(j)) - 1
'If its a po box then add 5 chars
If VBA.Right(Street(j), 3) = "BOX" Then SpaceInName = SpaceInName + 5
If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
ActiveSheet.Range("Street") = VBA.Mid(sigRow(i), 1, SpaceInName)
Else
If MsgBox("Street Address: " & VBA.Mid(sigRow(i), 1, SpaceInName), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Street") = VBA.Mid(sigRow(i), 1, SpaceInName)
End If
'Trim the Street, Number leaving the Suburb if its exists on the same line
sigRow(i) = VBA.Mid(sigRow(i), SpaceInName) + 2
sigRow(i) = Replace(sigRow(i), VBA.Mid(sigRow(i), 1, SpaceInName), "")
GoTo PastAddress:
End If
Next j
End If
Next i
PastAddress:
'Mobile
For i = 1 To UBound(sigRow)
If Len(sigRow(i)) > 0 Then
For j = 0 To 3
Temp = Mb(j)
If VBA.Left(VBA.UCase(sigRow(i)), Len(Temp)) = Temp Then
If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
ActiveSheet.Range("Mobile") = VBA.Mid(sigRow(i), Len(Temp) + 2)
Else
If MsgBox("Mobile: " & VBA.Mid(sigRow(i), Len(Temp) + 2), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Mobile") = VBA.Mid(sigRow(i), Len(Temp) + 2)
End If
sigRow(i) = ""
GoTo PastMobile:
End If
Next j
End If
Next i
PastMobile:
'Phone
For i = 1 To UBound(sigRow)
If Len(sigRow(i)) > 0 Then
For j = 0 To 1
Temp = Ph(j)
If VBA.Left(VBA.UCase(sigRow(i)), Len(Temp)) = Temp Then
'TODO: Detect the intl or national extension here.. or if we can from the postcode.
If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
ActiveSheet.Range("Phone") = VBA.Mid(sigRow(i), Len(Temp) + 3)
Else
If MsgBox("Phone: " & VBA.Mid(sigRow(i), Len(Temp) + 3), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Phone") = VBA.Mid(sigRow(i), Len(Temp) + 3)
End If
sigRow(i) = ""
GoTo PastPhone:
End If
Next j
End If
Next i
PastPhone:
'Email
For i = 1 To UBound(sigRow)
If Len(sigRow(i)) > 0 Then
'replace with regEx search
If InStr(1, sigRow(i), "@", vbTextCompare) And InStr(1, VBA.UCase(sigRow(i)), ".CO", vbTextCompare) Then
Dim email As String
email = sigRow(i)
email = Replace(VBA.UCase(email), "EMAIL:", "")
email = Replace(VBA.UCase(email), "E-MAIL:", "")
email = Replace(VBA.UCase(email), "E:", "")
email = Replace(VBA.UCase(Trim(email)), "E ", "")
email = VBA.LCase(email)
If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
ActiveSheet.Range("Email") = email
Else
If MsgBox("Email: " & email, vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Email") = email
End If
sigRow(i) = ""
Exit For
End If
End If
Next i
'Now the only remaining items will be the postcode, suburb, country
'there shouldn't be any numbers (eg. from PoBox,Ph,Fax,Mobile) except for the Post Code
'Join the string and filter out the Post Code
Temp = Join(sigRow, vbCrLf)
Temp = Trim(Temp)
For i = 1 To Len(Temp)
Dim postCode As String
postCode = VBA.Mid(Temp, i, 4)
'In Australia PostCodes are 4 digits
If VBA.Mid(Temp, i, 1) <> " " And IsNumeric(postCode) Then
If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
ActiveSheet.Range("PostCode") = postCode
Else
If MsgBox("Post Code: " & postCode, vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("PostCode") = postCode
End If
'Lookup the Suburb and State based on the PostCode, the PostCode sheet has the lookup
Dim mySuburbArray As Range
Set mySuburbArray = Sheets("PostCodes").Range("A2:B16670")
Dim suburbs As String
For j = 1 To mySuburbArray.Columns(1).Cells.Count
If mySuburbArray.Cells(j, 1) = postCode Then
'Check if the suburb is listed in the address
If InStr(1, UCase(Temp), mySuburbArray.Cells(j, 2), vbTextCompare) > 0 Then
'Set the Suburb and State
ActiveSheet.Range("Suburb") = mySuburbArray.Cells(j, 2)
Stat = mySuburbArray.Cells(j, 3)
ActiveSheet.Range("State") = Stat
'Knowing the State - for Australia we can get the telephone Ext
PhExt = PhExtension(VBA.UCase(Stat))
ActiveSheet.Range("PhExt") = PhExt
'remove the phone extension from the number
Dim prePhone As String
prePhone = ActiveSheet.Range("Phone")
prePhone = Replace(prePhone, PhExt & " ", "")
prePhone = Replace(prePhone, "(" & PhExt & ") ", "")
prePhone = Replace(prePhone, "(" & PhExt & ")", "")
ActiveSheet.Range("Phone") = prePhone
Exit For
End If
End If
Next j
Exit For
End If
Next i
End Sub
Private Function PhExtension(ByVal State As String) As String
Select Case State
Case Is = "NSW"
PhExtension = "02"
Case Is = "QLD"
PhExtension = "07"
Case Is = "VIC"
PhExtension = "03"
Case Is = "NT"
PhExtension = "04"
Case Is = "WA"
PhExtension = "05"
Case Is = "SA"
PhExtension = "07"
Case Is = "TAS"
PhExtension = "06"
End Select
End Function
Private Function Ph(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
Ph = "PH"
Case Is = 1
Ph = "PHONE"
'Case Is = 2
'Ph = "P"
End Select
End Function
Private Function Mb(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
Mb = "MB"
Case Is = 1
Mb = "MOB"
Case Is = 2
Mb = "CELL"
Case Is = 3
Mb = "MOBILE"
'Case Is = 4
'Mb = "M"
End Select
End Function
Private Function Fax(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
Fax = "FAX"
Case Is = 1
Fax = "FACSIMILE"
'Case Is = 2
'Fax = "F"
End Select
End Function
Private Function State(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
State = "NSW"
Case Is = 1
State = "QLD"
Case Is = 2
State = "VIC"
Case Is = 3
State = "NT"
Case Is = 4
State = "WA"
Case Is = 5
State = "SA"
Case Is = 6
State = "TAS"
End Select
End Function
Private Function Street(ByVal Num As Integer) As String
Select Case Num
Case Is = 0
Street = " ST"
Case Is = 1
Street = " RD"
Case Is = 2
Street = " AVE"
Case Is = 3
Street = " AV"
Case Is = 4
Street = " CRES"
Case Is = 5
Street = " LOOP"
Case Is = 6
Street = "PO BOX"
Case Is = 7
Street = " STREET"
Case Is = 8
Street = " ROAD"
Case Is = 9
Street = " AVENUE"
Case Is = 10
Street = " CRESENT"
Case Is = 11
Street = " PARADE"
Case Is = 12
Street = " PDE"
Case Is = 13
Street = " LANE"
Case Is = 14
Street = " COURT"
Case Is = 15
Street = " BLVD"
Case Is = 16
Street = "P.O. BOX"
Case Is = 17
Street = "P.O BOX"
Case Is = 18
Street = "PO BOX"
Case Is = 19
Street = "POBOX"
End Select
End Function
I saw this question a lot when I worked for an address verification company. I'm posting the answer here to make it more accessible to programmers who are searching around with the same question. The company I was at processed billions of addresses, and we learned a lot in the process.
First, we need to understand a few things about addresses.
This means that regular expressions are out. I've seen it all, from simple regular expressions that match addresses in a very specific format, to this:
/\s+(\d{2,5}\s+)(?![a|p]m\b)(([a-zA-Z|\s+]{1,5}){1,2})?([\s|,|.]+)?(([a-zA-Z|\s+]{1,30}){1,4})(court|ct|street|st|drive|dr|lane|ln|road|rd|blvd)([\s|,|.|;]+)?(([a-zA-Z|\s+]{1,30}){1,2})([\s|,|.]+)?\b(AK|AL|AR|AZ|CA|CO|CT|DC|DE|FL|GA|GU|HI|IA|ID|IL|IN|KS|KY|LA|MA|MD|ME|MI|MN|MO|MS|MT|NC|ND|NE|NH|NJ|NM|NV|NY|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VA|VI|VT|WA|WI|WV|WY)([\s|,|.]+)?(\s+\d{5})?([\s|,|.]+)/i
... to this where a 900+ line-class file generates a supermassive regular expression on the fly to match even more. I don't recommend these (for example, here's a fiddle of the above regex, that makes plenty of mistakes). There isn't an easy magic formula to get this to work. In theory and by theory, it's not possible to match addresses with a regular expression.
USPS Publication 28 documents the many formats of addresses that are possible, with all their keywords and variatons. Worst of all, addresses are often ambiguous. Words can mean more than one thing ("St" can be "Saint" or "Street") and there are words that I'm pretty sure they invented. (Who knew that "Stravenue" was a street suffix?)
You'd need some code that really understands addresses, and if that code does exist, it's a trade secret. But you could probably roll your own if you're really into that.
Here are some contrived (but complete) addresses:
1) 102 main street
Anytown, state
2) 400n 600e #2, 52173
3) p.o. #104 60203
Even these are possibly valid:
4) 829 LKSDFJlkjsdflkjsdljf Bkpw 12345
5) 205 1105 14 90210
Obviously, these are not standardized. Punctuation and line breaks not guaranteed. Here's what's going on:
Number 1 is complete because it contains a street address and a city and state. With that information, there's enough identify the address, and it can be considered "deliverable" (with some standardization).
Number 2 is complete because it also contains a street address (with secondary/unit number) and a 5-digit ZIP code, which is enough to identify an address.
Number 3 is a complete post office box format, as it contains a ZIP code.
Number 4 is also complete because the ZIP code is unique, meaning that a private entity or corporation has purchased that address space. A unique ZIP code is for high-volume or concentrated delivery spaces. Anything addressed to ZIP code 12345 goes to General Electric in Schenectady, NY. This example won't reach anyone in particular, but the USPS would still be able to deliver it.
Number 5 is also complete, believe it or not. With just those numbers, the full address can be discovered when parsed against a database of all possible addresses. Filling in the missing directionals, secondary designator, and ZIP+4 code is trivial when you see each number as a component. Here's what it looks like, fully expanded and standardized:
205 N 1105 W Apt 14
Beverly Hills CA 90210-5221
In most countries that provide official address data to licensed vendors, the address data itself belongs to the governing agency. In the US, the USPS owns the addresses. The same is true for Canada Post, Royal Mail, and others, though each country enforces or defines ownership a little differently. Knowing this is important, since it usually forbids reverse-engineering the address database. You have to be careful how to acquire, store, and use the data.
Google Maps is a common go-to for quick address fixes, but the TOS is rather prohibitive; for example, you can't use their data or APIs without showing a Google Map, and for non-commerical purposes only (unless you pay), and you can't store the data (except for temporary caching). Makes sense. Google's data is some of the best in the world. However, Google Maps does not verify the address. If an address does not exist, it will still show you where the address would be if it did exist (try it on your own street; use a house number that you know doesn't exist). This is useful sometimes, but be aware of that.
Nominatim's usage policy is similarly limiting, especially for high volume and commercial use, and the data is mostly drawn from free sources, so it isn't as well maintained (such is the nature of open projects) -- however, this may still suit your needs. It is supported by a great community.
The USPS itself has an API, but it goes down a lot and comes with no guarantees nor support. It might also be hard to use. Some people use it sparingly with no problems. But it's easy to miss that the USPS requires that you use their API only for confirming addresses to ship through them.
Unfortunately, we've conditioned our society to expect addresses to be complicated. There's dozens of good UX articles all over the Internet about this, but the fact is, if you have an address form with individual fields, that's what users expect, even though it makes it harder for edge-case addresses that don't fit the format the form is expecting, or maybe the form requires a field it shouldn't. Or users don't know where to put a certain part of their address.
I could go on and on about the bad UX of checkout forms these days, but instead I'll just say that combining the addresses into a single field will be a welcome change -- people will be able to type their address how they see fit, rather than trying to figure out your lengthy form. However, this change will be unexpected and users may find it a little jarring at first. Just be aware of that.
Part of this pain can be alleviated by putting the country field out front, before the address. When they fill out the country field first, you know how to make your form appear. Maybe you have a good way to deal with single-field US addresses, so if they select United States, you can reduce your form to a single field, otherwise show the component fields. Just things to think about!
The USPS licenses vendors through a process called CASS™ Certification to provide verified addresses to customers. These vendors have access to the USPS database, updated monthly. Their software must conform to rigorous standards to be certified, and they don't often require agreement to such limiting terms as discussed above.
There are many CASS-Certified companies that can process lists or have APIs: Melissa Data, Experian QAS, and SmartyStreets to name a few.
(Due to getting flak for "advertising" I've truncated my answer at this point. It's up to you to find a solution that works for you.)
The Truth: Really, folks, I don't work at any of these companies. It's not an advertisement.
If you want to rely on OSM data libpostal is very powerful and handles a lot of the most common caveats with address inputs.
For US Address Parsing,
I prefer using usaddress package that is available in pip for usaddress only
python3 -m pip install usaddress
Documentation
PyPi
This worked well for me for US address.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# address_parser.py
import sys
from usaddress import tag
from json import dumps, loads
if __name__ == '__main__':
tag_mapping = {
'Recipient': 'recipient',
'AddressNumber': 'addressStreet',
'AddressNumberPrefix': 'addressStreet',
'AddressNumberSuffix': 'addressStreet',
'StreetName': 'addressStreet',
'StreetNamePreDirectional': 'addressStreet',
'StreetNamePreModifier': 'addressStreet',
'StreetNamePreType': 'addressStreet',
'StreetNamePostDirectional': 'addressStreet',
'StreetNamePostModifier': 'addressStreet',
'StreetNamePostType': 'addressStreet',
'CornerOf': 'addressStreet',
'IntersectionSeparator': 'addressStreet',
'LandmarkName': 'addressStreet',
'USPSBoxGroupID': 'addressStreet',
'USPSBoxGroupType': 'addressStreet',
'USPSBoxID': 'addressStreet',
'USPSBoxType': 'addressStreet',
'BuildingName': 'addressStreet',
'OccupancyType': 'addressStreet',
'OccupancyIdentifier': 'addressStreet',
'SubaddressIdentifier': 'addressStreet',
'SubaddressType': 'addressStreet',
'PlaceName': 'addressCity',
'StateName': 'addressState',
'ZipCode': 'addressPostalCode',
}
try:
address, _ = tag(' '.join(sys.argv[1:]), tag_mapping=tag_mapping)
except:
with open('failed_address.txt', 'a') as fp:
fp.write(sys.argv[1] + '\n')
print(dumps({}))
else:
print(dumps(dict(address)))
Running the address_parser.py
python3 address_parser.py 9757 East Arcadia Ave. Saugus MA 01906
{"addressStreet": "9757 East Arcadia Ave.", "addressCity": "Saugus", "addressState": "MA", "addressPostalCode": "01906"}