How to parse freeform street/postal address out of text, and into components

感动是毒 2020-11-22 13:40

We do business largely in the United States and are trying to improve user experience by combining all the address fields into a single text area. But there are a few proble

9 answers
  • 2020-11-22 14:13

    UPDATE: Geocode.xyz now works worldwide. For examples see https://geocode.xyz

    For USA, Mexico and Canada, see geocoder.ca.

    For example:

    Input: something going on near the intersection of main and arthur kill rd new york

    Output:

    <geodata>
      <latt>40.5123510000</latt>
      <longt>-74.2500500000</longt>
      <AreaCode>347,718</AreaCode>
      <TimeZone>America/New_York</TimeZone>
      <standard>
        <street1>main</street1>
        <street2>arthur kill</street2>
        <stnumber/>
        <staddress/>
        <city>STATEN ISLAND</city>
        <prov>NY</prov>
        <postal>11385</postal>
        <confidence>0.9</confidence>
      </standard>
    </geodata>
    

    You may also check the results in the web interface, or get the output as JSON or JSONP, e.g. for the input: I'm looking for restaurants around 123 Main Street, New York
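    To consume a response like the XML shown above, a minimal sketch along these lines may help. It only parses the sample text (trimmed to the fields it reads); it does not call the service, and the function name `parse_geodata` and the output key names are illustrative assumptions, not part of any official client.

```python
import xml.etree.ElementTree as ET

# Sample response copied from the answer above (trimmed to the fields used here).
SAMPLE = """<geodata>
  <latt>40.5123510000</latt>
  <longt>-74.2500500000</longt>
  <standard>
    <street1>main</street1>
    <street2>arthur kill</street2>
    <stnumber/>
    <staddress/>
    <city>STATEN ISLAND</city>
    <prov>NY</prov>
    <postal>11385</postal>
    <confidence>0.9</confidence>
  </standard>
</geodata>"""


def parse_geodata(xml_text):
    """Flatten a <geodata> document into a dict; empty tags become None."""
    root = ET.fromstring(xml_text)
    result = {
        "latitude": float(root.findtext("latt")),
        "longitude": float(root.findtext("longt")),
    }
    for child in root.find("standard"):
        result[child.tag] = child.text  # None for empty tags such as <stnumber/>
    return result


parsed = parse_geodata(SAMPLE)
print(parsed["city"], parsed["prov"], parsed["confidence"])
```

    Empty elements such as `<stnumber/>` come back as `None`, so downstream code can tell a missing house number from an empty string.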

  • 2020-11-22 14:14

    libpostal: an open-source library for parsing addresses, trained on data from OpenStreetMap, OpenAddresses and OpenCage.

    https://github.com/openvenues/libpostal (more info about it)
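    As a rough sketch of how libpostal might be used from Python: the `postal` binding exposes `parse_address`, which returns a list of `(value, label)` pairs. The wrapper and helper names below are assumptions for illustration, the import is deferred because the binding needs the native library installed, and the example pairs are hand-written rather than real library output.

```python
def parse_with_libpostal(text):
    """Parse text with libpostal's Python binding (assumed to be installed)."""
    from postal.parser import parse_address  # lazy import: needs native libpostal
    return parse_address(text)


def to_components(pairs):
    """Turn [(value, label), ...] into {label: value}, joining repeated labels."""
    components = {}
    for value, label in pairs:
        if label in components:
            components[label] = f"{components[label]} {value}"
        else:
            components[label] = value
    return components


# Hand-written pairs in the (value, label) shape described above;
# illustrative only, not actual parser output.
example_pairs = [
    ("123", "house_number"),
    ("main street", "road"),
    ("new york", "city"),
    ("ny", "state"),
]
print(to_components(example_pairs))
```

    With libpostal installed, `to_components(parse_with_libpostal(text))` would give one dictionary per input string.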

    Other tools/services:

    • http://www.gisgraphy.com Free, open source, and ready to use geocoder and geolocalisation webservices, integrating OpenStreetMap, GeoNames and Quattroshapes.

    • https://github.com/kodapan/osm-common Library for accessing OpenStreetMap services, parsing and processing data.

    • http://wiki.openstreetmap.org/wiki/Nominatim

    • http://address-parser.net/

    • http://geoservices.tamu.edu/Services/AddressNormalization/

  • 2020-11-22 14:14

    I'm late to the party, but here is an Excel VBA script I wrote years ago for Australia. It can easily be modified to support other countries. I've made a GitHub repository of the C# code here. I've hosted it on my site and you can download it here: http://jeremythompson.net/rocks/ParseAddress.xlsm

    Strategy

    For any country with a PostCode that's numeric or can be matched with a RegEx my strategy works very well:

    1. First we detect the First Name and Surname, which are assumed to be on the top line. It's easy to skip the name and start with the address by unticking the checkbox (called 'Name is top row' as shown below).

    2. Next, it's safe to expect that the Address, consisting of the Street and Number, comes before the Suburb, and that the St, Pde, Ave, Av, Rd, Cres, Loop, etc. acts as a separator.

    3. Detecting the Suburb vs the State and even the Country can trick the most sophisticated parsers, as there can be conflicts. To overcome this I use a PostCode lookup, based on the fact that after stripping the Street and Apartment/Unit numbers as well as the PO Box, Ph, Fax, Mobile, etc., only the PostCode number will remain. This is easy to match with a RegEx, which is then used to look up the suburb(s) and country.

    Your National Post Office Service will provide a list of post codes with Suburbs and States free of charge that you can store in an excel sheet, db table, text/json/xml file, etc.

    4. Finally, since some Post Codes have multiple Suburbs, we check which suburb appears in the Address.
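    Steps 3 and 4 can be sketched in a few lines of Python. This is an illustration of the strategy, not a port of the VBA: the `POSTCODES` table is a tiny hypothetical excerpt standing in for the post office's list, and `locate` is a made-up helper name.

```python
import re

# Hypothetical excerpt of a postcode table: postcode -> [(suburb, state), ...]
POSTCODES = {
    "2000": [("SYDNEY", "NSW")],
    "2010": [("DARLINGHURST", "NSW"), ("SURRY HILLS", "NSW")],
}

# Four digits standing on their own, matching the Australian format.
POSTCODE_RE = re.compile(r"(?<!\d)\d{4}(?!\d)")


def locate(remaining_text):
    """Return (postcode, suburb, state) from text with street/phone lines removed."""
    upper = remaining_text.upper()
    for match in POSTCODE_RE.finditer(upper):
        postcode = match.group()
        for suburb, state in POSTCODES.get(postcode, []):
            if suburb in upper:  # step 4: pick the suburb actually named in the text
                return postcode, suburb, state
    return None


print(locate("Surry Hills NSW 2010"))
```

    For another country, the regular expression and the lookup table are the only parts that should need changing.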

    Example

    VBA Code

    DISCLAIMER: I know this code is not perfect, or even well written; however, it's very easy to convert to any programming language and run in any type of application. The strategy is the answer; depending on your country and rules, take this code as an example:

    Option Explicit
    
    Private Const TopRow As Integer = 0
    
    Public Sub ParseAddress()
    Dim strArr() As String
    Dim sigRow() As String
    Dim i As Integer
    Dim j As Integer
    Dim k As Integer
    Dim Stat As String
    Dim SpaceInName As Integer
    Dim Temp As String
    Dim PhExt As String
    
    On Error Resume Next
    
    Temp = ActiveSheet.Range("Address")
    
    'Split info into array
    strArr = Split(Temp, vbLf)
    
    'Trim the array
    For i = 0 To UBound(strArr)
    strArr(i) = VBA.Trim(strArr(i))
    Next i
    
    'Remove empty items/rows    
    ReDim sigRow(LBound(strArr) To UBound(strArr))
    For i = LBound(strArr) To UBound(strArr)
        If Trim(strArr(i)) <> "" Then
            sigRow(j) = strArr(i)
            j = j + 1
        End If
    Next i
    ReDim Preserve sigRow(LBound(strArr) To j)
    
    'Find the name (MUST BE ON THE FIRST ROW UNLESS CHECKBOX UNTICKED)
    i = TopRow
    If ActiveSheet.Shapes("chkFirst").ControlFormat.Value = 1 Then
    
    SpaceInName = InStr(1, sigRow(i), " ", vbTextCompare) - 1
    
    If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
    ActiveSheet.Range("FirstName") = VBA.Left(sigRow(i), SpaceInName)
    Else
     If MsgBox("First Name: " & VBA.Mid$(sigRow(i), 1, SpaceInName), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("FirstName") = VBA.Left(sigRow(i), SpaceInName)
    End If
    
    If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
    ActiveSheet.Range("Surname") = VBA.Mid(sigRow(i), SpaceInName + 2)
    Else
      If MsgBox("Surname: " & VBA.Mid(sigRow(i), SpaceInName + 2), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Surname") = VBA.Mid(sigRow(i), SpaceInName + 2)
    End If
    sigRow(i) = ""
    End If
    
    'Find the Street by looking for a "St, Pde, Ave, Av, Rd, Cres, loop, etc"
    For i = 1 To UBound(sigRow)
    If Len(sigRow(i)) > 0 Then
        For j = 0 To 8
        If InStr(1, VBA.UCase(sigRow(i)), Street(j), vbTextCompare) > 0 Then
    
        'Find the position of the street in order to get the suburb
        SpaceInName = InStr(1, VBA.UCase(sigRow(i)), Street(j), vbTextCompare) + Len(Street(j)) - 1
    
        'If its a po box then add 5 chars
        If VBA.Right(Street(j), 3) = "BOX" Then SpaceInName = SpaceInName + 5
    
        If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
        ActiveSheet.Range("Street") = VBA.Mid(sigRow(i), 1, SpaceInName)
        Else
          If MsgBox("Street Address: " & VBA.Mid(sigRow(i), 1, SpaceInName), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Street") = VBA.Mid(sigRow(i), 1, SpaceInName)
        End If
        'Trim the Street, Number leaving the Suburb if its exists on the same line
        'Keep only the text after the street (adding 2 to the string result raised a type mismatch hidden by On Error Resume Next)
        sigRow(i) = VBA.Trim(VBA.Mid(sigRow(i), SpaceInName + 1))
    
        GoTo PastAddress:
        End If
        Next j
    End If
    Next i
    PastAddress:
    
    'Mobile
    For i = 1 To UBound(sigRow)
    If Len(sigRow(i)) > 0 Then
        For j = 0 To 3
        Temp = Mb(j)
            If VBA.Left(VBA.UCase(sigRow(i)), Len(Temp)) = Temp Then
            If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
            ActiveSheet.Range("Mobile") = VBA.Mid(sigRow(i), Len(Temp) + 2)
            Else
              If MsgBox("Mobile: " & VBA.Mid(sigRow(i), Len(Temp) + 2), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Mobile") = VBA.Mid(sigRow(i), Len(Temp) + 2)
            End If
        sigRow(i) = ""
        GoTo PastMobile:
        End If
        Next j
    End If
    Next i
    PastMobile:
    
    'Phone
    For i = 1 To UBound(sigRow)
    If Len(sigRow(i)) > 0 Then
        For j = 0 To 1
        Temp = Ph(j)
            If VBA.Left(VBA.UCase(sigRow(i)), Len(Temp)) = Temp Then
    
                'TODO: Detect the intl or national extension here.. or if we can from the postcode.
                If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
                ActiveSheet.Range("Phone") = VBA.Mid(sigRow(i), Len(Temp) + 3)
                Else
                  If MsgBox("Phone: " & VBA.Mid(sigRow(i), Len(Temp) + 3), vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Phone") = VBA.Mid(sigRow(i), Len(Temp) + 3)
                End If
    
            sigRow(i) = ""
            GoTo PastPhone:
            End If
        Next j
    End If
    Next i
    PastPhone:
    
    
    'Email
    For i = 1 To UBound(sigRow)
        If Len(sigRow(i)) > 0 Then
            'replace with regEx search
            If InStr(1, sigRow(i), "@", vbTextCompare) > 0 And InStr(1, VBA.UCase(sigRow(i)), ".CO", vbTextCompare) > 0 Then
            Dim email As String
            email = sigRow(i)
            email = Replace(VBA.UCase(email), "EMAIL:", "")
            email = Replace(VBA.UCase(email), "E-MAIL:", "")
            email = Replace(VBA.UCase(email), "E:", "")
            email = Replace(VBA.UCase(Trim(email)), "E ", "")
            email = VBA.LCase(email)
    
                If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
                ActiveSheet.Range("Email") = email
                Else
                  If MsgBox("Email: " & email, vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("Email") = email
                End If
            sigRow(i) = ""
            Exit For
            End If
        End If
    Next i
    
    'Now the only remaining items will be the postcode, suburb, country
    'there shouldn't be any numbers (eg. from PoBox,Ph,Fax,Mobile) except for the Post Code
    
    'Join the string and filter out the Post Code
    Temp = Join(sigRow, vbCrLf)
    Temp = Trim(Temp)
    
    For i = 1 To Len(Temp)
    
    Dim postCode As String
    postCode = VBA.Mid(Temp, i, 4)
    
    'In Australia PostCodes are 4 digits
    If VBA.Mid(Temp, i, 1) <> " " And IsNumeric(postCode) Then
    
        If ActiveSheet.Shapes("chkConfirm").ControlFormat.Value = 0 Then
        ActiveSheet.Range("PostCode") = postCode
        Else
          If MsgBox("Post Code: " & postCode, vbQuestion + vbYesNo, "Confirm Details") = vbYes Then ActiveSheet.Range("PostCode") = postCode
        End If
    
        'Lookup the Suburb and State based on the PostCode, the PostCode sheet has the lookup
        Dim mySuburbArray As Range
        Set mySuburbArray = Sheets("PostCodes").Range("A2:B16670")
    
        Dim suburbs As String
        For j = 1 To mySuburbArray.Columns(1).Cells.Count
        If mySuburbArray.Cells(j, 1) = postCode Then
            'Check if the suburb is listed in the address
            If InStr(1, UCase(Temp), mySuburbArray.Cells(j, 2), vbTextCompare) > 0 Then
    
            'Set the Suburb and State
            ActiveSheet.Range("Suburb") = mySuburbArray.Cells(j, 2)
            Stat = mySuburbArray.Cells(j, 3)
            ActiveSheet.Range("State") = Stat
    
            'Knowing the State - for Australia we can get the telephone Ext
            PhExt = PhExtension(VBA.UCase(Stat))
            ActiveSheet.Range("PhExt") = PhExt
    
            'remove the phone extension from the number
            Dim prePhone As String
            prePhone = ActiveSheet.Range("Phone")
            prePhone = Replace(prePhone, PhExt & " ", "")
            prePhone = Replace(prePhone, "(" & PhExt & ") ", "")
            prePhone = Replace(prePhone, "(" & PhExt & ")", "")
            ActiveSheet.Range("Phone") = prePhone
            Exit For
            End If
        End If
        Next j
    Exit For
    End If
    Next i
    
    End Sub
    
    
    Private Function PhExtension(ByVal State As String) As String
    Select Case State
    Case Is = "NSW"
    PhExtension = "02"
    Case Is = "QLD"
    PhExtension = "07"
    Case Is = "VIC"
    PhExtension = "03"
    Case Is = "NT"
    PhExtension = "08"
    Case Is = "WA"
    PhExtension = "08"
    Case Is = "SA"
    PhExtension = "08"
    Case Is = "TAS"
    PhExtension = "03"
    End Select
    End Function
    
    Private Function Ph(ByVal Num As Integer) As String
    Select Case Num
    Case Is = 0
    Ph = "PH"
    Case Is = 1
    Ph = "PHONE"
    'Case Is = 2
    'Ph = "P"
    End Select
    End Function
    
    Private Function Mb(ByVal Num As Integer) As String
    Select Case Num
    Case Is = 0
    Mb = "MB"
    Case Is = 1
    Mb = "MOB"
    Case Is = 2
    Mb = "CELL"
    Case Is = 3
    Mb = "MOBILE"
    'Case Is = 4
    'Mb = "M"
    End Select
    End Function
    
    Private Function Fax(ByVal Num As Integer) As String
    Select Case Num
    Case Is = 0
    Fax = "FAX"
    Case Is = 1
    Fax = "FACSIMILE"
    'Case Is = 2
    'Fax = "F"
    End Select
    End Function
    
    Private Function State(ByVal Num As Integer) As String
    Select Case Num
    Case Is = 0
    State = "NSW"
    Case Is = 1
    State = "QLD"
    Case Is = 2
    State = "VIC"
    Case Is = 3
    State = "NT"
    Case Is = 4
    State = "WA"
    Case Is = 5
    State = "SA"
    Case Is = 6
    State = "TAS"
    End Select
    End Function
    
    Private Function Street(ByVal Num As Integer) As String
    Select Case Num
    Case Is = 0
    Street = " ST"
    Case Is = 1
    Street = " RD"
    Case Is = 2
    Street = " AVE"
    Case Is = 3
    Street = " AV"
    Case Is = 4
    Street = " CRES"
    Case Is = 5
    Street = " LOOP"
    Case Is = 6
    Street = "PO BOX"
    Case Is = 7
    Street = " STREET"
    Case Is = 8
    Street = " ROAD"
    Case Is = 9
    Street = " AVENUE"
    Case Is = 10
    Street = " CRESCENT"
    Case Is = 11
    Street = " PARADE"
    Case Is = 12
    Street = " PDE"
    Case Is = 13
    Street = " LANE"
    Case Is = 14
    Street = " COURT"
    Case Is = 15
    Street = " BLVD"
    Case Is = 16
    Street = "P.O. BOX"
    Case Is = 17
    Street = "P.O BOX"
    Case Is = 18
    Street = "PO BOX"
    Case Is = 19
    Street = "POBOX"
    End Select
    End Function
    
  • 2020-11-22 14:15

    I saw this question a lot when I worked for an address verification company. I'm posting the answer here to make it more accessible to programmers who are searching around with the same question. The company I was at processed billions of addresses, and we learned a lot in the process.

    First, we need to understand a few things about addresses.

    Addresses are not regular

    This means that regular expressions are out. I've seen it all, from simple regular expressions that match addresses in a very specific format, to this:

    /\s+(\d{2,5}\s+)(?![a|p]m\b)(([a-zA-Z|\s+]{1,5}){1,2})?([\s|,|.]+)?(([a-zA-Z|\s+]{1,30}){1,4})(court|ct|street|st|drive|dr|lane|ln|road|rd|blvd)([\s|,|.|;]+)?(([a-zA-Z|\s+]{1,30}){1,2})([\s|,|.]+)?\b(AK|AL|AR|AZ|CA|CO|CT|DC|DE|FL|GA|GU|HI|IA|ID|IL|IN|KS|KY|LA|MA|MD|ME|MI|MN|MO|MS|MT|NC|ND|NE|NH|NJ|NM|NV|NY|OH|OK|OR|PA|RI|SC|SD|TN|TX|UT|VA|VI|VT|WA|WI|WV|WY)([\s|,|.]+)?(\s+\d{5})?([\s|,|.]+)/i

    ... to this, where a class file of 900+ lines generates a supermassive regular expression on the fly to match even more. I don't recommend these (for example, here's a fiddle of the above regex, which makes plenty of mistakes). There isn't an easy magic formula to get this to work. In theory (in the formal-language sense of "regular"), it's not possible to match addresses with a regular expression.
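    To make the problem concrete, here is a small sketch with a deliberately simple pattern of my own (not the regex quoted above). It accepts one conventional layout and rejects two of the terse but complete addresses discussed later in this answer.

```python
import re

# A deliberately simple pattern: number, street name, suffix, then "ST 12345".
SIMPLE = re.compile(
    r"^\d+\s+[\w\s]+?\s(?:street|st|avenue|ave|road|rd)\b.*\b[a-z]{2}\s+\d{5}$",
    re.IGNORECASE,
)

samples = [
    "102 Main Street, Anytown, NY 12345",  # conventional layout
    "400n 600e #2, 52173",                 # grid-style street, unit, ZIP
    "p.o. #104 60203",                     # post office box and ZIP
]

# Record which samples the pattern accepts.
results = {text: bool(SIMPLE.match(text)) for text in samples}
for text, matched in results.items():
    print(f"{matched!s:5}  {text}")
```

    Loosening the pattern to accept the second and third samples would also start matching text that is not an address at all, which is the trade-off that makes the regex approach so brittle.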

    USPS Publication 28 documents the many formats of addresses that are possible, with all their keywords and variations. Worst of all, addresses are often ambiguous. Words can mean more than one thing ("St" can be "Saint" or "Street") and there are words that I'm pretty sure they invented. (Who knew that "Stravenue" was a street suffix?)

    You'd need some code that really understands addresses, and if that code does exist, it's a trade secret. But you could probably roll your own if you're really into that.

    Addresses come in unexpected shapes and sizes

    Here are some contrived (but complete) addresses:

    1)  102 main street
        Anytown, state
    
    2)  400n 600e #2, 52173
    
    3)  p.o. #104 60203
    

    Even these are possibly valid:

    4)  829 LKSDFJlkjsdflkjsdljf Bkpw 12345
    
    5)  205 1105 14 90210
    

    Obviously, these are not standardized. Punctuation and line breaks are not guaranteed. Here's what's going on:

    1. Number 1 is complete because it contains a street address and a city and state. With that information, there's enough to identify the address, and it can be considered "deliverable" (with some standardization).

    2. Number 2 is complete because it also contains a street address (with secondary/unit number) and a 5-digit ZIP code, which is enough to identify an address.

    3. Number 3 is a complete post office box format, as it contains a ZIP code.

    4. Number 4 is also complete because the ZIP code is unique, meaning that a private entity or corporation has purchased that address space. A unique ZIP code is for high-volume or concentrated delivery spaces. Anything addressed to ZIP code 12345 goes to General Electric in Schenectady, NY. This example won't reach anyone in particular, but the USPS would still be able to deliver it.

    5. Number 5 is also complete, believe it or not. With just those numbers, the full address can be discovered when parsed against a database of all possible addresses. Filling in the missing directionals, secondary designator, and ZIP+4 code is trivial when you see each number as a component. Here's what it looks like, fully expanded and standardized:

    205 N 1105 W Apt 14

    Beverly Hills CA 90210-5221
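    The idea behind Number 5 can be sketched as a lookup in which each bare number is treated as a component. This is only an illustration: `DATABASE` is a hypothetical one-record stand-in for a licensed address database, the field names are my own, and a real matcher would have to try several interpretations of the tokens rather than one fixed order.

```python
# Hypothetical one-record stand-in for a licensed address database.
DATABASE = [
    {
        "primary": "205", "predirection": "N", "street": "1105",
        "postdirection": "W", "unit_type": "Apt", "unit": "14",
        "city": "Beverly Hills", "state": "CA", "zip": "90210", "plus4": "5221",
    },
]


def expand(terse):
    """Expand 'primary street unit zip' by matching it against DATABASE."""
    tokens = terse.split()
    if len(tokens) != 4:
        return None
    for rec in DATABASE:
        key = [rec["primary"], rec["street"], rec["unit"], rec["zip"]]
        if key == tokens:
            # Fill in the directionals, unit designator and ZIP+4 from the record.
            line1 = (f'{rec["primary"]} {rec["predirection"]} {rec["street"]} '
                     f'{rec["postdirection"]} {rec["unit_type"]} {rec["unit"]}')
            line2 = f'{rec["city"]} {rec["state"]} {rec["zip"]}-{rec["plus4"]}'
            return line1, line2
    return None


print(expand("205 1105 14 90210"))
```

    The missing pieces are never guessed from the input; they come from the matching record, which is why this only works against a complete database.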

    Address data is not your own

    In most countries that provide official address data to licensed vendors, the address data itself belongs to the governing agency. In the US, the USPS owns the addresses. The same is true for Canada Post, Royal Mail, and others, though each country enforces or defines ownership a little differently. Knowing this is important, since it usually forbids reverse-engineering the address database. You have to be careful how to acquire, store, and use the data.

    Google Maps is a common go-to for quick address fixes, but the TOS is rather prohibitive; for example, you can't use their data or APIs without showing a Google Map, and for non-commercial purposes only (unless you pay), and you can't store the data (except for temporary caching). Makes sense. Google's data is some of the best in the world. However, Google Maps does not verify the address. If an address does not exist, it will still show you where the address would be if it did exist (try it on your own street; use a house number that you know doesn't exist). This is useful sometimes, but be aware of that.

    Nominatim's usage policy is similarly limiting, especially for high volume and commercial use, and the data is mostly drawn from free sources, so it isn't as well maintained (such is the nature of open projects) -- however, this may still suit your needs. It is supported by a great community.

    The USPS itself has an API, but it goes down a lot and comes with no guarantees nor support. It might also be hard to use. Some people use it sparingly with no problems. But it's easy to miss that the USPS requires that you use their API only for confirming addresses to ship through them.

    People expect addresses to be hard

    Unfortunately, we've conditioned our society to expect addresses to be complicated. There are dozens of good UX articles all over the Internet about this, but the fact is, if you have an address form with individual fields, that's what users expect, even though it makes things harder for edge-case addresses that don't fit the format the form is expecting, or when the form requires a field it shouldn't, or when users don't know where to put a certain part of their address.

    I could go on and on about the bad UX of checkout forms these days, but instead I'll just say that combining the addresses into a single field will be a welcome change -- people will be able to type their address how they see fit, rather than trying to figure out your lengthy form. However, this change will be unexpected and users may find it a little jarring at first. Just be aware of that.

    Part of this pain can be alleviated by putting the country field out front, before the address. When they fill out the country field first, you know how to make your form appear. Maybe you have a good way to deal with single-field US addresses, so if they select United States, you can reduce your form to a single field, otherwise show the component fields. Just things to think about!
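    A country-first form could be driven by something as small as the sketch below. The country codes, field names and the `address_fields` helper are all assumptions for illustration; the point is only that the country decides which inputs are rendered.

```python
# Assumption: freeform parsing is only trusted for US addresses.
SINGLE_FIELD_COUNTRIES = {"US"}

# Fallback component inputs for every other country (hypothetical names).
COMPONENT_FIELDS = ["street", "city", "region", "postal_code"]


def address_fields(country_code):
    """Return the address inputs to render once the country is known."""
    if country_code.upper() in SINGLE_FIELD_COUNTRIES:
        return ["address"]  # one freeform text area
    return list(COMPONENT_FIELDS)


print(address_fields("us"))
print(address_fields("DE"))
```

    Adding a country to `SINGLE_FIELD_COUNTRIES` then becomes a decision you make once you have a parser you trust for it.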

    Now we know why it's hard; what can you do about it?

    The USPS licenses vendors through a process called CASS™ Certification to provide verified addresses to customers. These vendors have access to the USPS database, updated monthly. Their software must conform to rigorous standards to be certified, and they don't often require agreement to such limiting terms as discussed above.

    There are many CASS-Certified companies that can process lists or have APIs: Melissa Data, Experian QAS, and SmartyStreets to name a few.

    (Due to getting flak for "advertising" I've truncated my answer at this point. It's up to you to find a solution that works for you.)

    The Truth: Really, folks, I don't work at any of these companies. It's not an advertisement.

  • 2020-11-22 14:19

    If you want to rely on OSM data, libpostal is very powerful and handles many of the most common pitfalls in address input.

  • 2020-11-22 14:24

    For US address parsing, I prefer the usaddress package, which is available via pip and handles US addresses only:

    python3 -m pip install usaddress
    

    Documentation
    PyPi

    This worked well for me for US addresses.

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    
    # address_parser.py
    import sys
    from usaddress import tag
    from json import dumps
    
    if __name__ == '__main__':
        tag_mapping = {
            'Recipient': 'recipient',
            'AddressNumber': 'addressStreet',
            'AddressNumberPrefix': 'addressStreet',
            'AddressNumberSuffix': 'addressStreet',
            'StreetName': 'addressStreet',
            'StreetNamePreDirectional': 'addressStreet',
            'StreetNamePreModifier': 'addressStreet',
            'StreetNamePreType': 'addressStreet',
            'StreetNamePostDirectional': 'addressStreet',
            'StreetNamePostModifier': 'addressStreet',
            'StreetNamePostType': 'addressStreet',
            'CornerOf': 'addressStreet',
            'IntersectionSeparator': 'addressStreet',
            'LandmarkName': 'addressStreet',
            'USPSBoxGroupID': 'addressStreet',
            'USPSBoxGroupType': 'addressStreet',
            'USPSBoxID': 'addressStreet',
            'USPSBoxType': 'addressStreet',
            'BuildingName': 'addressStreet',
            'OccupancyType': 'addressStreet',
            'OccupancyIdentifier': 'addressStreet',
            'SubaddressIdentifier': 'addressStreet',
            'SubaddressType': 'addressStreet',
            'PlaceName': 'addressCity',
            'StateName': 'addressState',
            'ZipCode': 'addressPostalCode',
        }
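        raw = ' '.join(sys.argv[1:])
        try:
            address, _ = tag(raw, tag_mapping=tag_mapping)
        except Exception:  # e.g. usaddress.RepeatedLabelError on ambiguous input
            # Log the full input (not just the first argument) for later review
            with open('failed_address.txt', 'a') as fp:
                fp.write(raw + '\n')
            print(dumps({}))
        else:
            print(dumps(dict(address)))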
    

    Running the address_parser.py

     python3 address_parser.py 9757 East Arcadia Ave. Saugus MA 01906
     {"addressStreet": "9757 East Arcadia Ave.", "addressCity": "Saugus", "addressState": "MA", "addressPostalCode": "01906"}
    