How to parse freeform street/postal address out of text, and into components

感动是毒 2020-11-22 13:40

We do business largely in the United States and are trying to improve user experience by combining all the address fields into a single text area. But there are a few proble

9 Answers
  • 2020-11-22 14:26

    No code? For shame!

    Here is a simple JavaScript address parser. It's pretty awful for every single reason that Matt gives in his dissertation above (which I almost 100% agree with: addresses are complex types, and humans make mistakes; better to outsource and automate this - when you can afford to).

    But rather than cry, I decided to try:

    This code works OK for parsing most Esri results for findAddressCandidate, and also with some other (reverse) geocoders that return a single-line address where street/city/state are delimited by commas. You can extend it if you want, or write country-specific parsers. Or just use this as a case study of how challenging this exercise can be, or of how lousy I am at JavaScript. I admit I only spent about thirty minutes on this (future iterations could add caches, ZIP validation, and state lookups, as well as user location context), but it worked for my use case: the end user sees a form that parses the geocode search response into 4 textboxes. If address parsing comes out wrong (which is rare unless the source data was poor), it's no big deal: the user gets to verify and fix it! (Automated solutions, though, could either discard/ignore the result or flag it as an error, so a dev can either support the new format or fix the source data.)

    /* 
    address assumptions:
    - US addresses only (probably want separate parser for different countries)
    - No country code expected.
    - if last token is a number it is probably a postal code
    -- 5 digit number means more likely
    - if last token is a hyphenated string it might be a postal code
    -- if both sides are numeric, and in form #####-#### it is more likely
    - if city is supplied, state will also be supplied (city names not unique)
    - zip/postal code may be omitted even if has city & state
    - state may be two-char code or may be full state name.
    - commas: 
    -- last comma is usually city/state separator
    -- second-to-last comma is possibly street/city separator
    -- other commas are building-specific stuff that I don't care about right now.
    - token count:
    -- because units, street names, and city names may contain spaces token count highly variable.
    -- simplest address has at least two tokens: 714 OAK
    -- common simple address has at least four tokens: 714 S OAK ST
    -- common full (mailing) address has at least 5-7:
    --- 714 OAK, RUMTOWN, VA 59201
    --- 714 S OAK ST, RUMTOWN, VA 59201
    -- complex address may have a dozen or more:
    --- MAGICIAN SUPPLY, LLC, UNIT 213A, MAGIC TOWN MALL, 13 MAGIC CIRCLE DRIVE, LAND OF MAGIC, MA 73122-3412
    */
    
    var rawtext = $("textarea").val();
    var rawlist = rawtext.split("\n");
    
    function ParseAddressEsri(singleLineaddressString) {
      var address = {
        street: "",
        city: "",
        state: "",
        postalCode: ""
      };
    
      // tokenize by space (retain commas in tokens)
      // trim first so leading/trailing whitespace does not produce empty tokens
      var tokens = singleLineaddressString.trim().split(/\s+/);
      var lastToken = tokens.pop();
      if (
        // if numeric assume postal code (ignore length, for now)
        !isNaN(lastToken) ||
        // if hyphenated assume long zip code, ignore whether numeric, for now
        lastToken.split("-").length - 1 === 1) {
        address.postalCode = lastToken;
        lastToken = tokens.pop();
      }
    
      if (lastToken && isNaN(lastToken)) {
        if (address.postalCode.length && lastToken.length === 2) {
          // assume state/province code ONLY if had postal code
          // otherwise it could be a simple address like "714 S OAK ST"
          // where "ST" for "street" looks like two-letter state code
          // possibly this could be resolved with registry of known state codes, but meh. (and may collide anyway)
          address.state = lastToken;
          lastToken = tokens.pop();
        }
        if (address.state.length === 0) {
          // check for special case: might have State name instead of State Code.
          var stateNameParts = [lastToken.endsWith(",") ? lastToken.substring(0, lastToken.length - 1) : lastToken];
    
          // check remaining tokens from right-to-left for the first comma
          while (true) {
            lastToken = tokens.pop();
            if (!lastToken) break;
            else if (lastToken.endsWith(",")) {
              // found separator, ignore stuff on left side
              tokens.push(lastToken); // put it back
              break;
            } else {
              stateNameParts.unshift(lastToken);
            }
          }
          address.state = stateNameParts.join(' ');
          lastToken = tokens.pop();
        }
      }
    
      if (lastToken) {
        // here is where it gets trickier:
        if (address.state.length) {
          // if there is a state, then assume there is also a city and street.
          // PROBLEM: city may be multiple words (spaces)
          // but we can pretty safely assume next-from-last token is at least PART of the city name
          // most cities are single-name. It would be very helpful if we knew more context, like
          // the name of the city user is in. But ignore that for now.
          // ideally would have zip code service or lookup to give city name for the zip code.
          var cityNameParts = [lastToken.endsWith(",") ? lastToken.substring(0, lastToken.length - 1) : lastToken];
    
          // assumption / RULE: street and city must have comma delimiter
          // addresses that do not follow this rule will be wrong only if city has space
          // but don't care because Esri formats put comma before City
          var streetNameParts = [];
    
          // check remaining tokens from right-to-left for the first comma
          while (true) {
            lastToken = tokens.pop();
            if (!lastToken) break;
            else if (lastToken.endsWith(",")) {
              // found end of street address (may include building, etc. - don't care right now)
              // add token back to end, but remove trailing comma (it did its job)
              tokens.push(lastToken.substring(0, lastToken.length - 1));
              streetNameParts = tokens;
              break;
            } else {
              cityNameParts.unshift(lastToken);
            }
          }
          address.city = cityNameParts.join(' ');
          address.street = streetNameParts.join(' ');
        } else {
          // if there is NO state, then assume there is NO city also, just street! (easy)
          // reasoning: city names are not very original (Portland, OR and Portland, ME) so if user wants city they need to store state also (but if you are only ever in Portland, OR, you don't care about city/state)
          // put last token back in list, then rejoin on space
          tokens.push(lastToken);
          address.street = tokens.join(' ');
        }
      }
      // when parsing right-to-left hard to know if street only vs street + city/state
      // hack fix for now is to shift stuff around.
      // assumption/requirement: will always have at least street part; you will never just get "city, state"  
      // could possibly tweak this with options or more intelligent parsing&sniffing
      if (!address.city && address.state) {
        address.city = address.state;
        address.state = '';
      }
      if (!address.street) {
        address.street = address.city;
        address.city = '';
      }
    
      return address;
    }
    
    // get list of objects with discrete address properties
    var addresses = rawlist
      .filter(function(o) {
        return o.length > 0
      })
      .map(ParseAddressEsri);
    $("#output").text(JSON.stringify(addresses));
    console.log(addresses);
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
    <textarea>
    27488 Stanford Ave, Bowden, North Dakota
    380 New York St, Redlands, CA 92373
    13212 E SPRAGUE AVE, FAIR VALLEY, MD 99201
    1005 N Gravenstein Highway, Sebastopol CA 95472
    A. P. Croll &amp; Son 2299 Lewes-Georgetown Hwy, Georgetown, DE 19947
    11522 Shawnee Road, Greenwood, DE 19950
    144 Kings Highway, S.W. Dover, DE 19901
    Intergrated Const. Services 2 Penns Way Suite 405, New Castle, DE 19720
    Humes Realty 33 Bridle Ridge Court, Lewes, DE 19958
    Nichols Excavation 2742 Pulaski Hwy, Newark, DE 19711
    2284 Bryn Zion Road, Smyrna, DE 19904
    VEI Dover Crossroads, LLC 1500 Serpentine Road, Suite 100 Baltimore MD 21
    580 North Dupont Highway, Dover, DE 19901
    P.O. Box 778, Dover, DE 19903
    714 S OAK ST
    714 S OAK ST, RUM TOWN, VA, 99201
    3142 E SPRAGUE AVE, WHISKEY VALLEY, WA 99281
    27488 Stanford Ave, Bowden, North Dakota
    380 New York St, Redlands, CA 92373
    </textarea>
    <div id="output">
    </div>
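
    The "future iterations" mentioned above (ZIP validation and state lookups) could be sketched roughly as follows. This is only an illustration, not part of the original parser: the names `US_STATE_CODES`, `looksLikeZip`, `isStateCode` and `checkParsedAddress` are hypothetical, and the registry covers the 50 states plus DC only (no territories or military codes).

```javascript
// Hypothetical helpers; they only inspect the object returned by a parser
// shaped like { street, city, state, postalCode }.
var US_STATE_CODES = new Set([
  "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
  "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
  "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
  "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
  "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY",
  "DC"
]);

// True only for "#####" or "#####-####"
function looksLikeZip(token) {
  return /^\d{5}(-\d{4})?$/.test(token);
}

// Case-insensitive lookup; a trailing comma on the token is ignored
function isStateCode(token) {
  return US_STATE_CODES.has(token.replace(/,$/, "").toUpperCase());
}

// Returns a list of human-readable problems (empty list means nothing suspicious)
function checkParsedAddress(address) {
  var problems = [];
  if (address.postalCode && !looksLikeZip(address.postalCode)) {
    problems.push("postalCode is not in ##### or #####-#### form");
  }
  if (address.state && address.state.length === 2 && !isStateCode(address.state)) {
    problems.push("state is not a known two-letter code");
  }
  return problems;
}
```

    A registry like this would tell "ST" (street) apart from a state code, but it cannot settle every collision: "CT", for example, is both Connecticut and a common abbreviation for "Court". A non-empty result is a reason to ask the user to verify, not proof that the address is wrong.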

  • 2020-11-22 14:29

    There are many street address parsers. They come in two basic flavors - ones that have databases of place names and street names, and ones that don't.

    A regular expression street address parser can get up to about a 95% success rate without much trouble. Then you start hitting the unusual cases. The Perl one in CPAN, "Geo::StreetAddress::US", is about that good. There are Python and JavaScript ports of that, all open source. I have an improved version in Python which moves the success rate up slightly by handling more cases. To get the last 3% right, though, you need databases to help with disambiguation.
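
    To give a feel for the regular-expression approach, here is a deliberately small sketch. It is not the Geo::StreetAddress::US pattern or any port of it; `SIMPLE_US_ADDRESS` and `parseWithRegex` are made-up names, and the pattern assumes the shape "number street, city, ST zip" with a comma between street and city.

```javascript
// Hypothetical single-pattern parser for "number street, city, ST zip".
// Groups: 1 = house number, 2 = street, 3 = city, 4 = state code, 5 = ZIP or ZIP+4
var SIMPLE_US_ADDRESS =
  /^\s*(\d+[A-Za-z]?)\s+([^,]+?)\s*,\s*([^,]+?)\s*,?\s+([A-Za-z]{2})\s+(\d{5}(?:-\d{4})?)\s*$/;

function parseWithRegex(line) {
  var m = SIMPLE_US_ADDRESS.exec(line);
  if (!m) return null; // anything outside the assumed shape is rejected, not guessed at
  return {
    number: m[1],
    street: m[2],
    city: m[3],
    state: m[4].toUpperCase(),
    postalCode: m[5]
  };
}
```

    Lines such as "380 New York St, Redlands, CA 92373" fit the pattern, while "P.O. Box 778, Dover, DE 19903" or a bare "714 S OAK ST" return null. Those rejected lines are the "unusual cases" that a real parser has to keep adding rules for.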

    A database with 3-digit ZIP codes and US state names and abbreviations is a big help. When a parser sees a consistent postal code and state name, it can start to lock on to the format. This works very well for the US and UK.
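
    A consistency check between the 3-digit ZIP prefix and the state might look something like this. The table is a tiny hand-typed excerpt for illustration only (approximate ranges for four states), so verify it against USPS data before relying on it; `ZIP3_RANGES`, `statesForZip` and `zipMatchesState` are hypothetical names.

```javascript
// Illustrative excerpt only: [state, first 3-digit prefix, last 3-digit prefix], inclusive.
// A real table would cover every state and the gaps inside these ranges.
var ZIP3_RANGES = [
  ["DE", 197, 199],
  ["MD", 206, 219],
  ["CA", 900, 961],
  ["WA", 980, 994]
];

// States whose range contains the ZIP's first three digits
function statesForZip(zip) {
  var prefix = parseInt(String(zip).slice(0, 3), 10);
  return ZIP3_RANGES
    .filter(function (r) { return prefix >= r[1] && prefix <= r[2]; })
    .map(function (r) { return r[0]; });
}

// true = consistent, false = inconsistent, null = prefix not in the excerpt (unknown)
function zipMatchesState(zip, state) {
  var candidates = statesForZip(zip);
  if (candidates.length === 0) return null;
  return candidates.indexOf(state.toUpperCase()) !== -1;
}
```

    With this excerpt, `zipMatchesState("19947", "DE")` is consistent, while `zipMatchesState("99201", "MD")` is not, because the 992 prefix falls in the Washington range. When the pair agrees, a parser can treat the last two tokens as settled and concentrate on the city and street.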

    Proper street address parsing starts from the end and works backwards. That's how the USPS systems do it. Addresses are least ambiguous at the end, where country names, city names, and postal codes are relatively easy to recognize. Street names can usually be isolated. Locations on streets are the most complex to parse; there you encounter things such as "Fifth Floor" and "Staples Pavilion". That's when a database is a big help.
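
    As a rough sketch of why the "location on the street" part is the hard one, the following tries to peel a trailing unit designator off a street line with a pattern alone. The designator list is a short assumed sample, not an official list, and `UNIT_PATTERN` and `splitUnit` are hypothetical names.

```javascript
// Hypothetical: matches a trailing "Suite 405", "Unit 213A", "#4B", etc.
// The \b keeps "ste" from matching inside a word such as "Steele".
var UNIT_PATTERN =
  /[\s,]+((?:suite|ste|unit|apt|apartment|bldg|building|floor|fl|rm|room)\b\.?\s*#?\s*[\w-]+|#\s*[\w-]+)\s*$/i;

function splitUnit(streetLine) {
  var m = UNIT_PATTERN.exec(streetLine);
  if (!m) return { street: streetLine.trim(), unit: "" };
  return {
    street: streetLine.slice(0, m.index).trim(), // everything before the designator
    unit: m[1].trim()
  };
}
```

    This handles "2 Penns Way Suite 405", but "100 Main St Fifth Floor" comes back with an empty unit because the ordinal precedes the designator, and a building name such as a pavilion has no designator at all. Cases like these are where a database of known names helps more than another pattern.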

  • 2020-11-22 14:31

    Another option for US based addresses is YAddress (made by the company I work for).

    Many answers to this question suggest geocoding tools as a solution. It is important to not confuse address parsing and geocoding; they are not the same. While geocoders may break down an address into components as a side benefit, they usually rely on non-standard address sets. This means that a geocoder-parsed address may not be the same as the official address. For example, what Google geocoding API calls "6th Ave" in Manhattan, USPS calls "Avenue of the Americas".
