Question
Given a raw string input such as:
1600 Divisadero St
San Francisco, CA 94115
b/t Post St & Sutter St
Lower Pacific Heights
I want to extract:
City: San Francisco
State: California (or CA)
Country: USA
I'll be parsing millions of addresses, so using a paid API is not feasible.
I'm planning to use a named entity recognizer, but I'm unable to find enough training data to cover arbitrary locations.
Is there an open-source project that I could use?
Answer 1:
OpenStreetMap's geocoding solution, Nominatim, can be downloaded and set up on your own machine. This is an extremely tedious and time-consuming process: you will need 500GB of free disk space and on the order of tens of days to do the indexing. At the end of it, though, you will have a full-fledged geocoder on your own machine that should be able to handle your current needs and many future ones.
If you go down this route, I recommend first trying out their example web APIs to see whether the quality is acceptable.
It is also well worth considering spending the money on the Google or Bing geocoder instead.
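To try the public web API mentioned above before committing to a self-hosted install, something like the following Python sketch may help. It only builds the request URL and picks fields out of an already decoded JSON result; the helper names `build_search_url` and `pick_fields` are made up for illustration, and the fallback from `city` to `town` or `village` is an assumption about how results are keyed, so check it against real responses.

```python
from urllib.parse import urlencode

# Public Nominatim search endpoint; swap in your own host for a local install.
NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(query, base=NOMINATIM_SEARCH):
    """Build a free-form search URL that asks for a structured address breakdown."""
    params = {
        "q": query,            # free-form address string
        "format": "jsonv2",    # JSON output
        "addressdetails": 1,   # include the per-component "address" object
        "limit": 1,            # only the best match
    }
    return base + "?" + urlencode(params)


def pick_fields(result):
    """Pull city/state/country out of one decoded result object.

    Assumption: smaller places may be keyed as "town" or "village"
    rather than "city", so fall back through those keys.
    """
    addr = result.get("address", {})
    city = addr.get("city") or addr.get("town") or addr.get("village")
    return {
        "city": city,
        "state": addr.get("state"),
        "country": addr.get("country"),
    }
```

Fetching the URL (for example with `urllib.request`) and decoding the JSON list is left out here. The public instance is meant for light testing only, so millions of lookups would still need the local setup described above.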
Answer 2:
@adi92's answer is the best choice here, but it requires a very powerful machine with many cores and a lot of RAM to index the entire database. For those who need something less computationally demanding, www.geonames.org is comprehensive enough if you only need city, state, and country.
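As a rough illustration of the lighter-weight route, the Python sketch below pairs a simple pattern for a US-style "City, ST ZIP" line with a state-name lookup table. It assumes the table is loaded from tab-separated rows shaped like the GeoNames `admin1CodesASCII.txt` dump (a code such as `US.CA` followed by the region name); the function names, the regular expression, and the hard-coded US default are illustrative assumptions, not part of GeoNames itself.

```python
import re

# Matches a US-style "City, ST 12345" or "City, ST 12345-6789" line.
CITY_STATE_ZIP = re.compile(
    r"^\s*(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+\d{5}(?:-\d{4})?\s*$"
)


def load_admin1(lines):
    """Map codes like 'US.CA' to names like 'California'.

    Assumes tab-separated rows whose first two columns are code and name.
    """
    table = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            table[parts[0]] = parts[1]
    return table


def extract(raw, admin1, country_code="US", country_name="USA"):
    """Return city/state/country from the first matching line, else None."""
    for line in raw.splitlines():
        match = CITY_STATE_ZIP.match(line)
        if match:
            code = match.group("state")
            return {
                "city": match.group("city").strip(),
                "state_code": code,
                # None if the code is missing from the lookup table
                "state": admin1.get(f"{country_code}.{code}"),
                "country": country_name,
            }
    return None
```

With the table loaded once, for example via `load_admin1(open("admin1CodesASCII.txt", encoding="utf-8"))`, each address costs only a regex match and a dictionary lookup. Addresses that do not follow the "City, ST ZIP" pattern return `None` and would need a fallback such as a city-name gazetteer or a full geocoder.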
Source: https://stackoverflow.com/questions/31452180/extracting-city-state-and-country-from-raw-address-string