I am processing addresses into their respective field format for the database. I can get the house number out and the street type but trying to determine best method to get the
Carefully check your dataset to see whether this problem has already been handled for you.
I spent a fair amount of time creating a taxonomy of probable street name endings and using regexp conditionals to pluck the street number out of the full address strings, and then it turned out that the attributes table for my shapefiles had already segmented out these components.
Before you go forward with the process of parsing address strings, which is always a bit of a chore due to the inevitably strange variations (some parcel addresses are for landlocked parcels and have weird addresses, etc), make sure your dataset hasn't already done this for you!!!
But if it hasn't, run through the address strings: address.split(" ")
creates an array of "words". In most cases the first "word" is the street number. That worked for about 95% of my addresses. (NOTE: my :address strings did not contain city, county, state, or zip; they were only the local addresses.)
I ran through the entire population of addresses, plucked the last "word" from each address, examined the resulting array, and removed any "words" that were not street endings such as "Lane", "Road", or "Rd". From this list of address endings I created this huge matching regexp object:
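# Escape each ending and match it only as a whole word, so "St" does not hit "Stone".
streetnm_endings = street_endings.map { |s| /\b#{Regexp.escape(s)}\b/ }
endings_matches = Regexp.union(streetnm_endings)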
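A minimal sketch of how such an endings list could be collected, assuming a plain array of address strings. The sample addresses and the cutoff of two occurrences are made up for illustration; the resulting list still needs to be reviewed by hand.

```ruby
# Hypothetical sample data; real parcel addresses will be messier.
addresses = [
  "12 Oak Lane",
  "7707 Foo Bar Blvd",
  "3 Elm Road",
  "45 Maple Road",
  "9 Pine Lane",
  "100 Landlocked Parcel"
]

# Count how often each last "word" occurs.
last_word_counts = addresses.each_with_object(Hash.new(0)) do |address, counts|
  counts[address.split(" ").last] += 1
end

# Keep only the words that repeat; one-off words are less likely to be street types.
# (The threshold of 2 is an arbitrary assumption and drops "Blvd" in this tiny sample.)
street_endings = last_word_counts.select { |_word, count| count >= 2 }.keys

# Build one regexp that matches any of the endings as a whole word.
endings_matches = Regexp.union(street_endings.map { |s| /\b#{Regexp.escape(s)}\b/ })
```

A frequency cutoff like this only narrows the candidates; rare but valid endings still have to be added back manually.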
I ran through each address string, shifting out the first array member because, again, that was almost always the street number. Then I gsub'd out the street endings to get what should be the street name without the street number or street ending, which databases generally do not like:
parcels.each do |p|
  remainder = p.address.split(" ")
  p.streetnum = remainder.shift
  # strip removes the space left behind where the ending was cut out
  p.streetname = remainder.join(" ").gsub(endings_matches, "").strip
  p.save
end
It didn't always work but it worked most of the time.
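A sketch of one possible guard for some of the failing cases. This is not part of the code above: the `split_address` helper, the short endings list, and the leading-digit check are all assumptions for illustration.

```ruby
# Hypothetical endings list; in practice it comes from inspecting the data.
STREET_ENDINGS = %w[Lane Road Rd Blvd].freeze
ENDINGS_MATCHES = Regexp.union(STREET_ENDINGS.map { |s| /\b#{Regexp.escape(s)}\b/i })

# Returns [street_number, street_name]. street_number is nil when the
# first word does not start with a digit, so it is not mistaken for a number.
def split_address(address)
  words = address.split(" ")
  number = words.first.to_s.match?(/\A\d/) ? words.shift : nil
  name = words.join(" ").gsub(ENDINGS_MATCHES, "").strip
  [number, name]
end

with_number    = split_address("7707 Foo Bar Blvd")
without_number = split_address("Rear Lot Mill Road")
```

Here `with_number` is `["7707", "Foo Bar"]`, while `without_number` keeps "Rear" in the street name and leaves the number as `nil`, which makes the odd rows easy to find and review later.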
I currently just pass whatever I am given to Google Maps and have it send back a formatted street address that is very easy to parse.
function addressReview(addressInput) {
  // Declared with var so the geocoder does not leak into the global scope.
  var geocoder = new google.maps.Geocoder();
  geocoder.geocode({ 'address': addressInput }, function(results, status) {
    if (status == google.maps.GeocoderStatus.OK && results[0]) {
      var addr = results[0].formatted_address;
      var latTi = results[0].geometry.location.lat();
      var lonGi = results[0].geometry.location.lng();
      // Send the formatted address and coordinates to the server.
      $.post('/welcome/gcode', { add: addr, la: latTi, lo: lonGi });
      $('#cust_addy').val(addr);
    } else {
      $('#cust_addy').attr("placeholder", 'Cannot determine location');
    }
  });
}
After that, I just split it up in Ruby with .split(', ') and .split(' ').
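A rough sketch of that splitting step, assuming a US-style formatted address of the form "street, city, state zip, country". The sample string and variable names are illustrative, and addresses from other countries are laid out differently.

```ruby
# Example formatted address in the assumed US layout.
formatted = "1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA"

# First split on ', ' to separate the comma-delimited parts.
street_line, city, state_zip, country = formatted.split(', ')

# Then split the individual parts on spaces.
state, zip = state_zip.split(' ')

street_words  = street_line.split(' ')
street_number = street_words.shift      # first word of the street line
street_type   = street_words.pop        # last word of the street line
street_name   = street_words.join(' ')  # whatever is left in between
```

For the sample above this yields the number "1600", the name "Amphitheatre", and the type "Pkwy". Street lines with unit numbers or no suffix need extra handling.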
You could perhaps use something like:
^\S+ (.+?) \S+$
Here \S matches any non-whitespace character, ^ matches the beginning of the string (strictly, of a line in Ruby; \A anchors to the whole string), $ matches the end, and (.+?) captures anything in between the two.
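A small sketch of how that pattern might be applied in Ruby. The sample addresses are made up, and the pattern assumes single spaces between words.

```ruby
STREET_NAME_RE = /^\S+ (.+?) \S+$/

# Returns the text between the first and last word, or nil if there is no match.
def street_name_from(address)
  match = STREET_NAME_RE.match(address)
  match && match[1]
end

three_part = street_name_from("7707 Foo Bar Blvd")  # "Foo Bar"
two_part   = street_name_from("12 Main")            # nil: nothing between first and last word
```

The pattern needs at least three words, so short addresses come back as `nil` and have to be handled separately.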
I'd recommend using a library for this if possible, since address parsing can be difficult. Check out the Indirizzo Ruby gem, which makes this easy:
require 'Indirizzo'
address = Indirizzo::Address.new("7707 Foo Bar Blvd")
address.number
=> "7707"
address.street
=> ["foo bar blvd", "foo bar boulevard"]
Even if you don't use the Indirizzo library itself, reading through its source code is probably very useful to see how they solved the problem. For instance, it has finely-tuned regular expressions to match different parts of an address:
Match = {
# FIXME: shouldn't have to anchor :number and :zip at start/end
:number => /^(\d+\W|[a-z]+)?(\d+)([a-z]?)\b/io,
:street => /(?:\b(?:\d+\w*|[a-z'-]+)\s*)+/io,
:city => /(?:\b[a-z][a-z'-]+\s*)+/io,
:state => State.regexp,
:zip => /\b(\d{5})(?:-(\d{4}))?\b/o,
:at => /\s(at|@|and|&)\s/io,
:po_box => /\b[P|p]*(OST|ost)*\.*\s*[O|o|0]*(ffice|FFICE)*\.*\s*[B|b][O|o|0][X|x]\b/
}
These files from its source code can give more specifics:
(But I would also generally agree with @drhenner's comment that, in order to make this easier on yourself, you could probably just accept these data inputs in separate fields.)
Edit: To give a more specific answer about how to remove the street suffix (e.g., "Blvd"), you could use Indirizzo's regular expression constants (such as Suffix_Type from constants.rb) like so:
address = Indirizzo::Address.new("7707 Foo Bar Blvd", :expand_streets => false)
address.street.map {|street| street.gsub(Indirizzo::Suffix_Type.regexp, '').strip }
=> ["foo bar"]
(Notice I also passed :expand_streets => false to the initializer, to avoid having both "Blvd" and "Boulevard" alternatives expanded, since we're discarding the suffix anyway.)
You can play fast and loose with named capture groups in a regex:
matches = res[:address].match(/^(?<number>\S*)\s+(?<name>.*)\s+(?<type>.*)$/)
number = matches[:number]
street_name = matches[:name]
street_type = matches[:type]
Or, if you wanted your regex to be a little more accurate with the type, you could replace (?<type>.*) with (?<type>(Blvd|Ave|Rd|St)) and add all the different options you'd want.
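A sketch of the stricter variant, assuming a short, incomplete list of street types. The `STREET_TYPES` and `ADDRESS_RE` names are hypothetical; extend the list to whatever your data needs.

```ruby
# Hypothetical, incomplete list of street types.
STREET_TYPES = %w[Blvd Ave Rd St].freeze

# Regexp.union escapes each string and joins them with "|".
ADDRESS_RE = /^(?<number>\S+)\s+(?<name>.*)\s+(?<type>#{Regexp.union(STREET_TYPES).source})$/

matched = ADDRESS_RE.match("7707 Foo Bar Blvd")
number      = matched[:number]  # "7707"
street_name = matched[:name]    # "Foo Bar"
street_type = matched[:type]    # "Blvd"

# An address whose last word is not in the list does not match at all.
unmatched = ADDRESS_RE.match("12 Oak Lane")  # nil
```

Because an unknown type makes the whole match fail, check for `nil` before reading the captures.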