I am processing addresses into their respective field format for the database. I can get the house number out and the street type but trying to determine best method to get the
I'd recommend using a library for this if possible, since address parsing can be difficult. Check out the Indirizzo Ruby gem, which makes this easy:
require 'Indirizzo'
address = Indirizzo::Address.new("7707 Foo Bar Blvd")
address.number
=> "7707"
address.street
=> ["foo bar blvd", "foo bar boulevard"]
Even if you don't use the Indirizzo library itself, reading through its source code is probably very useful to see how they solved the problem. For instance, it has finely-tuned regular expressions to match different parts of an address:
Match = {
# FIXME: shouldn't have to anchor :number and :zip at start/end
:number => /^(\d+\W|[a-z]+)?(\d+)([a-z]?)\b/io,
:street => /(?:\b(?:\d+\w*|[a-z'-]+)\s*)+/io,
:city => /(?:\b[a-z][a-z'-]+\s*)+/io,
:state => State.regexp,
:zip => /\b(\d{5})(?:-(\d{4}))?\b/o,
:at => /\s(at|@|and|&)\s/io,
:po_box => /\b[P|p]*(OST|ost)*\.*\s*[O|o|0]*(ffice|FFICE)*\.*\s*[B|b][O|o|0][X|x]\b/
}
These files from its source code can give more specifics:
(But I would also generally agree with @drhenner's comment that, in order to make this easier on yourself, you could probably just accept these data inputs in separate fields.)
Edit: To give a more specific answer about how to remove the street suffix (e.g., "Blvd"), you could use Indirizzo's regular expression constants (such as Suffix_Type
from constants.rb
) like so:
address = Indirizzo::Address.new("7707 Foo Bar Blvd", :expand_streets => false)
address.street.map {|street| street.gsub(Indirizzo::Suffix_Type.regexp, '').strip }
=> ["foo bar"]
(Notice I also passed :expand_streets => false
to the initializer, to avoid having both "Blvd" and "Boulevard" alternatives expanded, since we're discarding the suffix anyway.)