Parsing Street Address Using RegEx

滥情空心 2021-01-28 23:36

I know there are many questions asked on this topic. I am trying to parse and fetch street addresses from HTML pages. The format of these pages does not follow any pattern. Can som

2 Answers
  • 2021-01-28 23:47

    Having worked on this problem quite extensively at SmartyStreets, I will tell you "NO" to parsing/finding street addresses with a regex.

    Addresses are not a regular language and cannot be matched by a regular expression.

    To solve the problem, we developed an API which actually finds and extracts addresses, with notably high accuracy. It's free for low-volume use. (It was not an easy problem to solve.) You can try it for free on the homepage demo. And no, this is not a solicitation. If you want to learn more about street addresses in any amount of detail from very basic to very technical, just email us because we want to educate the community about addresses.

    To extract addresses, there are regular expressions under the hood, but results are biased strongly toward those that actually verify, meaning those that actually exist. In other words, this is a parser performing complex operations to find and match addresses.

    This answer to a very similar question is related, and you may find it useful. The other answers highlight some important points about the difficulties and solutions for parsing street addresses...

  • 2021-01-29 00:02

    Before you get all traditional, let me share my experience. I've parsed over 1 million web pages this way in Java. When I need small pieces out of a page, a regex is perfect when paired with a replace to strip tags. In fact, it is more efficient and faster, especially when using Java's replaceAll() method to strip tags. Build a fork-join pool of both and test some parsing; you won't believe your eyes. I've added that part at the end. This is not the full regex but a starting point, since it would take some trial and error to build. As I understand the question, you have a bunch of pages with no clear route to the address.

    So, yes, there are ways. What follows is a bit of an introduction to thinking about this in regex.

    Words and groups of words always follow a pattern; otherwise they wouldn't be readable. Still, there are several things to note. Addresses can vary greatly, so it is important to keep building out your regex. Next, if you have access to a CAS engine, run everything you extract through it. It standardizes your addresses.

    As a must: have you tried XML? It will narrow everything down and can help get rid of tags before you format. You need to narrow everything. If you are using Java or Python, run this step in a ForkJoinPool or a multiprocessing Pool.
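
    As a rough sketch of running a per-page step in a ForkJoinPool, assuming Java: the class name PageWorkers, the method mapInPool, and the idea of passing the step in as a Function are illustrative choices, not something from the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.Future;
import java.util.function.Function;

final class PageWorkers {
    // Applies `step` to every page on a dedicated ForkJoinPool and returns
    // the results in the same order as the input pages.
    static <R> List<R> mapInPool(List<String> pages, Function<String, R> step, int parallelism) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            List<Callable<R>> tasks = new ArrayList<>();
            for (String page : pages) {
                tasks.add(() -> step.apply(page));
            }
            List<R> results = new ArrayList<>();
            // invokeAll returns one Future per task, in submission order.
            for (Future<R> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // String::trim stands in for a real narrowing or tag-stripping step.
        System.out.println(mapInPool(List.of(" 123 Main St ", " 45 Oak Ave "), String::trim, 2));
    }
}
```

    Whether this beats a plain loop depends on page size and count, so it is worth timing both on your own data.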

    Your process should be:

    1. Narrow if possible
    2. Execute a regex that exploits formatting
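
    The two steps above might look something like this in Java. This is only a sketch: the `<address>` markers are a made-up example of something to narrow on, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NarrowThenMatch {
    // Loose candidate pattern: digits, then letters, digits, whitespace, commas, periods.
    private static final Pattern CANDIDATE = Pattern.compile("[\\d]+[A-Za-z0-9\\s,\\.]+");

    // Step 1: keep only the text between two markers.
    // Falls back to the whole page if the start marker is missing.
    static String narrow(String html, String startMarker, String endMarker) {
        int start = html.indexOf(startMarker);
        if (start < 0) {
            return html;
        }
        int from = start + startMarker.length();
        int end = html.indexOf(endMarker, from);
        return end < 0 ? html.substring(from) : html.substring(from, end);
    }

    // Step 2: run the regex over whatever text is left.
    static List<String> candidates(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            found.add(m.group().trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<p>Founded in 1999 by two friends.</p>"
                + "<address>123 Main St, Springfield, IL 62704</address>";
        System.out.println(candidates(html));
        System.out.println(candidates(narrow(html, "<address>", "</address>")));
    }
}
```

    On the sample page, the unnarrowed run also picks up the "1999 by two friends." sentence, which is the kind of noise the narrowing step is meant to remove.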

    Lastly, here is a regex cheat sheet.

    Keep in mind that I don't know what websites you are using or their formats. I have personally had to pull this data with different per-site regexes, but that was for odd formats and other issues with websites that run like databases of a certain variety.

    That said, an address generally starts with a number, followed by a street name and apartment number made up of pretty much anything, then city, state, and zip code. Basically it is \d+ followed by any combination of letters and numbers.

    So, to start you off (in Java, with double backslashes):

    [\\d]+[A-Za-z0-9\\s,\\.]+
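
    Here is a minimal sketch of how that starting pattern could be run in Java with Pattern and Matcher. The class name and sample sentence are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AddressCandidates {
    // The starting pattern: one or more digits, then any run of
    // letters, digits, whitespace, commas and periods.
    private static final Pattern CANDIDATE = Pattern.compile("[\\d]+[A-Za-z0-9\\s,\\.]+");

    // Returns every substring of `text` that matches the pattern.
    static List<String> find(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            found.add(m.group().trim());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(find("Visit us at 123 Main St, Springfield, IL 62704 today!"));
    }
}
```

    Note that on the sample sentence the match runs on through "today" and only stops at the "!", which is why the narrower variants that follow are needed.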
    

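    If you are not using XML and want to narrow the search by anchoring on the surrounding text without including it in the match, use a lookbehind and a lookahead (start and end are placeholders):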

    (?<=start)[\\d]+[A-Za-z0-9\\s,\\.]+?(?=end)
    

    HTML pages always seem to have tags, so that would be something like:

    (?<=>)[\\d]+[A-Za-z0-9\\s,\\.]+?(?=<) 
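
    A small sketch of the tag-bounded version in Java, assuming the address text sits directly between a closing ">" and the next "<". The class name and sample markup are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class BetweenTags {
    // Lookbehind for ">" and lookahead for "<": the tags bound the match
    // but are not part of the returned text.
    private static final Pattern BETWEEN_TAGS =
            Pattern.compile("(?<=>)[\\d]+[A-Za-z0-9\\s,\\.]+?(?=<)");

    static List<String> find(String html) {
        List<String> found = new ArrayList<>();
        Matcher m = BETWEEN_TAGS.matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<div><span>Call 555-1234</span>"
                + "<p>123 Main St, Springfield, IL 62704</p></div>";
        System.out.println(find(html));
    }
}
```

    Only text that starts with a digit immediately after a tag is picked up, so "Call 555-1234" in the sample is skipped while the paragraph holding the address is returned.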
    

    You may be able to use the zip code as the end of the match if it is a multi-part zip code (for example, ZIP+4):

    [\\d]+[A-Za-z0-9\\s,\\.]+?[\\d\\-]+
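
    To see how this behaves, here is a sketch comparing the pattern above with a tighter variant. The ZIP_PLUS_4 pattern is my own assumption about what a "multi-part" ending could look like (five digits, a hyphen, four digits); it is not from the answer itself.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ZipAnchoredMatch {
    // Pattern from above: lazy middle part, then a run of digits and hyphens.
    static final Pattern ORIGINAL =
            Pattern.compile("[\\d]+[A-Za-z0-9\\s,\\.]+?[\\d\\-]+");

    // Assumed tightening: require a full ZIP+4 as the ending.
    static final Pattern ZIP_PLUS_4 =
            Pattern.compile("[\\d]+[A-Za-z0-9\\s,\\.]+?\\d{5}-\\d{4}");

    // Returns the first match, or null if there is none.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String plain = "123 Main St, Springfield, IL 62704-1234";
        String numberedStreet = "500 5th Ave, New York, NY 10110-0001";
        System.out.println(firstMatch(ORIGINAL, plain));
        System.out.println(firstMatch(ORIGINAL, numberedStreet));
        System.out.println(firstMatch(ZIP_PLUS_4, numberedStreet));
    }
}
```

    Because the middle part is lazy, the original pattern stops at the first digit it meets, so a numbered street such as "5th Ave" cuts the match short. Requiring the full zip shape at the end avoids that, at the cost of missing addresses without a ZIP+4.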
    

    As a final note, you can chain regexes together with a pipe (alternation), e.g.:

    (?<=start)[\\d]+[A-Za-z0-9\\s,\\.]+?[\\d\\-]+|(?<=start)[A-Za-z0-9\\s,\\.]+?(?=end)
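
    As a sketch of the chained pattern in Java, with the start and end placeholders filled in by made-up markers ("Address: " and "<"). The markers and the class name are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ChainedPatterns {
    // First alternative: starts with digits and ends on a zip-like run.
    private static final String WITH_ZIP =
            "(?<=Address: )[\\d]+[A-Za-z0-9\\s,\\.]+?[\\d\\-]+";
    // Second alternative: anything allowed between the start and end markers.
    private static final String FALLBACK =
            "(?<=Address: )[A-Za-z0-9\\s,\\.]+?(?=<)";

    // At each position the alternatives are tried left to right,
    // so WITH_ZIP wins whenever both could match.
    private static final Pattern CHAINED = Pattern.compile(WITH_ZIP + "|" + FALLBACK);

    // Returns the first match, or null if there is none.
    static String firstMatch(String text) {
        Matcher m = CHAINED.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("Address: 123 Main St, Springfield, IL 62704-1234<br>"));
        System.out.println(firstMatch("Address: PO Box 12, Springfield, IL<br>"));
    }
}
```

    The second sample does not start with a digit, so the first alternative fails and the fallback picks up the text up to the next tag.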
    

    If this is not narrow enough there are several additional steps:

    1. Compare your results (average word length, etc.) and throw out any extreme outliers.
    2. Write a per-site formatter script to do cleanup, using single or multiple threads to replace what you don't need.
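
    One possible way to do the outlier step, as a sketch only: it compares each candidate's total character length to the median and keeps those within a factor of it. Using total length rather than average word length, and the factor of 2 in the example, are assumptions on my part and would need tuning per site.

```java
import java.util.ArrayList;
import java.util.List;

final class OutlierFilter {
    // Keeps candidates whose length lies within [median / factor, median * factor].
    // The input order is preserved.
    static List<String> dropLengthOutliers(List<String> candidates, double factor) {
        if (candidates.isEmpty()) {
            return new ArrayList<>();
        }
        int[] lengths = candidates.stream().mapToInt(String::length).sorted().toArray();
        int n = lengths.length;
        double median = (n % 2 == 1)
                ? lengths[n / 2]
                : (lengths[n / 2 - 1] + lengths[n / 2]) / 2.0;
        double low = median / factor;
        double high = median * factor;

        List<String> kept = new ArrayList<>();
        for (String candidate : candidates) {
            int length = candidate.length();
            if (length >= low && length <= high) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> candidates = List.of(
                "123 Main St, Springfield, IL 62704",
                "7 a",
                "45 Oak Ave, Portland, OR 97205",
                "2021 all rights reserved by the company and its affiliates worldwide",
                "9 Elm Rd, Austin, TX 73301");
        System.out.println(dropLengthOutliers(candidates, 2.0));
    }
}
```

    With the sample list, the three-character fragment and the long copyright line fall outside the range around the median and are dropped.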

    You will probably need to strip out the HTML as well. Run this regex in a replace call to do that:

    <.*?>
    

    If you have trouble, use something like my regex tester (the website is not my own) to build your regex.
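
    For reference, the tag-stripping replace mentioned above could be sketched in Java like this. Replacing each tag with a space and then collapsing whitespace is my own choice so that words on either side of a tag do not run together; the class name is hypothetical.

```java
final class TagStripper {
    // Removes anything that looks like a tag, then tidies up the whitespace.
    // Note: "." does not match line breaks by default, so a tag that spans
    // several lines would need a pattern such as "<[^>]*>" instead.
    static String stripTags(String html) {
        return html
                .replaceAll("<.*?>", " ")   // the tag-stripping regex from above
                .replaceAll("\\s+", " ")    // collapse runs of whitespace
                .trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>123 <b>Main</b> St,<br>Springfield, IL 62704</p>"));
    }
}
```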
