Parsing Street Address Using RegEx

滥情空心 2021-01-28 23:36

I know there are many questions on this topic. I am trying to parse and fetch street addresses from HTML pages. The format of these pages does not follow any pattern. Can som

2 answers

别那么骄傲 (original poster)

2021-01-29 00:02
Before you get all traditional, let me share my experience. I've parsed over 1 million web pages this way in Java. When I need small pieces out of a page, a regex is perfect when paired with a replace to strip tags. In my experience it is also more efficient and faster, especially when using Java's replaceAll() function to strip tags. Run both approaches in a fork-join pool and test some parsing; you won't believe your eyes. I've added that part at the end. This is not the full regex but a starting point, since building one takes some trial and error. As I understand the question, you have a bunch of pages with no clear route to the address.

So, yes, there are ways. What follows is a bit of an introduction to thinking about this in regex.

Words and groups of words always follow a pattern; otherwise they wouldn't be readable. Still, there are several things to note. Addresses can vary greatly, so it is important to keep building out your regex. Next, if you have access to a CAS engine, run everything you extract through it. It standardizes your addresses.

First of all, you need to narrow everything. Have you tried XML parsing? It narrows the search and can help get rid of tags before you format. If you are using Java or Python, run this step in a ForkJoinPool or a multiprocessing Pool.
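To make the pooling idea concrete, here is a minimal sketch in Java. The class name `ParallelNarrowing`, the `narrow` helper, and the thread count are all hypothetical, and the narrowing step here is just a tag strip standing in for whatever narrowing fits your pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;

class ParallelNarrowing {
    // Hypothetical narrowing step: replace tags with spaces so later regexes see plain text.
    static String narrow(String html) {
        return html.replaceAll("<.*?>", " ").trim();
    }

    // Submits one task per page to a ForkJoinPool and collects results in input order.
    static List<String> narrowAll(List<String> pages, int threads) {
        ForkJoinPool pool = new ForkJoinPool(threads);
        try {
            List<ForkJoinTask<String>> tasks = new ArrayList<>();
            for (String page : pages) {
                Callable<String> task = () -> narrow(page);
                tasks.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (ForkJoinTask<String> t : tasks) {
                results.add(t.join()); // join() waits for the task and returns its result
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(narrowAll(List.of("<p>10 Oak Ave</p>", "<div>22 Birch Rd</div>"), 2));
    }
}
```

Whether this beats a single thread depends on page size and count, so it is worth timing both on your own data.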

Your process should be:
1. Narrow if possible
2. Execute a regex that exploits formatting
Lastly, here is a regex cheat sheet.

Keep in mind that I don't know which websites you are using or their formats. I have personally had to pull this data with different per-site regexes, but that was for odd formats and other issues with websites that run like databases of a certain variety.

That said, an address has a format of a number, then a street name and apartment number made of pretty much anything, then city, state, and zip code. Basically it is \d+ followed by any combination of letters and numbers.

So (in java with double backslashes) to start you off:
```
[\\d]+[A-Za-z0-9\\s,\\.]+
```
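As a rough illustration of how that starting pattern behaves, here is a small sketch using `java.util.regex`. The class name `AddressCandidates` and the sample text are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AddressCandidates {
    // Starting-point pattern from above: digits, then letters, digits, whitespace, commas, dots.
    static final Pattern CANDIDATE = Pattern.compile("[\\d]+[A-Za-z0-9\\s,\\.]+");

    // Returns every run of text that looks like "number followed by address-ish characters".
    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            hits.add(m.group().trim());
        }
        return hits;
    }

    public static void main(String[] args) {
        // prints [123 Main St, Springfield, IL 62704]
        System.out.println(find("Office: 123 Main St, Springfield, IL 62704!"));
    }
}
```

Because the character class is greedy, the match runs until the first character outside it (here the `!`), so on real pages it will usually pull in trailing text and needs the narrowing described next.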
If you are not using XML and want to anchor on surrounding markers without including them in the match, use lookarounds (start and end are placeholders for your own markers):
```
(?<=start)[\\d]+[A-Za-z0-9\\s,\\.]+?(?=end)
```
HTML pages always seem to have tags, so that would be something like:
```
(?<=>)[\\d]+[A-Za-z0-9\\s,\\.]+?(?=<) 
```
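A quick sketch of the tag-bounded version, assuming the address sits alone between a closing `>` and the next `<`. The class name `TagBoundedAddress` and the sample HTML are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagBoundedAddress {
    // Lookbehind requires a '>' just before the number; lookahead stops the lazy match at the next '<'.
    static final Pattern BETWEEN_TAGS =
            Pattern.compile("(?<=>)[\\d]+[A-Za-z0-9\\s,\\.]+?(?=<)");

    static List<String> find(String html) {
        List<String> hits = new ArrayList<>();
        Matcher m = BETWEEN_TAGS.matcher(html);
        while (m.find()) {
            hits.add(m.group().trim());
        }
        return hits;
    }

    public static void main(String[] args) {
        // prints [42 Elm Rd., Austin, TX 78701]
        System.out.println(find("<p>Call us</p><span>42 Elm Rd., Austin, TX 78701</span>"));
    }
}
```

Note that text starting with anything other than a digit right after the tag (including a leading space) will not match this pattern.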
You may be able to use a zip code as your ending point, including a multi-part zip code.
```
[\\d]+[A-Za-z0-9\\s,\\.]+?[\\d\\-]+
```
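Here is a small sketch of the zip-terminated pattern, with made-up sample strings and a hypothetical class name `ZipTerminated`. It also shows the main weakness I am aware of: the lazy middle stops at the first run of digits or hyphens, not necessarily the zip code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZipTerminated {
    // Number, lazy run of address characters, then the first run of digits and hyphens.
    static final Pattern TO_ZIP =
            Pattern.compile("[\\d]+[A-Za-z0-9\\s,\\.]+?[\\d\\-]+");

    static List<String> find(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = TO_ZIP.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // prints [123 Main St, Springfield, IL 62704-1234]
        System.out.println(find("Ship to 123 Main St, Springfield, IL 62704-1234 USA"));
        // prints [500 W 5] because the match ends at the first digit after the street number
        System.out.println(find("500 W 5th Ave, Reno"));
    }
}
```

So this variant works best when the only other number in the address is the zip code; numbered streets and apartment numbers need a stricter ending.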
As a final note, you can chain regexes together with a pipe (alternation), e.g.:
```
(?<=start)[\\d]+[A-Za-z0-9\\s,\\.]+?[\\d\\-]+|(?<=start)[A-Za-z0-9\\s,\\.]+?(?=end)
```
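To show the alternation falling back from one branch to the other, here is a sketch where `Address:` and `</` stand in for the start and end placeholders. Those markers, the class name `MarkerAlternation`, and the sample HTML are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MarkerAlternation {
    // "Address:" and "</" are hypothetical stand-ins for the start and end placeholders.
    static final Pattern EITHER = Pattern.compile(
            "(?<=Address:)[\\d]+[A-Za-z0-9\\s,\\.]+?[\\d\\-]+"
            + "|(?<=Address:)[A-Za-z0-9\\s,\\.]+?(?=</)");

    // Returns the first match, trimmed, or null when neither branch matches.
    static String first(String html) {
        Matcher m = EITHER.matcher(html);
        return m.find() ? m.group().trim() : null;
    }

    public static void main(String[] args) {
        // First branch: starts with a number and ends at the zip code.
        System.out.println(first("<div>Address:77 Pine St, Reno, NV 89501</div>"));
        // Second branch: no leading number, so it runs up to the closing tag.
        System.out.println(first("<div>Address: Pine Cottage, Mill Lane, Reno</div>"));
    }
}
```

The branches are tried left to right at each position, so put the stricter pattern first and the looser fallback last.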
If this is not narrow enough, there are several additional steps:
1. Compare your results (average word length, etc.) and throw out any large outliers.
2. Write a formatter script per site to do cleanup, single- or multi-threaded, that replaces what you don't need.
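One possible way to do the outlier check from step 1 is sketched below. It compares each candidate's average word length to the median across all candidates and drops anything more than a chosen factor away. The median rule, the factor, and the class name `OutlierFilter` are my own assumptions, not a fixed recipe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class OutlierFilter {
    // Average length of whitespace-separated words in one candidate.
    static double averageWordLength(String s) {
        String[] words = s.trim().split("\\s+");
        int total = 0;
        for (String w : words) {
            total += w.length();
        }
        return (double) total / words.length;
    }

    // Keeps candidates whose average word length is within [median / factor, median * factor].
    static List<String> dropOutliers(List<String> candidates, double factor) {
        List<String> kept = new ArrayList<>();
        if (candidates.isEmpty()) {
            return kept;
        }
        double[] avgs = candidates.stream().mapToDouble(OutlierFilter::averageWordLength).toArray();
        double[] sorted = avgs.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double median = (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
        for (int i = 0; i < n; i++) {
            if (avgs[i] >= median / factor && avgs[i] <= median * factor) {
                kept.add(candidates.get(i));
            }
        }
        return kept;
    }
}
```

For example, with a factor of 2.0, a stray match like `2021 Supercalifragilisticexpialidocious` is dropped from a list of ordinary addresses because its average word length is far above the median.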
You will probably need to strip out HTML as well. Run this regex in a replace statement to do that:
```
<.*?>
```
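In Java that replace can look like the sketch below. The whitespace collapse afterwards is my own addition, and the class name `TagStripper` is hypothetical.

```java
class TagStripper {
    // Replaces each tag with a space, then collapses runs of whitespace (the collapse is an extra step).
    static String strip(String html) {
        return html.replaceAll("<.*?>", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    public static void main(String[] args) {
        // prints 123 Main St, Springfield, IL
        System.out.println(strip("<p>123 Main St,<br/>Springfield, IL</p>"));
    }
}
```

By default `.` does not match line breaks, so a tag that spans several lines is left in place unless you prefix the pattern with `(?s)`.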
If you have trouble, use something like My Regex Tester (the website, not a tool of my own) to build your regex.