How can I extract an address from raw text using NLTK in Python?

梦如初夏 2021-02-07 19:30

I have this text

'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New York, NY 12345. Can you contact him now? If you ne

3 answers
  • 2021-02-07 20:03

    Check out libpostal, a library dedicated to parsing and normalizing addresses.

    It cannot extract addresses from raw text, but it may help with related tasks.

  • 2021-02-07 20:17

    Definitely regular expressions :)

    Something like

    import re
    
    txt = ...  # the raw text from the question
    # Raw string, the usual convention for regex patterns
    regexp = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
    address = re.findall(regexp, txt)
    
    # address = ['44 West 22nd Street, New York, NY 12345']
    

    Explanation:

    [0-9]{1,3}: 1 to 3 digits, the house number

    (space): a space between the number and the street name

    .+: the street name, one or more of any character

    , (comma and space): a comma and a space before the city

    .+: the city, one or more of any character

    , (comma and space): a comma and a space before the state

    [A-Z]{2}: exactly 2 uppercase letters from A to Z, the state abbreviation

    (space): a space between the state and the ZIP code

    [0-9]{5}: exactly 5 digits, the ZIP code

    re.findall(expr, string) returns a list of all non-overlapping matches found in the string.
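
    To make the breakdown above concrete, here is a minimal sketch of the same pattern with a named group per part. The group names, the lazy quantifiers (.+? instead of .+) and the shortened sample string are assumptions added for illustration and are not part of the original answer. The lazy quantifiers are meant to make each match stop at the first ZIP code instead of stretching to the last one in the text.

```python
import re

# Same structure as the pattern above. Named groups and lazy quantifiers
# are illustrative assumptions, not part of the original answer.
ADDRESS_RE = re.compile(
    r"(?P<number>[0-9]{1,3}) (?P<street>.+?), (?P<city>.+?), "
    r"(?P<state>[A-Z]{2}) (?P<zip>[0-9]{5})"
)

# Sample text: only the complete sentences visible in the question.
sample = (
    "Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, "
    "New York, NY 12345. Can you contact him now?"
)

# finditer yields match objects, so each part can be read by name.
matches = [m.groupdict() for m in ADDRESS_RE.finditer(sample)]
print(matches)
```

    Note that this still inherits the limits of the original pattern: house numbers longer than 3 digits are only partially matched, and addresses that span several lines or omit the commas are not matched at all.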

  • 2021-02-07 20:21

    Pyap works best, not just for this particular example but also for other addresses embedded in text.

    import pyap
    
    text = ...
    addresses = pyap.parse(text, country='US')
    