Extracting a person's age from unstructured text in Python

栀梦 2021-01-18 23:08

I have a dataset of administrative filings that include short biographies. I am trying to extract people's ages using Python and some pattern matching. Some example of s

5 answers
  • 2021-01-18 23:37

    This will work for all the cases you provided: https://repl.it/repls/NotableAncientBackground

    import re
    
    # Example sentences from the question
    sentences = [
        "Mr Bond, 67, is an engineer in the UK",
        "Amanda B. Bynes, 34, is an actress",
        "Peter Parker (45) will be our next administrator",
        "Mr. Dylan is 46 years old.",
        "Steve Jones, Age:32,",
        "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation",
        "George F. Rubin(14)(15) Age 68 Trustee since: 1997.",
        "INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006",
        "Mr. Lovallo, 47, was appointed Treasurer in 2011.",
        "Mr. Charles Baker, 79, is a business advisor to biotechnology companies.",
        "Mr. Botein, age 43, has been a member of our Board since our formation.",
    ]
    for sentence in sentences:
        # A number directly after "Age:" or "Age " (case-sensitive)
        age = re.findall(r'Age[\:\s](\d{1,3})', sentence)
        # A 1-3 digit number preceded by a space and followed by an optional comma and a space
        age.extend(re.findall(r' (\d{1,3}),? ', sentence))
        # Fall back to a number in parentheses only if nothing else matched
        if len(age) == 0:
            age = re.findall(r'\((\d{1,3})\)', sentence)
        print(sentence + " --- AGE: " + str(set(age)))
    

    Returns

    Mr Bond, 67, is an engineer in the UK --- AGE: {'67'}
    Amanda B. Bynes, 34, is an actress --- AGE: {'34'}
    Peter Parker (45) will be our next administrator --- AGE: {'45'}
    Mr. Dylan is 46 years old. --- AGE: {'46'}
    Steve Jones, Age:32, --- AGE: {'32'}
    Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation --- AGE: set()
    George F. Rubin(14)(15) Age 68 Trustee since: 1997. --- AGE: {'68'}
    INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006 --- AGE: {'56'}
    Mr. Lovallo, 47, was appointed Treasurer in 2011. --- AGE: {'47'}
    Mr. Charles Baker, 79, is a business advisor to biotechnology companies. --- AGE: {'79'}
    Mr. Botein, age 43, has been a member of our Board since our formation. --- AGE: {'43'}
    
  • 2021-01-18 23:46

    A simple way to find a person's age in your sentences is to extract a two-digit number:

    import re
    
    sentence = 'Steve Jones, Age: 32,'
    # \b\d{2}\b matches exactly two digits with a word boundary on each side
    print(re.findall(r"\b\d{2}\b", sentence)[0])
    
    # output: 32
    

    If you do not want to match a number that is followed by %, and you want the number to start at a word boundary, you could do:

    sentence = 'Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation'
    
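    # Two digits at a word boundary, not followed by '%' or by another digit
    match = re.findall(r"\b\d{2}(?!%)[^\d]", sentence)
    
    if match:
        # Each match includes the trailing non-digit character, so keep only the two digits
        print(match[0][:2])
    else:
        print('no match')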
    
    # output: no match
    

    This also works for the previous sentence ('Steve Jones, Age: 32,'), where it prints 32.
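    To check both sentences with the same pattern, the lookahead version can be wrapped in a small helper. This is only a sketch: the function name `first_two_digit_number` is made up for illustration, and the regex is the one from this answer, unchanged.

```python
import re

# Same pattern as in the answer: two digits at a word boundary,
# not followed by '%', and followed by some non-digit character.
TWO_DIGITS = re.compile(r"\b\d{2}(?!%)[^\d]")

def first_two_digit_number(sentence):
    """Return the first qualifying two-digit number as a string, or None."""
    match = TWO_DIGITS.search(sentence)
    # The match includes the trailing non-digit character, so keep two characters.
    return match.group()[:2] if match else None

examples = [
    'Steve Jones, Age: 32,',
    'Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation',
]
for s in examples:
    print(first_two_digit_number(s))
```

    One thing to keep in mind: because the pattern ends in [^\d], it needs a character after the number, so a number at the very end of a string (for example 'He is 32') is not matched.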

  • 2021-01-18 23:53
    import re
    
    x = [
        "Mr Bond, 67, is an engineer in the UK",
        "Amanda B. Bynes, 34, is an actress",
        "Peter Parker (45) will be our next administrator",
        "Mr. Dylan is 46 years old.",
        "Steve Jones, Age:32,",
    ]
    
    # Take the first run of 1-3 digits found in each sentence
    [re.findall(r'\d{1,3}', i)[0] for i in x] # ['67', '34', '45', '46', '32']
    
  • 2021-01-18 23:54

    Judging by the examples you have given, here is the strategy I propose:

    Step 1:

    Check whether the sentence contains the word "Age" and, if so, take the first number after it. Regex: (?i)(Age).*?(\d+)

    The above will take care of examples like this:

    -- George F. Rubin(14)(15) age 68 Trustee since: 1997.

    -- Steve Jones, Age: 32

    Step 2:

    -- Check whether a "%" sign is in the sentence; if so, remove the number attached to it

    -- If "Age" is not in the sentence, use a regex to remove all 4-digit numbers. Example regex: \b\d{4}\b

    -- Then see whether any digits remain in the sentence; those will be the age

    Examples that get covered will be like:

    -- "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation" -- No numbers will be left

    --"INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006" -- Only 56 will be left

    -- "Mr. Lovallo, 47, was appointed Treasurer in 2011." -- only 47 will be left

    This may not be a complete answer, since other patterns may occur in your data. But you asked for a strategy, and this one works for all the examples you posted.
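    The two steps above could be sketched roughly as follows. This is one possible reading of the strategy, not a definitive implementation: the function name `extract_age` is hypothetical, the percentage regex is an assumption (the answer does not give one), and the "Age" check uses word boundaries (\bage\b) rather than the bare (Age) from Step 1 so that words such as "manager" do not trigger it.

```python
import re

def extract_age(sentence):
    """Return the age found in `sentence` as a string, or None if nothing is left."""
    # Step 1: an explicit "Age" label wins; take the first number after it.
    # Assumption: \bage\b instead of the answer's (Age), to skip words like "manager".
    labelled = re.search(r'\bage\b.*?(\d+)', sentence, re.IGNORECASE)
    if labelled:
        return labelled.group(1)

    # Step 2a: remove numbers that carry a "%" sign (assumed pattern).
    cleaned = re.sub(r'\d+(?:\.\d+)?\s*%', ' ', sentence)
    # Step 2b: remove all 4-digit numbers, which are most likely years.
    cleaned = re.sub(r'\b\d{4}\b', ' ', cleaned)
    # Step 2c: whatever digits remain are taken to be the age.
    remaining = re.findall(r'\d+', cleaned)
    return remaining[0] if remaining else None

for s in [
    "George F. Rubin(14)(15) Age 68 Trustee since: 1997.",
    "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation",
    "INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006",
    "Mr. Lovallo, 47, was appointed Treasurer in 2011.",
]:
    print(s, "--- AGE:", extract_age(s))
```

    Returning only the first remaining number is a simplification; a sentence with several leftover numbers would need an extra rule to choose between them.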

  • 2021-01-19 00:00

    Since your text has to be processed, and not only pattern matched, the correct approach is to use one of the many NLP tools available out there.

    What you want is Named Entity Recognition (NER), which is usually done with machine learning models. NER attempts to recognize a fixed set of entity types in text, for example locations, dates, organizations and person names.

    While not 100% precise, this is much more precise than simple pattern matching (especially for English), since it relies on information beyond patterns, such as part-of-speech (POS) tags and dependency parsing.

    Take a look at the results I obtained for the phrases you provided using the AllenNLP online tool (with the fine-grained NER model):

    • "Mr Bond, 67, is an engineer in the UK":

    • "Amanda B. Bynes, 34, is an actress"

    • "Peter Parker (45) will be our next administrator"

    • "Mr. Dylan is 46 years old."

    • "Steve Jones, Age: 32,"

    Notice that this last one is wrong. As I said, not 100%, but easy to use.

    The big advantage of this approach: you don't have to make a special pattern for every one of the millions of possibilities available.

    The best thing: you can integrate it into your Python code:

    pip install allennlp
    

    And:

    from allennlp.predictors import Predictor
    # Load the pretrained fine-grained NER model from its archive URL
    al = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/fine-grained-ner-model-elmo-2018.12.21.tar.gz")
    al.predict("Your sentence with date here")
    

    Then, look at the resulting dict for "Date" Entities.

    The same goes for spaCy:

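    # In a notebook, the leading "!" runs the line as a shell command
    !python3 -m spacy download en_core_web_lg
    import spacy
    sp_lg = spacy.load('en_core_web_lg')
    # Set of (entity text, entity label) pairs found in the sentence
    {(ent.text.strip(), ent.label_) for ent in sp_lg("Your sentence with date here").ents}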
    

    (However, I have had some bad predictions from it, although it is considered better.)
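    Once you have (text, label) pairs like the ones the spaCy snippet builds, you still need to pull the number out of the date entities. A minimal sketch of that last step, assuming the ages come back under the "DATE" label as described above; the helper name `ages_from_entities`, the `max_age` cutoff and the sample pairs are all made up for illustration, and the sample is written by hand, not real model output:

```python
import re

def ages_from_entities(entities, max_age=120):
    """Collect plausible ages from (text, label) pairs labelled "DATE"."""
    ages = []
    for text, label in entities:
        if label != "DATE":
            continue
        # A DATE entity may be "46 years old" or a year such as "2011";
        # keep only numbers small enough to be an age (assumed cutoff).
        for number in re.findall(r"\d+", text):
            if 0 < int(number) <= max_age:
                ages.append(int(number))
    return ages

# Hypothetical entity pairs, hand-written for illustration only.
sample = [("Dylan", "PERSON"), ("46 years old", "DATE"), ("2011", "DATE")]
print(ages_from_entities(sample))
```

    The cutoff is what separates ages from years here, so it plays the same role as removing 4-digit numbers in the regex-based answers.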

    For more information, read this interesting article on Medium: https://medium.com/@b.terryjack/nlp-pretrained-named-entity-recognition-7caa5cd28d7b
