Extracting a person's age from unstructured text in Python

栀梦 2021-01-18 23:08

I have a dataset of administrative filings that include short biographies. I am trying to extract people's ages using Python and some pattern matching. Some example of s

5 answers
  •  离开以前
    2021-01-18 23:54

    Judging by the examples you have given, here is the strategy I propose:

    Step 1:

    Check whether the sentence contains the word "Age" (case-insensitive) and, if it does, capture the first number that follows it. Regex: (?i)(Age).*?(\d+)

    The above will take care of examples like this:

    -- George F. Rubin(14)(15) age 68 Trustee since: 1997.

    -- Steve Jones, Age: 32
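
    A minimal sketch of Step 1 in Python, using the standard re module. The function name age_from_keyword is hypothetical, and the \b word boundaries are an addition to the regex above (an assumption, so that "age" inside words such as "manager" is not matched):

```python
import re

# Step 1 pattern from the answer. The \b word boundaries are an added
# assumption: they stop "age" inside words like "manager" from matching.
AGE_KEYWORD = re.compile(r"\bage\b.*?(\d+)", re.IGNORECASE)


def age_from_keyword(sentence):
    """Return the first number after the word 'age', or None if absent."""
    match = AGE_KEYWORD.search(sentence)
    return int(match.group(1)) if match else None


print(age_from_keyword("George F. Rubin(14)(15) age 68 Trustee since: 1997."))
print(age_from_keyword("Steve Jones, Age: 32"))
```

    The footnote markers (14)(15) come before the word "age", so they are skipped. Sentences without the keyword return None and fall through to Step 2.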

    Step 2:

    -- Check whether the sentence contains a "%" sign. If it does, remove the number attached to the sign.

    -- If "Age" is not in the sentence, remove all 4-digit numbers (typically years). Example regex: \b\d{4}\b

    -- Then check whether any digits remain in the sentence; those will be the age.

    Examples that get covered will be like:

    -- "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation" -- no numbers will be left

    -- "INDRA K. NOOYI, 56, has been PepsiCo's Chief Executive Officer (CEO) since 2006" -- only 56 will be left

    -- "Mr. Lovallo, 47, was appointed Treasurer in 2011." -- only 47 will be left
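
    The elimination in Step 2 could be sketched as follows. The function name age_by_elimination and the percentage pattern are assumptions for illustration; only \b\d{4}\b comes from the answer. The function returns a list, because more than one number may survive in sentences that do not fit these patterns:

```python
import re

# Numbers attached to a percent sign, e.g. "48%" or "12.5 %" (assumed pattern).
PERCENT = re.compile(r"\d+(?:\.\d+)?\s*%")
# The answer's regex for 4-digit numbers, which are treated as years.
FOUR_DIGITS = re.compile(r"\b\d{4}\b")
NUMBER = re.compile(r"\d+")


def age_by_elimination(sentence):
    """Drop percentages and 4-digit numbers, then return what is left."""
    cleaned = PERCENT.sub(" ", sentence)
    cleaned = FOUR_DIGITS.sub(" ", cleaned)
    return [int(n) for n in NUMBER.findall(cleaned)]


print(age_by_elimination("Mr. Lovallo, 47, was appointed Treasurer in 2011."))
```

    An empty list means no age candidate was found; a list with several entries signals a sentence that needs an extra rule or a manual check.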

    This may not be a complete answer, since your data may contain other patterns. But since you asked for a strategy, this approach handles all of the examples you posted.
