I have a dataset of administrative filings that include short biographies. I am trying to extract people's ages by using Python and some pattern matching. Some example of s
Since your text needs to be processed rather than just pattern matched, a better approach is to use one of the many NLP tools available.
What you want is Named Entity Recognition (NER), which is usually done with machine learning models. NER attempts to recognize a predefined set of entity types in text, for example locations, dates, organizations and person names.
While not 100% accurate, this is generally much more accurate than simple pattern matching (especially for English), since it relies on information beyond surface patterns, such as part-of-speech (POS) tags, dependency parsing, etc.
Take a look at the results I obtained for the phrases you provided, using the AllenNLP online tool (with the fine-grained NER model):
Notice that the last one is wrong. As I said, it is not 100% accurate, but it is easy to use.
The big advantage of this approach: you don't have to make a special pattern for every one of the millions of possibilities available.
The best thing: you can integrate it into your Python code:
pip install allennlp
And:
from allennlp.predictors import Predictor
al = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/fine-grained-ner-model-elmo-2018.12.21.tar.gz")
al.predict("Your sentence with date here")
Then, look at the resulting dict for "Date" Entities.
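To make "look at the resulting dict" a bit more concrete, here is a minimal sketch. It assumes the returned dict carries parallel "words" and "tags" lists and that the tags use a BIO/BILOU scheme such as "U-DATE" or "B-DATE" ... "L-DATE"; check the output of your own model before relying on this. The helper name entity_spans and the sample dict are made up for illustration, and the sample is hand-written, not real model output.

```python
def entity_spans(words, tags, label="DATE"):
    """Group BIO/BILOU-tagged tokens into entity strings for one label."""
    spans, current = [], []
    for word, tag in zip(words, tags):
        prefix, _, tag_label = tag.partition("-")
        if tag_label != label:
            # Token is outside the wanted label: close any open span.
            if current:
                spans.append(" ".join(current))
                current = []
            continue
        if prefix in ("B", "U") and current:
            # A new entity starts: close the previous one first.
            spans.append(" ".join(current))
            current = []
        current.append(word)
        if prefix in ("L", "U"):
            # The entity ends on this token.
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Hand-written stand-in for a prediction dict (assumed shape, not real output).
sample = {
    "words": ["Mr", "Smith", ",", "age", "52", ",", "joined", "in", "May", "2010", "."],
    "tags": ["O", "U-PERSON", "O", "O", "U-DATE", "O", "O", "O", "B-DATE", "L-DATE", "O"],
}
print(entity_spans(sample["words"], sample["tags"]))  # ['52', 'May 2010']
```

With a real predictor you would pass result["words"] and result["tags"] from al.predict(...) instead of the sample, assuming those keys are present.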
The same goes for spaCy:
!python3 -m spacy download en_core_web_lg
import spacy
sp_lg = spacy.load('en_core_web_lg')
{(ent.text.strip(), ent.label_) for ent in sp_lg("Your sentence with date here").ents}
(However, I have had some bad predictions with it, although it is generally considered the better one.)
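Once you have (text, label) pairs like the ones produced above, you still need to turn the date-like entities into actual ages. A rough sketch of that last step follows. It assumes that age expressions such as "52 years old" come back with the label "DATE", which is how I would expect OntoNotes-style models to tag them, but verify this on your data. The function name ages_from_entities and the sample pairs are hypothetical.

```python
import re

# A 1-3 digit number followed by "year(s)" or "yr(s)", e.g. "52 years old", "45-year-old".
AGE_RE = re.compile(r"\b(\d{1,3})[\s-]*(?:years?|yrs?)\b", re.IGNORECASE)


def ages_from_entities(entities):
    """Return integer ages found in DATE-labelled (text, label) pairs."""
    ages = []
    for text, label in entities:
        if label != "DATE":
            continue  # Only date-like entities are candidates for ages.
        match = AGE_RE.search(text)
        if match:
            ages.append(int(match.group(1)))
    return ages


# Hand-written sample pairs (not real model output).
pairs = [
    ("52 years old", "DATE"),
    ("May 2010", "DATE"),
    ("John Smith", "PERSON"),
    ("45-year-old", "DATE"),
]
print(ages_from_entities(pairs))  # [52, 45]
```

Note that a duration such as "10 years" would also match, so you may want to check the surrounding words (for example "old" or "age") before trusting a value.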
For more info, read this interesting article at Medium: https://medium.com/@b.terryjack/nlp-pretrained-named-entity-recognition-7caa5cd28d7b