Question
I'm loading a POS-tagged corpus in NLTK, and I would like to find certain patterns involving POS tags. These patterns can be quite complex, including a lot of different combinations of POS tags. Example input string:
We/PRP spent/VBD some/DT time/NN reading/NN about/IN the/DT historical/JJ importance/NN of/IN tea/NN in/IN Korea/NNP and/CC China/NNP and/CC then/RB tasted/VBD the/DT most/JJS expensive/JJ green/JJ tea/NN I/PRP have/VBP ever/RB seen/VBN ./.
In this case the POS pattern is something like: (IN) (THE)? (NNP) (CC)? (NNP)
...
I'm loading my corpus with:
from nltk.corpus.reader import TaggedCorpusReader
reader = TaggedCorpusReader(corpus_dir, r'.*\.pos')
Clearly, I can do this using Python's re package, but such regular expressions quickly become hard for other developers to understand, debug, and update.
What is the most elegant way of doing this in NLTK? Are there helper functions for finding patterns in POS-tagged text that are more readable than plain regular expressions?
Thanks
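Answer 1:
NLTK has a function called str2tuple that parses a single tagged token such as 'tea/NN' into a (word, tag) tuple. Apply it to each whitespace-separated token of a tagged sentence to get a list of tuples; you can then easily extract the POS tags into a separate list. No need for regex.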
Source: https://stackoverflow.com/questions/15970033/regular-expressions-in-pos-tagged-nltk-corpus