Regular expressions in POS tagged NLTK corpus

Submitted by 荒凉一梦 on 2021-02-08 06:29:14

Question


I'm loading a POS-tagged corpus in NLTK, and I would like to find certain patterns involving POS tags. These patterns can be quite complex, including a lot of different combinations of POS tags. Example input string:

We/PRP spent/VBD some/DT time/NN reading/NN about/IN the/DT historical/JJ importance/NN of/IN tea/NN in/IN Korea/NNP and/CC China/NNP and/CC then/RB tasted/VBD the/DT most/JJS expensive/JJ green/JJ tea/NN I/PRP have/VBP ever/RB seen/VBN ./.

In this case the POS pattern is something like: (IN) (THE)? (NNP) (CC)? (NNP) ...

I'm loading my corpus with:

 from nltk.corpus.reader import TaggedCorpusReader
 reader = TaggedCorpusReader(corpus_dir, r'.*\.pos')

Clearly, I can do this using Python's re package, but such regular expressions quickly become hard to understand, debug, and update for other developers.

What is the most elegant way of doing this in NLTK? Are there helper functions for finding patterns in POS-tagged text that are more readable than ordinary regular expressions?

Thanks


Answer 1:


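NLTK has a function called str2tuple that parses a single tagged token such as time/NN into a (word, tag) tuple. Apply it to each whitespace-separated token of a tagged sentence to get a list of tuples, from which you can easily extract the POS tags into a separate list. No need for regex.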
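As a rough sketch of that idea, the snippet below parses an excerpt of the example sentence with str2tuple, separates words from tags, and then looks for a fixed run of tags by comparing list slices. The helper find_tag_sequence is a hypothetical name introduced here for illustration, not part of NLTK, and it only handles fixed sequences (no optional elements such as the "?" in the pattern above).

```python
from nltk.tag import str2tuple

# Excerpt of the example sentence from the question.
sentence = "tea/NN in/IN Korea/NNP and/CC China/NNP and/CC then/RB tasted/VBD"

# str2tuple handles one "word/TAG" token at a time, so split on whitespace first.
tagged = [str2tuple(token) for token in sentence.split()]
words = [word for word, tag in tagged]
tags = [tag for word, tag in tagged]


def find_tag_sequence(tags, pattern):
    """Hypothetical helper: start indices where `pattern` occurs in `tags`."""
    n = len(pattern)
    return [i for i in range(len(tags) - n + 1) if tags[i:i + n] == pattern]


# Print the words covered by each match of the fixed tag sequence.
for start in find_tag_sequence(tags, ["IN", "NNP", "CC", "NNP"]):
    print(words[start:start + 4])
```

If the corpus is loaded through TaggedCorpusReader, its tagged_sents() method should already yield lists of (word, tag) tuples, so the str2tuple step would likely be unnecessary in that case.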



Source: https://stackoverflow.com/questions/15970033/regular-expressions-in-pos-tagged-nltk-corpus
