Reference: http://blog.csdn.net/u010718606/article/details/50148261
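NLTK offers out-of-the-box APIs for many natural language processing tasks, but the results often leave you unsure what actually happened.
The example below uses NLTK for named entity recognition. In the first example Apple is recognized as an entity; in the second it is not. Let's find out what causes the difference.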
In [1]: import nltk

In [2]: tokens = nltk.word_tokenize('I am very excited about the next generation of Apple products.')

In [3]: tokens = nltk.pos_tag(tokens)

In [4]: print(tokens)
[('I', 'PRP'), ('am', 'VBP'), ('very', 'RB'), ('excited', 'JJ'), ('about', 'IN'), ('the', 'DT'), ('next', 'JJ'), ('generation', 'NN'), ('of', 'IN'), ('Apple', 'NNP'), ('products', 'NNS'), ('.', '.')]

In [5]: tree = nltk.ne_chunk(tokens)

In [6]: print(tree)
(S I/PRP am/VBP very/RB excited/JJ about/IN the/DT next/JJ generation/NN of/IN (GPE Apple/NNP) products/NNS ./.)

In [7]: tokens = nltk.word_tokenize('I bought these Apple products today.')

In [8]: tokens = nltk.pos_tag(tokens)

In [9]: print(tokens)
[('I', 'PRP'), ('bought', 'VBD'), ('these', 'DT'), ('Apple', 'NNP'), ('products', 'NNS'), ('today', 'NN'), ('.', '.')]

In [10]: tree = nltk.ne_chunk(tokens)

In [11]: print(tree)
(S I/PRP bought/VBD these/DT Apple/NNP products/NNS today/NN ./.)
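The maximum entropy algorithm
Notice that in both examples Apple is POS-tagged as NNP (the Penn Treebank II tag for a singular proper noun), and in both it starts with a capital letter. So why is Apple labeled GPE (geo-political entity) in example 1 but left unlabeled in example 2? And why GPE rather than ORG (organization)?
NLTK's named entity recognizer is built on a MaxEnt (maximum entropy) classifier. A MaxEnt classifier follows two principles: 1. keep the distribution as uniform as possible (that is, maximize entropy); 2. keep its probabilities consistent with the empirical data. The empirical data comes from a manually annotated corpus, which is why most labeled data is not free. NLTK does not ship the corpus its named entity recognizer was trained on (the training data comes from ACE, Automatic Content Extraction). What NLTK does provide is a pickle file (under nltk_data/chunkers/), and that pickle file holds the trained chunker, which wraps the MaxEnt classifier instance.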
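To make the first principle concrete, here is a minimal sketch that compares the entropy of a uniform distribution with a skewed one. The two distributions over four labels are made-up numbers for illustration, not values taken from NLTK.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical distributions over four labels (illustrative values only).
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.70, 0.10, 0.10, 0.10]

print(entropy(uniform))  # the uniform distribution has the highest entropy
print(entropy(skewed))   # any deviation from uniform lowers it
```

With no evidence, a MaxEnt model prefers the uniform case; the training data then pulls it away from uniform only as far as the observed feature statistics require.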
➜ maxent_ne_chunker tree
.
├── english_ace_binary.pickle
├── english_ace_multiclass.pickle
└── PY3
    ├── english_ace_binary.pickle
    └── english_ace_multiclass.pickle
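A good supervised learning algorithm depends on good features. In named entity recognition, a feature might be whether the word contains a capital letter. So which features does NLTK use? They are listed below: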
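- The shape of the word (contains digits / starts with a capital letter / contains symbols)
- The length of the word
- The first three letters of the word
- The last three letters of the word
- The POS tag of the word
- The word itself
- Whether the word appears in an English word list (the en-wordlist feature)
- The entity tag assigned to the preceding word (prevtag, i.e. whether the preceding word was labeled as part of an entity)
- The POS tag of the preceding word
- The POS tag of the following word
- The preceding word itself
- The following word itself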
- …
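The feature list above can be sketched as a small function. This is a simplified illustration, not NLTK's actual feature detector: `shape` and `sketch_features` are hypothetical names, the regular expressions are assumptions, and the word-list lookup is left out because it needs an external dictionary.

```python
import re

def shape(word):
    """Coarse word shape (an approximation of the shape values in the report)."""
    if re.fullmatch(r"[0-9]+(\.[0-9]*)?|[0-9]*\.[0-9]+", word):
        return "number"
    if re.fullmatch(r"\W+", word):
        return "punct"
    if re.fullmatch(r"[A-Z][a-z]+", word):
        return "upcase"
    if re.fullmatch(r"[a-z]+", word):
        return "downcase"
    if re.fullmatch(r"\w+", word):
        return "mixedcase"
    return "other"

def sketch_features(tokens, i, history):
    """Features for tokens[i]; tokens are (word, pos) pairs, history holds earlier tags."""
    word, pos = tokens[i]
    prevword, prevpos = tokens[i - 1] if i > 0 else (None, None)
    nextword, nextpos = tokens[i + 1] if i + 1 < len(tokens) else (None, None)
    prevtag = history[i - 1] if i > 0 else None
    return {
        "shape": shape(word),
        "wordlen": len(word),
        "prefix3": word[:3].lower(),
        "suffix3": word[-3:].lower(),
        "pos": pos,
        "word": word,
        "prevtag": prevtag,
        "prevpos": prevpos,
        "nextpos": nextpos,
        "prevword": prevword,
        "nextword": nextword,
    }

# The tagged tokens of example 1, with every earlier word assumed to be tagged 'O'.
tokens = [('I', 'PRP'), ('am', 'VBP'), ('very', 'RB'), ('excited', 'JJ'),
          ('about', 'IN'), ('the', 'DT'), ('next', 'JJ'), ('generation', 'NN'),
          ('of', 'IN'), ('Apple', 'NNP'), ('products', 'NNS'), ('.', '.')]
apple = sketch_features(tokens, 9, ['O'] * 9)
print(apple)
```

Because the features include the previous word, its POS tag, and its entity tag, the same word can receive different labels in different sentences.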
The code below reports the labels NLTK uses and explains the tag chosen for each word:
import nltk

# Load the serialized (pickled) chunker
chunker = nltk.data.load('chunkers/maxent_ne_chunker/english_ace_multiclass.pickle')

# The maximum entropy classifier wrapped by the chunker
maxEnt = chunker._tagger.classifier()

def maxEnt_report():
    print("These are the labels used by the NLTK's NEC...")
    print(maxEnt.labels())
    print('')
    print('These are the most informative features found in the ACE corpora...')
    maxEnt.show_most_informative_features()

def ne_report(sentence, report_all=False):
    # Tokenize and POS-tag the sentence
    tokens = nltk.word_tokenize(sentence)
    tokens = nltk.pos_tag(tokens)
    tags = []
    for i in range(len(tokens)):
        # Features for word i, given the tags chosen so far
        featureset = chunker._tagger.feature_detector(tokens, i, tags)
        tag = chunker._tagger.choose_tag(tokens, i, tags)
        # Explain only entity words unless report_all is set
        if tag != 'O' or report_all:
            print("\nExplanation on the why the word '" + tokens[i][0] + "' was tagged:")
            maxEnt.explain(featureset)
        tags.append(tag)

if __name__ == '__main__':
    maxEnt_report()
    ne_report('I am very excited about the next generation of Apple products.')
    # report_all=True so that words tagged 'O' (such as Apple here) are explained too
    ne_report('I bought these Apple products today.', report_all=True)
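The report below lists the labels NLTK uses. The prefixes "I-", "B-", and "O" stand for inside, begin, and outside. When a chunk starts, its first word gets the "B-" prefix to mark the beginning of the chunk. The next word, if it belongs to the same chunk, gets the "I-" prefix, meaning it is part of the chunk but not its start. A word that belongs to no chunk is labeled "O", meaning it lies outside any chunk.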
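The B-/I-/O scheme described above can be decoded back into chunks with a few lines of code. This is a minimal sketch; `bio_to_chunks` is a hypothetical helper, not an NLTK function, and the sample tags are invented for illustration.

```python
def bio_to_chunks(tagged):
    """Group (word, tag) pairs into (entity_type, [words]) chunks."""
    chunks = []
    current = None  # the chunk currently being extended, if any
    for word, tag in tagged:
        if tag == "O":
            current = None  # outside any chunk
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "I" and current is not None and current[0] == label:
            current[1].append(word)  # continue the open chunk
        else:
            # "B-" always opens a chunk; a stray "I-" is treated the same way
            current = (label, [word])
            chunks.append(current)
    return chunks

# Invented example tags, for illustration only.
sample = [("New", "B-GPE"), ("York", "I-GPE"), ("is", "O"),
          ("home", "O"), ("to", "O"), ("John", "B-PERSON"), ("Smith", "I-PERSON")]
print(bio_to_chunks(sample))
```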
➜ test python dd.py
These are the labels used by the NLTK's NEC...
['I-GSP', 'B-LOCATION', 'B-GPE', 'I-ORGANIZATION', 'I-PERSON', 'O', 'I-FACILITY', 'I-LOCATION', 'B-PERSON', 'B-FACILITY', 'B-GSP', 'B-ORGANIZATION', 'I-GPE']

These are the most informative features found in the ACE corpora...
  10.125 bias==True and label is 'O'
   6.631 suffix3=='day' and label is 'O'
  -6.207 bias==True and label is 'I-GSP'
   5.628 prevtag=='O' and label is 'O'
  -4.740 shape=='upcase' and label is 'O'
   4.106 shape+prevtag=='<function shape at 0x8bde0d4>+O' and label is 'O'
  -3.994 shape=='mixedcase' and label is 'O'
   3.992 pos+prevtag=='NNP+B-PERSON' and label is 'I-PERSON'
   3.890 prevtag=='I-ORGANIZATION' and label is 'I-ORGANIZATION'
   3.879 shape+prevtag=='<function shape at 0x8bde0d4>+I-ORGANIZATION' and label is 'I-ORGANIZATION'
Note:
- GPE is Geo-Political Entity
- GSP is Geo-Socio-Political group
Output for example 1:
Explanation on the why the word 'Apple' was tagged:
Feature                            B-GPE       O B-ORGAN   B-GSP
----------------------------------------------------------------
prevtag=='O' (1)                   3.767
shape=='upcase' (1)                2.701
pos+prevtag=='NNP+O' (1)           2.254
en-wordlist==False (1)             2.095
label is 'B-GPE' (1)              -2.005
bias==True (1)                    -1.975
prevword=='of' (1)                 0.742
pos=='NNP' (1)                     0.681
nextpos=='nns' (1)                 0.661
prevpos=='IN' (1)                  0.311
wordlen==5 (1)                     0.113
nextword=='products' (1)           0.060
bias==True (1)                            10.125
prevtag=='O' (1)                           5.628
shape=='upcase' (1)                       -4.740
prevpos=='IN' (1)                         -1.668
label is 'O' (1)                          -1.075
pos=='NNP' (1)                            -1.024
suffix3=='ple' (1)                         0.797
en-wordlist==False (1)                     0.698
wordlen==5 (1)                            -0.449
prevword=='of' (1)                        -0.217
nextpos=='nns' (1)                         0.104
prefix3=='app' (1)                         0.089
pos+prevtag=='NNP+O' (1)                   0.011
nextword=='products' (1)                   0.005
prevtag=='O' (1)                                   3.389
pos+prevtag=='NNP+O' (1)                           1.725
bias==True (1)                                     0.955
en-wordlist==False (1)                             0.837
label is 'B-ORGANIZATION' (1)                      0.718
nextpos=='nns' (1)                                 0.365
wordlen==5 (1)                                    -0.351
pos=='NNP' (1)                                     0.174
prevpos=='IN' (1)                                 -0.139
prevword=='of' (1)                                 0.131
prefix3=='app' (1)                                -0.126
shape=='upcase' (1)                               -0.084
suffix3=='ple' (1)                                -0.077
prevtag=='O' (1)                                           2.925
pos+prevtag=='NNP+O' (1)                                   2.213
shape=='upcase' (1)                                        0.929
en-wordlist==False (1)                                     0.891
bias==True (1)                                            -0.592
label is 'B-GSP' (1)                                      -0.565
prevpos=='IN' (1)                                          0.410
nextpos=='nns' (1)                                         0.399
pos=='NNP' (1)                                             0.393
prevword=='of' (1)                                         0.184
wordlen==5 (1)                                             0.177
----------------------------------------------------------------
TOTAL:                             9.406   8.283   7.515   7.366
PROBS:                             0.453   0.208   0.122   0.110
The probabilities in the last row sum to 0.893 rather than 1, because only the four most probable labels are shown.
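The relation between the TOTAL and PROBS rows can be checked numerically. The sketch below assumes each total is a base-2 log weight, so that a label's probability is proportional to 2 raised to its total; this is an assumption, but the numbers in the report are consistent with it. Since only four labels are shown, the estimate is rescaled by the 0.893 of probability mass they cover.

```python
# TOTAL and PROBS rows copied from the example 1 report.
totals = {"B-GPE": 9.406, "O": 8.283, "B-ORGAN": 7.515, "B-GSP": 7.366}
probs = {"B-GPE": 0.453, "O": 0.208, "B-ORGAN": 0.122, "B-GSP": 0.110}

# Unnormalized weight of each label, assuming base-2 log scores.
weights = {label: 2 ** total for label, total in totals.items()}

# The four shown labels cover only part of the distribution.
shown_mass = sum(probs.values())
weight_sum = sum(weights.values())
estimated = {label: w / weight_sum * shown_mass for label, w in weights.items()}

for label in totals:
    print(label, round(estimated[label], 3), probs[label])

# The ratio of two probabilities depends only on the difference of their totals.
print(2 ** (totals["B-GPE"] - totals["O"]), probs["B-GPE"] / probs["O"])
```

A gap of about 1.1 between the B-GPE and O totals is therefore enough to make B-GPE roughly twice as likely as O.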
Output for example 2:
Explanation on the why the word 'Apple' was tagged:
Feature                                O   B-GPE B-ORGAN B-LOCAT
----------------------------------------------------------------
bias==True (1)                    10.125
prevtag=='O' (1)                   5.628
shape=='upcase' (1)               -4.740
label is 'O' (1)                  -1.075
pos=='NNP' (1)                    -1.024
suffix3=='ple' (1)                 0.797
en-wordlist==False (1)             0.698
prevpos=='DT' (1)                  0.585
wordlen==5 (1)                    -0.449
nextpos=='nns' (1)                 0.104
prefix3=='app' (1)                 0.089
prevword=='these' (1)             -0.024
pos+prevtag=='NNP+O' (1)           0.011
nextword=='products' (1)           0.005
prevtag=='O' (1)                           3.767
shape=='upcase' (1)                        2.701
pos+prevtag=='NNP+O' (1)                   2.254
en-wordlist==False (1)                     2.095
label is 'B-GPE' (1)                      -2.005
bias==True (1)                            -1.975
pos=='NNP' (1)                             0.681
nextpos=='nns' (1)                         0.661
prevpos=='DT' (1)                         -0.181
wordlen==5 (1)                             0.113
nextword=='products' (1)                   0.060
prevtag=='O' (1)                                   3.389
pos+prevtag=='NNP+O' (1)                           1.725
bias==True (1)                                     0.955
en-wordlist==False (1)                             0.837
label is 'B-ORGANIZATION' (1)                      0.718
prevpos=='DT' (1)                                 -0.494
nextpos=='nns' (1)                                 0.365
wordlen==5 (1)                                    -0.351
pos=='NNP' (1)                                     0.174
prefix3=='app' (1)                                -0.126
shape=='upcase' (1)                               -0.084
suffix3=='ple' (1)                                -0.077
prevword=='these' (1)                              0.067
prevtag=='O' (1)                                           2.682
label is 'B-LOCATION' (1)                                 -2.038
pos+prevtag=='NNP+O' (1)                                   1.724
shape=='upcase' (1)                                        1.275
prefix3=='app' (1)                                         1.169
bias==True (1)                                             0.747
prevpos=='DT' (1)                                          0.745
pos=='NNP' (1)                                             0.616
en-wordlist==False (1)                                    -0.309
nextpos=='nns' (1)                                         0.151
wordlen==5 (1)                                             0.041
----------------------------------------------------------------
TOTAL:                            10.730   8.171   7.095   6.802
PROBS:                             0.697   0.118   0.056   0.046
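So the only difference between examples 1 and 2 in the B-GPE score comes down to the following three lines:
prevword=='of' (1) 0.742
prevpos=='IN' (1) 0.311
prevpos=='DT' (1) -0.181
In example 1, the preceding word "of" (POS tag IN) adds 1.053 to the B-GPE score, while in example 2 the preceding determiner "these" subtracts 0.181, so the B-GPE total drops from 9.406 to 8.171. The same context features also move the O score, from 8.283 in example 1 to 10.730 in example 2.
As a result, Apple is tagged B-GPE in example 1 and O in example 2.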
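The comparison can be reproduced mechanically by diffing the per-feature weights of the B-GPE column in the two reports. The weights below are copied from the outputs above; `differing_features` is a hypothetical helper written for this sketch.

```python
# B-GPE feature weights copied from the example 1 report.
gpe_example1 = {
    "prevtag=='O'": 3.767, "shape=='upcase'": 2.701, "pos+prevtag=='NNP+O'": 2.254,
    "en-wordlist==False": 2.095, "label is 'B-GPE'": -2.005, "bias==True": -1.975,
    "prevword=='of'": 0.742, "pos=='NNP'": 0.681, "nextpos=='nns'": 0.661,
    "prevpos=='IN'": 0.311, "wordlen==5": 0.113, "nextword=='products'": 0.060,
}
# B-GPE feature weights copied from the example 2 report.
gpe_example2 = {
    "prevtag=='O'": 3.767, "shape=='upcase'": 2.701, "pos+prevtag=='NNP+O'": 2.254,
    "en-wordlist==False": 2.095, "label is 'B-GPE'": -2.005, "bias==True": -1.975,
    "pos=='NNP'": 0.681, "nextpos=='nns'": 0.661, "prevpos=='DT'": -0.181,
    "wordlen==5": 0.113, "nextword=='products'": 0.060,
}

def differing_features(a, b):
    """Features whose weight differs, or that fire in only one of the two examples."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

diff = differing_features(gpe_example1, gpe_example2)
for name, (w1, w2) in diff.items():
    print(name, w1, w2)

# The column sums reproduce the TOTAL row up to rounding of the printed weights.
print(round(sum(gpe_example1.values()), 3), round(sum(gpe_example2.values()), 3))
```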
References:
[1] http://www.nltk.org/book/ch07.html
[2] http://spark-public.s3.amazonaws.com/nlp/slides/Information_Extraction_and_Named_Entity_Recognition_v2.pdf
[3] http://www.mattshomepage.com/#/blog/feb2013/liftingthehood
Source: https://www.cnblogs.com/webRobot/p/6088625.html