Splitting a Chinese document into sentences [closed]

Posted by 雨燕双飞 on 2019-12-04 09:18:27

Using some regex tricks in Python (cf. Section 2.3 of http://aclweb.org/anthology/Y/Y11/Y11-1038.pdf; the regex below is a modified version of the one given there):

import re

paragraph = u'\u70ed\u5e26\u98ce\u66b4\u5c1a\u5854\u5c14\u662f2001\u5e74\u5927\u897f\u6d0b\u98d3\u98ce\u5b63\u7684\u4e00\u573a\u57288\u6708\u7a7f\u8d8a\u4e86\u52a0\u52d2\u6bd4\u6d77\u7684\u5317\u5927\u897f\u6d0b\u70ed\u5e26\u6c14\u65cb\u3002\u5c1a\u5854\u5c14\u4e8e8\u670814\u65e5\u7531\u70ed\u5e26\u5927\u897f\u6d0b\u7684\u4e00\u80a1\u4e1c\u98ce\u6ce2\u53d1\u5c55\u800c\u6210\uff0c\u5176\u5b58\u5728\u7684\u5927\u90e8\u5206\u65f6\u95f4\u91cc\u90fd\u5728\u5feb\u901f\u5411\u897f\u79fb\u52a8\uff0c\u9000\u5316\u6210\u4e1c\u98ce\u6ce2\u540e\u7a7f\u8d8a\u4e86\u5411\u98ce\u7fa4\u5c9b\u3002'

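def zng(paragraph):
    # Each match is a run of non-terminator characters plus an optional
    # terminator (full-width ！？。 or ASCII . ! ?).
    for sent in re.findall(u'[^！？。.!?]+[！？。.!?]?', paragraph, flags=re.U):
        yield sent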

list(zng(paragraph))

Regex explanation: https://regex101.com/r/eNFdqM/2
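A slightly extended variant of the same idea is sketched below. It is not from the paper or the answer above; the terminator set, the list of closing quotes, and the name split_sentences are all assumptions made for illustration. It keeps a closing quote such as ” attached to the sentence it ends, treats runs like ？！ as one terminator, and leaves out the ASCII full stop so that decimals such as 3.14 are not split.

```python
import re

# Assumed terminator set: full-width and ASCII variants, without the ASCII full stop.
TERMINATORS = u'。！？!?'
# Assumed closing quotes/brackets that may trail a terminator and stay with the sentence.
CLOSERS = u'”’」』）)"\''

# A run of non-terminators, then any terminators, then any closing quotes.
SENTENCE_RE = re.compile(
    u'[^{t}]+[{t}]*[{c}]*'.format(t=re.escape(TERMINATORS), c=re.escape(CLOSERS))
)

def split_sentences(text):
    """Return the non-empty sentences, each keeping its trailing punctuation."""
    return [s.strip() for s in SENTENCE_RE.findall(text) if s.strip()]
```

Calling split_sentences(paragraph) on the sample paragraph above should yield the same two sentences as zng, since that text contains only 。 as a terminator.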


Either of these open-source projects should be useful, as far as I know:

For unsegmented text, if you are using the Stanford libraries, you probably want their Chinese CoreNLP. It isn't as well documented as the base CoreNLP, but it will work for your task.

http://nlp.stanford.edu/software/corenlp-faq.shtml#languages
http://nlp.stanford.edu/software/corenlp.shtml

You will want the segmenter and the sentence splitter, i.e. the annotators "segment, ssplit". The other annotators are not relevant.
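If you would rather stay in Python, one option is to run the CoreNLP server and talk to it over HTTP. The sketch below is only an illustration under several assumptions: a server with the Chinese models is already running at http://localhost:9000, the annotator list 'tokenize,ssplit' is what a recent release expects (the older Chinese pipeline described here used "segment, ssplit", so adjust to your version), and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

def corenlp_url(base='http://localhost:9000', annotators='tokenize,ssplit'):
    """Build a request URL for a CoreNLP server (address and annotators are assumptions)."""
    props = {'annotators': annotators, 'outputFormat': 'json'}
    query = urllib.parse.urlencode({
        'properties': json.dumps(props),
        'pipelineLanguage': 'zh',  # ask the server to use its Chinese defaults
    })
    return base + '/?' + query

def sentences_from_response(text, response):
    """Recover sentences by slicing the original text with the token character offsets.

    Offsets are assumed to line up with Python string indices, which holds for
    ordinary (BMP) Chinese characters.
    """
    out = []
    for sent in response.get('sentences', []):
        tokens = sent.get('tokens', [])
        if tokens:
            start = tokens[0]['characterOffsetBegin']
            end = tokens[-1]['characterOffsetEnd']
            out.append(text[start:end])
    return out

def split_with_corenlp(text, url=None):
    """POST the text to the server and return the sentences it reports."""
    req = urllib.request.Request(url or corenlp_url(), data=text.encode('utf-8'))
    with urllib.request.urlopen(req) as resp:
        return sentences_from_response(text, json.loads(resp.read().decode('utf-8')))
```

With the server running, split_with_corenlp(paragraph) would return one string per sentence; sentences_from_response can also be used on a reply you have already saved.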

Alternatively, you can use the sentence-splitting class in edu.stanford.nlp.process directly (WordToSentenceProcessor in the CoreNLP source). If you do that, look at how it is used in WordsToSentencesAnnotator.
