Sentence Segmentation using Spacy

说谎 2021-01-02 08:22

I am new to spaCy and NLP, and I am facing the issue below while doing sentence segmentation with spaCy.

The text I am trying to tokenise into sentences contains numbered li

1 answer
  • 2021-01-02 08:40

    When you use a pretrained model with spaCy, sentences are split according to the data the model was trained on.

    Of course, there are cases like yours where someone may want custom sentence segmentation logic. This is possible by adding a custom component to the spaCy pipeline.

    For your case, you can add a rule that prevents a sentence split right after a {number}. pattern, so the list marker stays attached to its item.

    A workaround for your problem:

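    import re

    import spacy
    from spacy.language import Language

    # Written for the spaCy v3 API; needs: python -m spacy download en_core_web_sm
    # (spaCy v2 used spacy.load('en') and nlp.add_pipe(custom_seg, before='parser').)
    nlp = spacy.load('en_core_web_sm')
    # Matches a token made up only of digits, e.g. the "1" in "1." or the "10" in "10."
    boundary = re.compile('^[0-9]+$')

    @Language.component('custom_seg')
    def custom_seg(doc):
        prev = ''
        length = len(doc)
        for index, token in enumerate(doc):
            # A "." right after a number is a list marker, not a sentence end,
            # so the token that follows it must not start a new sentence.
            if token.text == '.' and boundary.match(prev) and index != (length - 1):
                doc[index + 1].is_sent_start = False
            prev = token.text
        return doc

    # Run before the parser so that it respects the boundaries set above.
    nlp.add_pipe('custom_seg', before='parser')
    text = 'This is first sentence.\nNext is numbered list.\n1. Hello World!\n2. Hello World2!\n3. Hello World!'
    doc = nlp(text)
    for sentence in doc.sents:
        print(sentence.text)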
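
    The rule inside custom_seg can also be tried without loading a model. The sketch below is only an illustration: blocked_sentence_starts is a hypothetical helper (not part of spaCy) that takes a plain list of token strings, assumed to be split the way the tokenizer splits "1." into "1" and ".", and returns the positions where a new sentence must not begin. The pattern here accepts multi-digit numbers such as "10" as well.

```python
import re

# A token made up only of digits, e.g. "1" or "10".
LIST_NUMBER = re.compile(r"^[0-9]+$")


def blocked_sentence_starts(tokens):
    """Return indices of tokens that must not start a sentence.

    Hypothetical helper for illustration: a token is blocked when it
    directly follows a "." that itself follows a purely numeric token.
    """
    blocked = []
    # Start at 1 so there is a previous token; stop before the last
    # token so there is always a following token to block.
    for i in range(1, len(tokens) - 1):
        if tokens[i] == "." and LIST_NUMBER.match(tokens[i - 1]):
            blocked.append(i + 1)
    return blocked


tokens = ["1", ".", "Hello", "World", "!", "\n", "10", ".", "Bye", "."]
print(blocked_sentence_starts(tokens))  # [2, 8]
```

    In the pipeline component, each returned position corresponds to a token whose sentence-start flag would be set to False. A "." after an ordinary word, or a number followed by a "." at the very end of the text, produces no blocked position.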
    

    Hope it helps!
