Split Sentences at Bullets and Numbering?


I am trying to feed text into my word processor so that it is split into sentences first and then into words. Bulleted and numbered items should each come out as their own sentence.

An example paragraph:

When the blow was repeated,t         


        
2 Answers
  • 2021-01-26 19:38

    This is one possible solution. You can adapt the regular expression to the list markers that appear in your data.

    text = """When the blow was repeated,together with an admonition in
    childish sentences, he turned over upon his back, and held his paws in a peculiar manner.
    
    1) This a numbered sentence
    2) This is the second numbered sentence
    
    At the same time with his ears and his eyes he offered a small prayer to the child.
    
    Below are the examples
    - This an example of bullet point sentence
    - This is also an example of bullet point sentence"""
    
    
    
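    import re
    import nltk
    
    # sent_tokenize relies on NLTK's Punkt models (a one-time download).
    sentences = nltk.sent_tokenize(text)
    results = []
    
    for sent in sentences:
        # Insert a blank line before every line that starts with "-" or a digit,
        # so that list items can be split off from the surrounding sentence.
        sent = re.sub(r'(\n)(-|[0-9])', r"\1\n\2", sent)
        for s in sent.split('\n\n'):
            results.append(nltk.word_tokenize(s))
    
    results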
    
    [
    ['When', 'the', 'blow', 'was', 'repeated', ',', 'together', 'with', 'an', 'admonition', 'in', 'childish', 'sentences', ',', 'he', 'turned', 'over', 'upon', 'his', 'back', ',', 'and', 'held', 'his', 'paws', 'in', 'a', 'peculiar', 'manner', '.'],
    ['1', ')', 'This', 'a', 'numbered', 'sentence'],
    ['2', ')', 'This', 'is', 'the', 'second', 'numbered', 'sentence'],
    ['At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes', 'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child', '.'],
    ['Below', 'are', 'the', 'examples'],
    ['-', 'This', 'an', 'example', 'of', 'bullet', 'point', 'sentence'],
    ['-', 'This', 'is', 'also', 'an', 'example', 'of', 'bullet', 'point', 'sentence']
    ]
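
The regex above treats any line that starts with a digit or a hyphen as a list item, so a line such as "3.14 is ..." would also be split off. A stricter variant might look like the sketch below. It uses only the standard library and separates list items from ordinary paragraphs before any sentence tokenizer runs. The marker pattern (`-`, `*`, `•`, or a number followed by `)` or `.`) and the helper name `split_blocks` are assumptions made for illustration, not part of the answer above or of NLTK.

```python
import re

# Assumed list-marker shapes: "-", "*", "•", or digits followed by ")" or ".",
# each followed by at least one whitespace character.
LIST_ITEM = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def split_blocks(text):
    """Return chunks: one per list item, plus merged runs of ordinary lines."""
    chunks = []
    # Paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        buffer = []  # consecutive non-list lines that belong together
        for line in paragraph.splitlines():
            if LIST_ITEM.match(line):
                # Flush any pending prose before emitting the list item.
                if buffer:
                    chunks.append(" ".join(buffer))
                    buffer = []
                chunks.append(line.strip())
            elif line.strip():
                buffer.append(line.strip())
        if buffer:
            chunks.append(" ".join(buffer))
    return chunks


sample = (
    "He turned over upon his back, and held\n"
    "his paws in a peculiar manner.\n"
    "\n"
    "1) This a numbered sentence\n"
    "2) This is the second numbered sentence\n"
    "\n"
    "Below are the examples\n"
    "- This an example of bullet point sentence"
)

chunks = split_blocks(sample)
for chunk in chunks:
    print(chunk)
```

Each chunk could then be passed to `nltk.sent_tokenize` and `nltk.word_tokenize` as in the answer above. One limitation of this sketch: a wrapped continuation line of a list item is treated as ordinary prose and becomes a separate chunk.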
    
  • 2021-01-26 19:40

    I'm not sure about spaCy. In Ruby you could use PragmaticSegmenter and PragmaticTokenizer.

    text = "When the blow was repeated,together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner.\n\n1) This a numbered sentence\n2) This is the second numbered sentence\n\nAt the same time with his ears and his eyes he offered a small prayer to the child.\n\nBelow are the examples\n- This an example of bullet point sentence\n- This is also an example of bullet point sentence"
    
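    require 'pragmatic_segmenter'
    require 'pragmatic_tokenizer'
    
    final_array = []
    # Split the text into sentences, then tokenize each sentence into words.
    segments = PragmaticSegmenter::Segmenter.new(text: text).segment
    segments.each do |segment|
      final_array << PragmaticTokenizer::Tokenizer.new(downcase: false).tokenize(segment)
    end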
    
    final_array
    
    => 
    [
      ["When", "the", "blow", "was", "repeated", ",", "together", "with", "an", "admonition", "in", "childish", "sentences", ",", "he", "turned", "over", "upon", "his", "back", ",", "and", "held", "his", "paws", "in", "a", "peculiar", "manner", "."], 
      ["1", ")", "This", "a", "numbered", "sentence"], 
      ["2", ")", "This", "is", "the", "second", "numbered", "sentence"], 
      ["At", "the", "same", "time", "with", "his", "ears", "and", "his", "eyes", "he", "offered", "a", "small", "prayer", "to", "the", "child", "."], 
      ["Below", "are", "the", "examples"], 
      ["-", "This", "an", "example", "of", "bullet", "point", "sentence"], 
      ["-", "This", "is", "also", "an", "example", "of", "bullet", "point", "sentence"]
    ]
    