Chunking sentences using the word 'but' with RegEx

十年热恋 submitted on 2021-02-07 20:43:04


I am attempting to chunk sentences using RegEx at the word 'but' (or any other coordinating conjunction words). It's not working...

sentence = nltk.pos_tag(word_tokenize("There are no large collections present but there is spinal canal stenosis."))
result = nltk.RegexpParser(grammar).parse(sentence)
DigDug = nltk.RegexpParser(r'CHUNK: {.*<CC>.*}')
for subtree in DigDug.parse(sentence).subtrees(): 
    if subtree.label() == 'CHUNK': print(subtree.node())

I need to split the sentence "There are no large collections present but there is spinal canal stenosis." into two:

1. "There are no large collections present"
2. "there is spinal canal stenosis."

I also wish to use the same code to split sentences at 'and' and other coordinating conjunction (CC) words. But my code isn't working. Please help.


I think you can simply do

import re
# `sentence` must be the raw text string here, not the POS-tagged token list
result = re.split(r"\s+(?:but|and)\s+", sentence)


`\s`        Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+`         Between one and unlimited times, as many times as possible, giving back as needed (greedy)
`(?:`       Match the regular expression below, do not capture
            Match either the regular expression below (attempting the next alternative only if this one fails)
  `but`     Match the characters "but" literally
  `|`       Or match regular expression number 2 below (the entire group fails if this one fails to match)
  `and`     Match the characters "and" literally
`\s`        Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+`         Between one and unlimited times, as many times as possible, giving back as needed (greedy)
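As a quick check, here is a minimal sketch that applies the pattern to the sentence from the question. The variable names `text` and `parts` are only for illustration.

```python
import re

# Split the raw sentence (a plain string) at "but" or "and" surrounded by whitespace.
text = "There are no large collections present but there is spinal canal stenosis."
parts = re.split(r"\s+(?:but|and)\s+", text)

for part in parts:
    print(part)
```

Because the whitespace on both sides is part of the match, the conjunction and its surrounding spaces are dropped from the result.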

You can add more conjunctions to the group, separated by the pipe character |. Take care, though, that these words do not contain characters that have special meaning in regex. If in doubt, escape them first with re.escape(word).
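One possible way to build the alternation from a word list is sketched below. The list `conjunctions`, the made-up test sentence, and the use of re.IGNORECASE are my own assumptions, not part of the answer above.

```python
import re

# Illustrative list (an assumption); extend it with whichever conjunctions you need.
conjunctions = ["but", "and", "or", "yet", "so"]

# re.escape guards against words that contain regex metacharacters.
alternation = "|".join(re.escape(word) for word in conjunctions)
pattern = re.compile(r"\s+(?:" + alternation + r")\s+", flags=re.IGNORECASE)

# Made-up sentence, used only to exercise the pattern.
text = "The scan is clear and the patient is stable but follow-up is advised."
parts = pattern.split(text)
print(parts)
```

Since the pattern requires whitespace on both sides of the word, it will not split inside longer words such as "stand", but it also will not match a conjunction at the very start of the string.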


If you want to avoid hardcoding conjunction words like 'but' and 'and', try chinking along with chunking:

import nltk

# The grammar needs a label; CHUNK is reused from the question's code.
Digdug = nltk.RegexpParser(r"""
CHUNK:
{<.*>+}          # Chunk everything
}<CC>+{          # Chink sequences of CC
""")
sentence = nltk.pos_tag(nltk.word_tokenize("There are no large collections present but there is spinal canal stenosis."))

result = Digdug.parse(sentence)

# Print only the CHUNK subtrees, skipping the root of the tree
for subtree in result.subtrees(filter=lambda t: t.label() == 'CHUNK'):
    print(subtree)
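To see what the chink rule does without installing NLTK, here is a rough plain-Python imitation. It is not NLTK's implementation: the function `split_at_tag` is hypothetical, and the (word, tag) pairs are written by hand as a stand-in for nltk.pos_tag output, so the actual tags may differ.

```python
from itertools import groupby

# Hand-written (word, tag) pairs; the tags are illustrative assumptions,
# not verified output from nltk.pos_tag.
tagged = [("There", "EX"), ("are", "VBP"), ("no", "DT"), ("large", "JJ"),
          ("collections", "NNS"), ("present", "JJ"), ("but", "CC"),
          ("there", "EX"), ("is", "VBZ"), ("spinal", "JJ"),
          ("canal", "NN"), ("stenosis", "NN"), (".", ".")]

def split_at_tag(tagged_tokens, tag="CC"):
    """Group consecutive tokens and drop every run whose tag equals `tag`."""
    chunks = []
    for is_chink, group in groupby(tagged_tokens, key=lambda pair: pair[1] == tag):
        if not is_chink:
            chunks.append([word for word, _ in group])
    return chunks

chunks = split_at_tag(tagged)
for words in chunks:
    print(" ".join(words))
```

Everything is kept except the runs of CC tokens, which act as the cut points between chunks.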

Chinking basically excludes what we don't need from a chunk phrase ('but' in this case). For more details, see:

