Question
I am attempting to chunk sentences using RegEx at the word 'but' (or any other coordinating conjunction words). It's not working...
import nltk
from nltk import word_tokenize

sentence = nltk.pos_tag(word_tokenize("There are no large collections present but there is spinal canal stenosis."))
result = nltk.RegexpParser(grammar).parse(sentence)
DigDug = nltk.RegexpParser(r'CHUNK: {.*<CC>.*}')
for subtree in DigDug.parse(sentence).subtrees():
    if subtree.label() == 'CHUNK': print(subtree.node())
I need to split the sentence "There are no large collections present but there is spinal canal stenosis."
into two:
1. "There are no large collections present"
2. "there is spinal canal stenosis."
I also wish to use the same code to split sentences at 'and' and other coordinating conjunction (CC) words. But my code isn't working. Please help.
Answer 1:
I think you can simply do
import re
result = re.split(r"\s+(?:but|and)\s+", sentence)
where
`\s` Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+` Between one and unlimited times, as many times as possible, giving back as needed (greedy)
`(?:` Match the regular expression below, but do not capture it
`but` Match the characters "but" literally
`|` Or match the alternative below (the group fails only if both alternatives fail)
`and` Match the characters "and" literally
`)` End of the non-capturing group
`\s` Match a single character that is a "whitespace character" (spaces, tabs, line breaks, etc.)
`+` Between one and unlimited times, as many times as possible, giving back as needed (greedy)
You can add more conjunction words in there, separated by a pipe character |. Take care though that these words do not contain characters that have special meaning in regex. If in doubt, escape them first with re.escape(word).
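For example, a minimal sketch of that approach (assuming the conjunction list below, and that sentence holds the raw string rather than the POS-tagged token list):

import re

# Hypothetical list of coordinating conjunctions to split on
conjunctions = ["but", "and", "or", "yet", "so"]

# Escape each word in case it contains regex metacharacters, then join with |
pattern = r"\s+(?:" + "|".join(re.escape(w) for w in conjunctions) + r")\s+"

sentence = "There are no large collections present but there is spinal canal stenosis."
print(re.split(pattern, sentence))
# ['There are no large collections present', 'there is spinal canal stenosis.']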
Answer 2:
If you want to avoid hardcoding conjunction words like 'but' and 'and', try chinking along with chunking:
import nltk
import nltk

Digdug = nltk.RegexpParser(r"""
    CHUNK_AND_CHINK:
        {<.*>+}    # Chunk everything
        }<CC>+{    # Chink sequences of CC
    """)
sentence = nltk.pos_tag(nltk.word_tokenize("There are no large collections present but there is spinal canal stenosis."))
result = Digdug.parse(sentence)

for subtree in result.subtrees(filter=lambda t: t.label() == 'CHUNK_AND_CHINK'):
    print(subtree)
Chinking basically excludes what we don't need from a chunk phrase, 'but' in this case. For more details, see: http://www.nltk.org/book/ch07.html
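If you want plain text strings rather than parse trees, one possible follow-up (a sketch built on the result variable from the code above) is to join the leaves of each chunk back together; each leaf is a (word, POS-tag) tuple:

chunks = [" ".join(word for word, tag in subtree.leaves())
          for subtree in result.subtrees(filter=lambda t: t.label() == 'CHUNK_AND_CHINK')]
print(chunks)
# ['There are no large collections present', 'there is spinal canal stenosis .']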
Source: https://stackoverflow.com/questions/52014482/chunking-sentences-using-the-word-but-with-regex