Question
I want to identify the sentence structure of English text using spaCy and textacy.
For example: "The cat sat on the mat." is SVO, "The cat jumped and picked up the biscuit." is SVVO, and "The cat ate the biscuit and cookies." is SVOO.
The program should read a paragraph and return a label for each sentence: SVO, SVOO, SVVO, or another custom structure.
Efforts so far:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
# Load Library files
import en_core_web_sm
import spacy
import textacy
nlp = en_core_web_sm.load()
SUBJ = ["nsubj","nsubjpass"]
VERB = ["ROOT"]
OBJ = ["dobj", "pobj"]
text = nlp(u'The cat sat on the mat. The cat jumped and picked up the biscuit. The cat ate biscuit and cookies.')
sub_toks = [tok for tok in text if (tok.dep_ in SUBJ) ]
obj_toks = [tok for tok in text if (tok.dep_ in OBJ) ]
vrb_toks = [tok for tok in text if (tok.dep_ in VERB) ]
text_ext = list(textacy.extract.subject_verb_object_triples(text))
print("Subjects:", sub_toks)
print("VERB :", vrb_toks)
print("OBJECT(s):", obj_toks)
print ("SVO:", text_ext)
Output:
(u'Subjects:', [cat, cat, cat])
(u'VERB :', [sat, jumped, ate])
(u'OBJECT(s):', [mat, biscuit, biscuit])
(u'SVO:', [(cat, ate, biscuit), (cat, ate, cookies)])
- Issue 1: The SVO triples are overwritten. Why?
- Issue 2: How can I identify each sentence as SVOO, SVO, SVVO, etc.?
Edit 1:
An approach I was considering:
from __future__ import unicode_literals
import spacy,en_core_web_sm
import textacy
nlp = en_core_web_sm.load()
sentence = 'I will go to the mall.'
doc = nlp(sentence)
chk_set = set(['PRP','MD','NN'])
result = chk_set.issubset(t.tag_ for t in doc)
if result == False:
    print("SVO not identified")
elif result == True: # shouldn't do this
    print("SVO")
else:
    print("Others...")
Edit 2:
Made some further progress:
from __future__ import unicode_literals
import spacy,en_core_web_sm
import textacy
nlp = en_core_web_sm.load()
sentence = 'The cat sat on the mat. The cat jumped and picked up the biscuit. The cat ate biscuit and cookies.'
doc = nlp(sentence)
print(" ".join([token.dep_ for token in doc]))
Current output:
det nsubj ROOT prep det pobj punct det nsubj ROOT cc conj prt det dobj punct det nsubj ROOT dobj cc conj punct
Expected output:
SVO SVVO SVOO
The idea is to reduce the dependency tags to a simple subject-verb-object model.
I am considering regex if no other option is available, but only as a last resort.
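A regex-free sketch of that idea, under stated assumptions: split the label sequence at each punct, map subject labels to S, ROOT to V and object labels to O, and let a conj repeat whichever role came just before it. The ROLE table and the helper name labels_to_patterns are hypothetical, made up for illustration. The conj rule is a heuristic based only on label order, not on the dependency tree, and any mid-sentence punctuation would wrongly split a sentence.

```python
# Hypothetical mapping from dependency labels to S/V/O roles (an assumption,
# not part of spaCy or textacy).
ROLE = {"nsubj": "S", "nsubjpass": "S", "ROOT": "V", "dobj": "O", "pobj": "O"}


def labels_to_patterns(labels):
    """Collapse a flat list of dependency labels into one pattern per sentence."""
    patterns, current, last_role = [], "", None
    for dep in labels:
        if dep == "punct":
            # Treat every punct as a sentence boundary (assumes no commas etc.).
            if current:
                patterns.append(current)
            current, last_role = "", None
        elif dep in ROLE:
            last_role = ROLE[dep]
            current += last_role
        elif dep == "conj" and last_role:
            # Heuristic: a conjunct repeats the most recent role.
            current += last_role
        # All other labels (det, prep, cc, prt, ...) are ignored.
    if current:
        patterns.append(current)
    return patterns


# The label sequence printed by the snippet above.
deps = ("det nsubj ROOT prep det pobj punct "
        "det nsubj ROOT cc conj prt det dobj punct "
        "det nsubj ROOT dobj cc conj punct").split()
print(" ".join(labels_to_patterns(deps)))  # SVO SVVO SVOO
```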
Edit 3:
After studying this link, I made some progress.
def testSVOs():
    nlp = en_core_web_sm.load()
    tok = nlp("The cat sat on the mat. The cat jumped for the biscuit. The cat ate biscuit and cookies.")
    svos = findSVOs(tok)
    print(svos)
Current output:
[(u'cat', u'sat', u'mat'), (u'cat', u'jumped', u'biscuit'), (u'cat', u'ate', u'biscuit'), (u'cat', u'ate', u'cookies')]
Expected output:
I am expecting a notation for each sentence. I am able to extract the SVO triples, but I don't know how to convert them into SVO notation. This is more about identifying the pattern than about the content of the sentence itself.
SVO SVO SVOO
Answer 1:
Issue 1: The SVO are overwritten. Why?
This is a textacy issue. That part of the library does not work very well; see this blog.
Issue 2: How to identify the sentence as SVOO SVO SVVO etc.?
You should parse the dependency tree. spaCy provides the information; you just need to write a set of rules to extract it, using the .head, .lefts, .rights and .children attributes.
>>> for word in text:
...     print('%10s %5s %10s %10s %s' % (word.text, word.tag_, word.dep_, word.pos_, word.head.text))
       The    DT        det        DET cat
       cat    NN      nsubj       NOUN sat
       sat   VBD       ROOT       VERB sat
        on    IN       prep        ADP sat
       the    DT        det        DET mat
       mat    NN       pobj       NOUN on
         .     .      punct      PUNCT sat
        of    IN       ROOT        ADP of
       the    DT        det        DET lab
       art    NN   compound       NOUN lab
       lab    NN       pobj       NOUN of
         .     .      punct      PUNCT of
       The    DT        det        DET cat
       cat    NN      nsubj       NOUN jumped
    jumped   VBD       ROOT       VERB jumped
       and    CC         cc      CCONJ jumped
    picked   VBD       conj       VERB jumped
        up    RP        prt       PART picked
       the    DT        det        DET biscuit
   biscuit    NN       dobj       NOUN picked
         .     .      punct      PUNCT jumped
       The    DT        det        DET cat
       cat    NN      nsubj       NOUN ate
       ate   VBD       ROOT       VERB ate
   biscuit    NN       dobj       NOUN ate
       and    CC         cc      CCONJ biscuit
   cookies   NNS       conj       NOUN biscuit
         .     .      punct      PUNCT ate
I recommend you look at this code; just add pobj to the list of OBJECTS, and you will get SVO and SVOO covered. With a little fiddling you can get SVVO as well.
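As a rough illustration of what such rules could look like (a sketch, not the linked code), the snippet below works on plain (text, dep, head_index) tuples copied from the parse printed above, so it runs without spaCy. The ROLE table and the helpers role_of and sentence_pattern are hypothetical names. The one tree-based rule is that a conj token takes the role of its head, which is what separates SVVO (a verb conjunct) from SVOO (an object conjunct).

```python
# Hypothetical label-to-role table; pobj is counted as an object, as suggested above.
ROLE = {"nsubj": "S", "nsubjpass": "S", "ROOT": "V", "dobj": "O", "pobj": "O"}


def role_of(tokens, i):
    """Return 'S', 'V', 'O' or None for token i of one sentence."""
    _text, dep, head = tokens[i]
    if dep == "conj" and head != i:
        # A conjunct takes the role of the token it is coordinated with.
        return role_of(tokens, head)
    return ROLE.get(dep)


def sentence_pattern(tokens):
    """Concatenate the roles of all tokens in surface order."""
    roles = (role_of(tokens, i) for i in range(len(tokens)))
    return "".join(r for r in roles if r)


# (text, dep, index of head within the sentence), taken from the table above.
SAT = [("The", "det", 1), ("cat", "nsubj", 2), ("sat", "ROOT", 2),
       ("on", "prep", 2), ("the", "det", 5), ("mat", "pobj", 3),
       (".", "punct", 2)]
JUMPED = [("The", "det", 1), ("cat", "nsubj", 2), ("jumped", "ROOT", 2),
          ("and", "cc", 2), ("picked", "conj", 2), ("up", "prt", 4),
          ("the", "det", 7), ("biscuit", "dobj", 4), (".", "punct", 2)]
ATE = [("The", "det", 1), ("cat", "nsubj", 2), ("ate", "ROOT", 2),
       ("biscuit", "dobj", 2), ("and", "cc", 3), ("cookies", "conj", 3),
       (".", "punct", 2)]

for sent in (SAT, JUMPED, ATE):
    print(sentence_pattern(sent))  # SVO, SVVO, SVOO
```

With a real spaCy Doc, the same tuples could presumably be built per sentence from tok.text, tok.dep_ and tok.head.i - sent.start for each sent in doc.sents. Clauses with their own subject, relative clauses and similar constructions would need extra rules.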
Source: https://stackoverflow.com/questions/49460078/sentence-structure-identification-spacy