How to extract subjects in a sentence and their respective dependent phrases?

逝去的感伤 2021-01-30 06:06

I am trying to work on subject extraction in a sentence, so that I can get the sentiments in accordance with the subject. I am using nltk in python2.7 for this purp

2 Answers
  •  情歌与酒
    2021-01-30 06:30

    I recently solved a very similar problem: I needed to extract subject(s), action, and object(s). I open-sourced my work, so you can check out this library: https://github.com/krzysiekfonal/textpipeliner

    It is based on spaCy (an alternative to NLTK), but it also works on the sentence's parse tree.

    For instance, take this document parsed with spaCy as an example:

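    import spacy
    from textpipeliner import PipelineEngine, Context
    from textpipeliner.pipes import *

    # The "en" shortcut works on older spaCy releases; spaCy 3+ requires a full
    # model name such as "en_core_web_sm".
    nlp = spacy.load("en")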
    doc = nlp(u"The Empire of Japan aimed to dominate Asia and the " \
                   "Pacific and was already at war with the Republic of China " \
                   "in 1937, but the world war is generally said to have begun on " \
                   "1 September 1939 with the invasion of Poland by Germany and " \
                   "subsequent declarations of war on Germany by France and the United Kingdom. " \
                   "From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered " \
                   "or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. " \
                   "Under the Molotov-Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and " \
                   "annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. " \
                   "The war continued primarily between the European Axis powers and the coalition of the United Kingdom " \
                   "and the British Commonwealth, with campaigns including the North Africa and East Africa campaigns, " \
                   "the aerial Battle of Britain, the Blitz bombing campaign, the Balkan Campaign as well as the " \
                   "long-running Battle of the Atlantic. In June 1941, the European Axis powers launched an invasion " \
                   "of the Soviet Union, opening the largest land theatre of war in history, which trapped the major part " \
                   "of the Axis' military forces into a war of attrition. In December 1941, Japan attacked " \
                   "the United States and European territories in the Pacific Ocean, and quickly conquered much of " \
                   "the Western Pacific.")
    

    You can now create a simple pipe structure (more about pipes in the project's README):

    pipes_structure = [SequencePipe([FindTokensPipe("VERB/nsubj/*"),
                                     NamedEntityFilterPipe(),
                                     NamedEntityExtractorPipe()]),
                       FindTokensPipe("VERB"),
                       AnyPipe([SequencePipe([FindTokensPipe("VBD/dobj/NNP"),
                                              AggregatePipe([NamedEntityFilterPipe("GPE"),
                                                             NamedEntityFilterPipe("PERSON")]),
                                              NamedEntityExtractorPipe()]),
                                SequencePipe([FindTokensPipe("VBD/**/*/pobj/NNP"),
                                              AggregatePipe([NamedEntityFilterPipe("LOC"),
                                                             NamedEntityFilterPipe("PERSON")]),
                                              NamedEntityExtractorPipe()])])]
    
    engine = PipelineEngine(pipes_structure, Context(doc), [0,1,2])
    engine.process()
    

    The result is:

    >>>[([Germany], [conquered], [Europe]),
     ([Japan], [attacked], [the, United, States])]
    

    The finding pipes rely heavily on another library, grammaregex. You can read about it in this post: https://medium.com/@krzysiek89dev/grammaregex-library-regex-like-for-text-mining-49e5706c9c6d#.zgx7odhsc
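
    To make the path patterns used by the finding pipes more concrete, here is a minimal sketch of how a pattern such as "VERB/nsubj/NNP" can be read: segments alternate between a token (matched by part of speech or tag) and a dependency label, starting at the root. This is not grammaregex's implementation. The names Tok and find_tokens are hypothetical, the tree is built by hand rather than produced by a parser, and the "**" wildcard is not supported.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tok:
    """Hypothetical stand-in for a parsed token (not a spaCy Token)."""
    text: str
    pos: str   # coarse part of speech, e.g. "VERB"
    tag: str   # fine-grained tag, e.g. "VBD"
    dep: str   # dependency label linking this token to its head
    children: List["Tok"] = field(default_factory=list)


def _matches(segment: str, *values: str) -> bool:
    """True if a pattern segment accepts any of the given values."""
    if segment == "*":
        return True
    if segment.startswith("[") and segment.endswith("]"):
        options = [s.strip() for s in segment[1:-1].split(",")]
    else:
        options = [segment]
    return any(value in options for value in values)


def find_tokens(root: Tok, pattern: str) -> List[Tok]:
    """Walk the tree from the root along a node/edge/node/... pattern."""
    segments = pattern.split("/")
    if len(segments) % 2 == 0:
        raise ValueError("pattern must alternate node/edge/node")
    if not _matches(segments[0], root.pos, root.tag):
        return []
    current = [root]
    for i in range(1, len(segments), 2):
        edge, node = segments[i], segments[i + 1]
        current = [
            child
            for tok in current
            for child in tok.children
            if _matches(edge, child.dep) and _matches(node, child.pos, child.tag)
        ]
    return current


# Hand-built tree for "Germany conquered Europe and Japan attacked China".
# Labels are assigned by hand for illustration; the word "and" is omitted.
second = Tok("attacked", "VERB", "VBD", "conj", [
    Tok("Japan", "PROPN", "NNP", "nsubj"),
    Tok("China", "PROPN", "NNP", "dobj"),
])
root = Tok("conquered", "VERB", "VBD", "ROOT", [
    Tok("Germany", "PROPN", "NNP", "nsubj"),
    Tok("Europe", "PROPN", "NNP", "dobj"),
    second,
])

print([t.text for t in find_tokens(root, "VERB/nsubj/NNP")])            # ['Germany']
print([t.text for t in find_tokens(root, "VERB/conj/VERB/nsubj/NNP")])  # ['Japan']
print([t.text for t in find_tokens(root, "VBD/[dobj,attr]/*")])         # ['Europe']
```

    Because every pattern is anchored at the root, prefixing a pattern with VERB/conj/VERB is what moves the search into the second clause of a compound sentence.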

    EDITED

    The example in the README discards adjectives, but all you need to do is adjust the pipe structure passed to the engine according to your needs. For instance, for your sample sentences I can propose the following structure, which gives you a tuple of 3 elements (subj, verb, adj) for every sentence:

    import spacy
    from textpipeliner import PipelineEngine, Context
    from textpipeliner.pipes import *
    
    pipes_structure = [SequencePipe([FindTokensPipe("VERB/nsubj/NNP"),
                                     NamedEntityFilterPipe(),
                                     NamedEntityExtractorPipe()]),
                           AggregatePipe([FindTokensPipe("VERB"),
                                          FindTokensPipe("VERB/xcomp/VERB/aux/*"),
                                          FindTokensPipe("VERB/xcomp/VERB")]),
                           AnyPipe([FindTokensPipe("VERB/[acomp,amod]/ADJ"),
                                    AggregatePipe([FindTokensPipe("VERB/[dobj,attr]/NOUN/det/DET"),
                                                   FindTokensPipe("VERB/[dobj,attr]/NOUN/[acomp,amod]/ADJ")])])
                          ]
    
    engine = PipelineEngine(pipes_structure, Context(doc), [0,1,2])
    engine.process()
    

    It gives the result:

    [([Donald, Trump], [is], [the, worst])]
    

    One complication is that you have a compound sentence, and the library produces one tuple per sentence. I'll soon add the ability (I need it for my project too) to pass a list of pipe structures to the engine so that it can produce more than one tuple per sentence. For now, you can solve it by creating a second engine for compound sentences whose structure differs only in using VERB/conj/VERB instead of VERB (these patterns always start from ROOT, so VERB/conj/VERB leads you to the second verb in a compound sentence):

    pipes_structure_comp = [SequencePipe([FindTokensPipe("VERB/conj/VERB/nsubj/NNP"),
                                     NamedEntityFilterPipe(),
                                     NamedEntityExtractorPipe()]),
                       AggregatePipe([FindTokensPipe("VERB/conj/VERB"),
                                      FindTokensPipe("VERB/conj/VERB/xcomp/VERB/aux/*"),
                                      FindTokensPipe("VERB/conj/VERB/xcomp/VERB")]),
                       AnyPipe([FindTokensPipe("VERB/conj/VERB/[acomp,amod]/ADJ"),
                                AggregatePipe([FindTokensPipe("VERB/conj/VERB/[dobj,attr]/NOUN/det/DET"),
                                               FindTokensPipe("VERB/conj/VERB/[dobj,attr]/NOUN/[acomp,amod]/ADJ")])])
                      ]
    
    engine2 = PipelineEngine(pipes_structure_comp, Context(doc), [0,1,2])
    

    Now, after you run both engines, you will get the expected result :)

    engine.process()
    engine2.process()
    [([Donald, Trump], [is], [the, worst])]
    [([Hillary], [is], [better])]
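
    If you end up with several engines, a small helper can collect their tuples into one list. The sketch below assumes that process() returns a list of tuples, as the output above suggests. run_engines and StubEngine are hypothetical names; StubEngine only stands in for PipelineEngine so the snippet runs without spaCy or textpipeliner installed.

```python
from typing import Iterable, List, Tuple


def run_engines(engines: Iterable) -> List[Tuple]:
    """Call process() on each engine and concatenate the resulting tuples."""
    results = []
    for engine in engines:
        # Treat a None result as "no tuples found" (an assumption).
        results.extend(engine.process() or [])
    return results


class StubEngine:
    """Hypothetical stand-in for PipelineEngine, used only for this demo."""

    def __init__(self, output):
        self._output = output

    def process(self):
        return self._output


demo = run_engines([
    StubEngine([(["Donald", "Trump"], ["is"], ["the", "worst"])]),
    StubEngine([(["Hillary"], ["is"], ["better"])]),
])
print(demo)
```

    With the real library you would pass [engine, engine2] instead of the stubs.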
    

    I think this is what you need. Of course, I created this pipe structure quickly for the given example sentence, and it won't work for every case. Still, I have seen a lot of sentence structures, and it should already cover a fairly good percentage of them. For the cases that don't work yet, you can add more FindTokensPipe elements etc., and I'm sure that after a few adjustments you will cover a really good number of possible sentences (English is not too complex, so... :)
