How to get all noun phrases in Spacy

轮回少年 2021-02-14 17:06

I am new to Spacy and I would like to extract "all" the noun phrases from a sentence. I'm wondering how I can do it. I have the following code:

    i
3 Answers
  • 2021-02-14 17:32

    Please try this to get all the noun phrases (noun chunks) from a text:

    import spacy
    nlp = spacy.load("en_core_web_sm")
    text = ("We try to explicitly describe the geometry of the edges of the images.")
    doc = nlp(text)
    print([chunk.text for chunk in doc.noun_chunks])
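
    With en_core_web_sm this should print the four base noun phrases, something like:

    ['We', 'the geometry', 'the edges', 'the images']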
    
  • 2021-02-14 17:33

    Please see the commented code below, which recursively combines the nouns. Code inspired by the spaCy docs.

    import spacy

    nlp = spacy.load("en_core_web_sm")

    doc = nlp("We try to explicitly describe the geometry of the edges of the images.")

    for np in doc.noun_chunks:  # use np instead of np.text
        print(np)

    print()

    # code to recursively combine nouns
    # 'We' is actually a pronoun but included in your question
    # hence the token.pos_ == "PRON" part in the last if statement
    # suggest you extract PRON separately like the noun chunks above

    nounIndices = []
    for index, token in enumerate(doc):
        # print(token.text, token.pos_, token.dep_, token.head.text)
        if token.pos_ == 'NOUN':
            nounIndices.append(index)

    print(nounIndices)
    for idxValue in nounIndices:
        # re-parse each time: merging removes tokens, so the stored indices
        # would otherwise no longer line up with the doc
        doc = nlp("We try to explicitly describe the geometry of the edges of the images.")
        span = doc[doc[idxValue].left_edge.i : doc[idxValue].right_edge.i + 1]
        with doc.retokenize() as retokenizer:
            retokenizer.merge(span)

        for token in doc:
            if token.dep_ in ('dobj', 'pobj') or token.pos_ == "PRON":
                print(token.text)
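
    For the example sentence, nounIndices should hold the positions of geometry, edges and images, and for each of them the last loop should print We together with the dobj/pobj tokens, with the noun at that index merged into its full phrase (e.g. 'the geometry of the edges of the images' for geometry).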
    
  • 2021-02-14 17:44

    For every noun chunk you can also get the subtree beneath it. spaCy provides two ways to access it: the left_edge and right_edge attributes, and the subtree attribute, which returns a Token iterator rather than a Span. Combining noun_chunks with their subtrees leads to some duplication, which can be removed later.

    Here is an example using the left_edge and right_edge attributes:

    {np.text
     for nc in doc.noun_chunks
     for np in [nc, doc[nc.root.left_edge.i : nc.root.right_edge.i + 1]]}
    
    ==>
    
    {'We',
     'the edges',
     'the edges of the images',
     'the geometry',
     'the geometry of the edges of the images',
     'the images'}
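
    The subtree attribute should give the same set of phrases; a minimal sketch, assuming the same doc as above (subtree yields a Token iterator, so a contiguous Span is rebuilt from its first and last token):

    phrases = {nc.text for nc in doc.noun_chunks}
    for nc in doc.noun_chunks:
        subtree = list(nc.root.subtree)  # the chunk root and all its syntactic descendants
        phrases.add(doc[subtree[0].i : subtree[-1].i + 1].text)
    print(phrases)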
    