spaCy to CoNLL format without using spaCy's sentence splitter

Question

This post shows how to get the dependencies of a block of text in CoNLL format using spaCy's taggers. This is the solution posted there:

import spacy
nlp_en = spacy.load('en')
doc = nlp_en(u'Bob bought the pizza to Alice')
for sent in doc.sents:
    for i, word in enumerate(sent):
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - sent[0].i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,  # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,  # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_  # Relation
        ))

It outputs this block:

1   Bob bob NNP PERSON  2   nsubj
2   bought  buy VBD     0   ROOT
3   the the DT      4   det
4   pizza   pizza   NN      2   dobj
5   to  to  IN      2   dative
6   Alice   alice   NNP PERSON  5   pobj

I would like to get the same output WITHOUT using doc.sents.

I have my own sentence splitter. I would like to use it and then give spaCy one sentence at a time to get POS tags, named entities, and dependencies.

How can I get the POS tags, named entities, and dependencies of a single sentence in CoNLL format with spaCy, without having to use spaCy's sentence splitter?

Answer 1:

A Doc in spaCy is iterable, and the documentation states that it iterates over Token objects:

 |  __iter__(...)
 |      Iterate over `Token`  objects, from which the annotations can be
 |      easily accessed. This is the main way of accessing `Token` objects,
 |      which are the main way annotations are accessed from Python. If faster-
 |      than-Python speeds are required, you can instead access the annotations
 |      as a numpy array, or access the underlying C data directly from Cython.
 |      
 |      EXAMPLE:
 |          >>> for token in doc

Therefore, I believe you just have to create a Doc for each sentence produced by your own splitter, then do something like the following:

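import spacy

nlp = spacy.load('en')  # same model as in the question

def printConll(split_sentence_text):
    doc = nlp(split_sentence_text)
    for i, word in enumerate(doc):
        if word.head == word:
            head_idx = 0
        else:
            # The doc holds a single sentence, so the position in the doc
            # is already the position in the sentence (no sent[0].i offset).
            head_idx = word.head.i + 1
        print("%d\t%s\t%s\t%s\t%s\t%s\t%s" % (
            i + 1,  # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_,  # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_  # Relation
        ))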

Of course, to follow the CoNLL format you would also have to print a blank line after each sentence.
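
Looping over pre-split sentences and emitting the separating blank line could look like the sketch below. It is only an illustration: `conll_rows` and `to_conll` are hypothetical helper names, `nlp` is assumed to be an already loaded spaCy pipeline, and the input is assumed to be a list of sentence strings produced by your own splitter. The rows are returned as a string rather than printed, so the blank line after each sentence is explicit.

```python
def conll_rows(doc):
    """Build one tab-separated row per token of a single-sentence doc."""
    rows = []
    for i, word in enumerate(doc):
        # The root is its own head; CoNLL marks it with head index 0.
        head_idx = 0 if word.head.i == word.i else word.head.i + 1
        rows.append("\t".join([
            str(i + 1),      # 1-based position in the sentence
            word.text,
            word.lemma_,
            word.tag_,       # fine-grained tag
            word.ent_type_,  # empty string when the token is not an entity
            str(head_idx),
            word.dep_,       # dependency relation
        ]))
    return rows


def to_conll(sentences, nlp):
    """Parse each pre-split sentence separately; end each block with a blank line."""
    return "".join("\n".join(conll_rows(nlp(s))) + "\n\n" for s in sentences)


# Usage (assumes spaCy and an English model are installed):
#   import sys, spacy
#   nlp = spacy.load("en_core_web_sm")
#   sys.stdout.write(to_conll([u"Bob bought the pizza to Alice"], nlp))
```

Because every sentence gets its own doc, token indices restart at 0 for each one, which is why no sentence offset is subtracted from the head index.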

Answer 2:

This post is about a user facing unexpected sentence breaks from spaCy's sentence boundary detection. One of the solutions proposed by the spaCy developers (in that post) is to add the flexibility to supply one's own sentence boundary detection rules. spaCy solves sentence boundary detection in conjunction with dependency parsing, not before it. Therefore, I don't think what you're looking for is supported by spaCy at the moment, though it might be in the near future.

Answer 3:

@ashu's answer is partly right: dependency parsing and sentence boundary detection are tightly coupled by design in spaCy. However, there is also a simple sentencizer:

https://spacy.io/api/sentencizer

It seems the sentencizer just splits on punctuation (which is not perfect). But since such a sentencizer exists, you can create a custom one using your own rules, and it will definitely affect the sentence boundaries.
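
A custom boundary rule of this kind might be sketched as follows. This assumes the pattern described in spaCy's documentation, where a pipeline component placed before the parser sets `is_sent_start` on tokens and the parser then respects those preset boundaries. The `|` delimiter and the function name `set_custom_boundaries` are placeholders for illustration, and the delimiter is assumed to come out of the tokenizer as its own token; your own splitter's rules would go in their place.

```python
BOUNDARY = "|"  # hypothetical delimiter inserted between your own sentences


def set_custom_boundaries(doc):
    """Mark the token after each BOUNDARY token as the start of a sentence."""
    for token in doc[:-1]:  # the last token has no successor to mark
        if token.text == BOUNDARY:
            doc[token.i + 1].is_sent_start = True
    return doc


# Registration, assuming spaCy 3.x (in 2.x the function itself is passed to add_pipe):
#   from spacy.language import Language
#   Language.component("set_custom_boundaries")(set_custom_boundaries)
#   nlp.add_pipe("set_custom_boundaries", before="parser")
```

Note that the delimiter stays in the doc as an ordinary token, so it would have to be skipped when printing the CoNLL rows.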

Source: https://stackoverflow.com/questions/47818504/spacy-to-conll-format-without-using-spacys-sentence-splitter

Tags

python-2.7

dependencies

customization

spacy