Question
I'm trying to annotate a corpus of plain text. I'm working with systemic functional grammar, which is fairly standard in terms of part-of-speech annotation, but differs in terms of phrases/chunks.
Accordingly, I've POS tagged my data with NLTK defaults, and made a regex chunker with nltk.RegexpParser. Basically, the output now is an NLTK-style phrase structure tree:
Tree('S', [Tree('Clause', [Tree('Process-dependencies', [Tree('Participant', [('This', 'DT')]), Tree('Verbal-group', [('is', 'VBZ')]), Tree('Participant', [('a', 'DT'), ('representation', 'NN')]), Tree('Circumstance', [('of', 'IN'), ('the', 'DT'), ('grammar', 'NN')])])]), ('.', '.')])
There is some stuff I want to manually annotate on top of this, however: the systemic grammar breaks down participants and verbal groups into sub-types that probably can't be automatically annotated. So, I was hoping to convert the parse tree format into something an annotation tool (preferably BRAT) could handle, and then go through the text and specify the sub-types manually.
Perhaps the solution would be sort of tricking BRAT into treating the phrase structure like dependencies? I could modify the chunking regex if need be. Are there any converters out there? (BRAT provides ways of converting from CoNLL-2000 and Stanford CoreNLP, so if I could get the phrase structure into either of those forms, that would be acceptable too.)
Thanks!
Answer 1:
Representing a non-binary tree as arcs will be difficult, but it is possible to nest "entity" annotations and use this for a constituency parse structure. Note that I'm not creating nodes for the terminals (part-of-speech tags) of the tree, partly because Brat is not currently good at displaying the unary rules that often apply to terminals. The target format is Brat's standoff format, which is described in the Brat documentation.
First, we need a function to produce standoff annotations. Brat expects standoff offsets in characters, but this first function works in token offsets; they are converted to character offsets in a later step.
(Note this uses NLTK 3.0b and Python 3)
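def _standoff(path, leaves, slices, offset, tree):
    """Collect leaves and token-level slices; return the subtree's width in tokens."""
    width = 0
    for i, child in enumerate(tree):
        if isinstance(child, tuple):
            # A (token, tag) leaf: record the token and count it
            tok, tag = child
            leaves.append(tok)
            width += 1
        else:
            # A subtree: recurse, tracking the path of child indices from the root
            path.append(i)
            width += _standoff(path, leaves, slices, offset + width, child)
            path.pop()
    # Post-order: a node is recorded after all of its descendants
    slices.append((tuple(path), tree.label(), offset, offset + width))
    return width

def standoff(tree):
    """Return (leaf tokens, token-level slices) for a tree with (token, tag) leaves."""
    leaves = []
    slices = []
    _standoff([], leaves, slices, 0, tree)
    return leaves, slices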
Applying this to your example:
>>> from nltk.tree import Tree
>>> tree = Tree('S', [Tree('Clause', [Tree('Process-dependencies', [Tree('Participant', [('This', 'DT')]), Tree('Verbal-group', [('is', 'VBZ')]), Tree('Participant', [('a', 'DT'), ('representation', 'NN')]), Tree('Circumstance', [('of', 'IN'), ('the', 'DT'), ('grammar', 'NN')])])]), ('.', '.')])
>>> standoff(tree)
(['This', 'is', 'a', 'representation', 'of', 'the', 'grammar', '.'],
[((0, 0, 0), 'Participant', 0, 1),
((0, 0, 1), 'Verbal-group', 1, 2),
((0, 0, 2), 'Participant', 2, 4),
((0, 0, 3), 'Circumstance', 4, 7),
((0, 0), 'Process-dependencies', 0, 7),
((0,), 'Clause', 0, 7),
((), 'S', 0, 8)])
This returns the leaf tokens, followed by a list of tuples, one per subtree, with the elements (path of child indices from the root, label, start leaf, stop leaf). The stop index is exclusive.
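To make the tuple layout concrete, here is a small sketch that takes the leaves and token slices printed above (copied in by hand, so it does not need NLTK) and recovers the tokens each constituent covers. The helper name `covered_tokens` is an illustrative assumption and not part of the functions above; it relies on the stop index being exclusive, as in ordinary Python slicing.

```python
# Leaves and token-level slices copied from the standoff(tree) output above.
leaves = ['This', 'is', 'a', 'representation', 'of', 'the', 'grammar', '.']
slices = [
    ((0, 0, 0), 'Participant', 0, 1),
    ((0, 0, 1), 'Verbal-group', 1, 2),
    ((0, 0, 2), 'Participant', 2, 4),
    ((0, 0, 3), 'Circumstance', 4, 7),
    ((0, 0), 'Process-dependencies', 0, 7),
    ((0,), 'Clause', 0, 7),
    ((), 'S', 0, 8),
]


def covered_tokens(leaves, start, stop):
    """Return the tokens of one constituent; `stop` is exclusive (hypothetical helper)."""
    return leaves[start:stop]


for path, label, start, stop in slices:
    # The length of the path is the depth of the node below the root.
    print(len(path), label, covered_tokens(leaves, start, stop))
```

With a real NLTK tree, indexing by the path tuple (tree[path]) should return the matching subtree, which is why the path is kept alongside the offsets.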
To convert this into character standoff:
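def char_standoff(tree):
    """Return (text, character-level slices) with tokens joined by single spaces."""
    leaves, tok_standoff = standoff(tree)
    text = ' '.join(leaves)
    # Map leaf index to its start character
    starts = []
    offset = 0
    for leaf in leaves:
        starts.append(offset)
        offset += len(leaf) + 1
    # Sentinel entry so that starts[end_tok] is valid when end_tok == len(leaves)
    starts.append(offset)
    # starts[end_tok] - 1 drops the separating space, giving an exclusive end offset
    return text, [(path, label, starts[start_tok], starts[end_tok] - 1)
                  for path, label, start_tok, end_tok in tok_standoff]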
Then:
>>> char_standoff(tree)
('This is a representation of the grammar .',
[((0, 0, 0), 'Participant', 0, 4),
((0, 0, 1), 'Verbal-group', 5, 7),
((0, 0, 2), 'Participant', 8, 24),
((0, 0, 3), 'Circumstance', 25, 39),
((0, 0), 'Process-dependencies', 0, 39),
((0,), 'Clause', 0, 39),
((), 'S', 0, 41)])
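A quick way to check offsets like these is to slice the text with them and look at what comes back. The sketch below does that for the output shown above, again with the data copied in by hand. `check_spans` is a hypothetical helper name, and the end offset is treated as exclusive, which is how the numbers above behave.

```python
# Text and character-level spans copied from the char_standoff(tree) output above.
text = 'This is a representation of the grammar .'
char_spans = [
    ((0, 0, 0), 'Participant', 0, 4),
    ((0, 0, 1), 'Verbal-group', 5, 7),
    ((0, 0, 2), 'Participant', 8, 24),
    ((0, 0, 3), 'Circumstance', 25, 39),
    ((0, 0), 'Process-dependencies', 0, 39),
    ((0,), 'Clause', 0, 39),
    ((), 'S', 0, 41),
]


def check_spans(text, spans):
    """Return (label, covered text) pairs, raising ValueError on a malformed span."""
    covered = []
    for path, label, start, stop in spans:
        if not 0 <= start < stop <= len(text):
            raise ValueError('bad offsets for {}: {} {}'.format(label, start, stop))
        snippet = text[start:stop]
        # A span built from whole tokens should not begin or end with a space.
        if snippet != snippet.strip():
            raise ValueError('span for {} is not aligned to tokens'.format(label))
        covered.append((label, snippet))
    return covered


for label, snippet in check_spans(text, char_spans):
    print(label, repr(snippet))
```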
Finally, we can write a function that converts this to Brat's format:
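def write_brat(tree, filename_prefix):
    """Write <prefix>.txt (the text) and <prefix>.ann (one text-bound annotation per subtree)."""
    # Named `spans` so the standoff() function defined earlier is not shadowed
    text, spans = char_standoff(tree)
    with open(filename_prefix + '.txt', 'w') as f:
        print(text, file=f)
    with open(filename_prefix + '.ann', 'w') as f:
        for i, (path, label, start, stop) in enumerate(spans):
            # Tab-separated fields: ID, "label start stop", covered text
            print('T{}'.format(i), '{} {} {}'.format(label, start, stop), text[start:stop], sep='\t', file=f)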
Calling write_brat(tree, '/path/to/something') writes the following to /path/to/something.txt:
This is a representation of the grammar .
and this to /path/to/something.ann:
T0	Participant 0 4	This
T1	Verbal-group 5 7	is
T2	Participant 8 24	a representation
T3	Circumstance 25 39	of the grammar
T4	Process-dependencies 0 39	This is a representation of the grammar
T5	Clause 0 39	This is a representation of the grammar
T6	S 0 41	This is a representation of the grammar .
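Brat expects the annotation types used in an .ann file to be declared in the collection's annotation.conf, so the labels above would probably need to be listed there before the file displays cleanly. The following is a minimal sketch that builds such a file's contents from the labels. The helper name `annotation_conf` is hypothetical, and the four section headers reflect my understanding of Brat's configuration layout, so check them against the Brat documentation for your version.

```python
def annotation_conf(labels):
    """Build annotation.conf text declaring each label once as an entity type."""
    seen = []
    for label in labels:
        if label not in seen:  # keep first-seen order, drop duplicates
            seen.append(label)
    lines = ['[entities]'] + seen
    # Empty sections for the other annotation categories (assumed section names).
    for section in ('[relations]', '[events]', '[attributes]'):
        lines += ['', section]
    return '\n'.join(lines) + '\n'


# Labels in the order they appear in the .ann output above.
labels = ['Participant', 'Verbal-group', 'Participant', 'Circumstance',
          'Process-dependencies', 'Clause', 'S']
print(annotation_conf(labels), end='')
```

The result would be saved as annotation.conf next to the .txt and .ann files. Sub-types for manual annotation could then be added to the same file by hand.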
Source: https://stackoverflow.com/questions/23146072/converting-nltk-phrase-structure-trees-to-brat-ann-standoff