Extract field list from reStructuredText

北海茫月 2021-01-05 10:56

Say I have the following reST input:

Some text ...

:foo: bar

Some text ...

What I would like to end up with is a dict like this:

{"foo": "bar"}

3 Answers
  • 2021-01-05 11:40

    You can try something like the following code. Rather than using publish_parts, I used publish_doctree to get the document tree (the structure Docutils renders as pseudo-XML). I then converted it to an XML DOM in order to extract all the field elements, and for each field element I take its first field_name and field_body elements.

    from docutils.core import publish_doctree
    
    source = """Some text ...
    
    :foo: bar
    
    Some text ...
    """
    
    # Parse reStructuredText input, returning the Docutils doctree as
    # an `xml.dom.minidom.Document` instance.
    doctree = publish_doctree(source).asdom()
    
    # Get all field elements in the document.
    fields = doctree.getElementsByTagName('field')
    
    d = {}
    
    for field in fields:
        # Assumes each field contains exactly one field_name and one field_body.
        field_name = field.getElementsByTagName('field_name')[0]
        field_body = field.getElementsByTagName('field_body')[0]
    
        d[field_name.firstChild.nodeValue] = \
            " ".join(c.firstChild.nodeValue for c in field_body.childNodes)
    
    print(d)  # Prints {'foo': 'bar'}
    

    The xml.dom module isn't the easiest to work with (for example, why do I need .firstChild.nodeValue rather than just .nodeValue?), so you may prefer the xml.etree.ElementTree module, which I find a lot easier to work with. If you use lxml, you can also use XPath notation to find all of the field, field_name and field_body elements.
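
As a rough illustration of the XPath-style lookup mentioned above, here is a minimal sketch that uses the limited XPath subset in the standard library's xml.etree.ElementTree rather than lxml. The XML string is a hand-written stand-in for what Docutils would emit for the sample input (an assumption about its general shape, not captured output), and the names `DOCTREE_XML` and `fields_from_xml` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the XML that Docutils would produce for the
# sample input (assumption: the real output has this general shape).
DOCTREE_XML = """\
<document source="&lt;string&gt;">
  <paragraph>Some text ...</paragraph>
  <field_list>
    <field>
      <field_name>foo</field_name>
      <field_body><paragraph>bar</paragraph></field_body>
    </field>
  </field_list>
  <paragraph>Some text ...</paragraph>
</document>
"""


def fields_from_xml(xml_text):
    """Return a {name: body text} dict for every <field> element."""
    root = ET.fromstring(xml_text)
    result = {}
    # ".//field" matches <field> elements at any depth below the root.
    for field in root.findall(".//field"):
        name = field.findtext("field_name")
        body = field.find("field_body")
        # Join all text inside the body and trim layout whitespace.
        result[name] = "".join(body.itertext()).strip()
    return result


print(fields_from_xml(DOCTREE_XML))
```

With a real document, the string passed in would presumably come from `publish_doctree(source).asdom().toxml()`, as in the code above.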

  • 2021-01-05 11:53

    I have an alternative solution that I find less of a burden, though perhaps more brittle. If you review the implementation of the Node class https://sourceforge.net/p/docutils/code/HEAD/tree/trunk/docutils/docutils/nodes.py you will see that it supports a walk method that can be used to pull out the wanted data without having to create two different XML representations of your data. Here is what I am using now in my prototype code:

    https://github.com/h4ck3rm1k3/gcc-introspector/blob/master/peewee_adaptor.py#L33

    import pprint

    from docutils.core import publish_doctree
    import docutils.nodes
    

    and then

    def walk_docstring(prop):
        doc = prop.__doc__
        doctree = publish_doctree(doc)
        class Walker:
            def __init__(self, doc):
                self.document = doc
                self.fields = {}
            def dispatch_visit(self,x):
                if isinstance(x, docutils.nodes.field):
                    field_name = x.children[0].rawsource
                    field_value = x.children[1].rawsource
                    self.fields[field_name]=field_value
        w = Walker(doctree)
        doctree.walk(w)
        # the collected fields I wanted
        pprint.pprint(w.fields)
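
A variant of the same idea, sketched under a few assumptions: it subclasses `docutils.nodes.SparseNodeVisitor` so that only `visit_field` needs to be written, reads the text with `astext()` instead of `rawsource`, and returns the dict rather than printing it. The names `FieldCollector` and `extract_fields` are hypothetical, and the code requires Docutils to be installed.

```python
from docutils import nodes
from docutils.core import publish_doctree


class FieldCollector(nodes.SparseNodeVisitor):
    """Collect field names and bodies while walking a doctree."""

    def __init__(self, document):
        super().__init__(document)
        self.fields = {}

    def visit_field(self, node):
        # A field node holds a field_name child followed by a field_body child.
        name, body = node.children[0], node.children[1]
        self.fields[name.astext()] = body.astext()


def extract_fields(source):
    """Parse reST text and return its fields as a {name: value} dict."""
    doctree = publish_doctree(source)
    collector = FieldCollector(doctree)
    # walk() calls the matching visit_* method for every node in the tree.
    doctree.walk(collector)
    return collector.fields


sample = "Some text ...\n\n:foo: bar\n\nSome text ...\n"
print(extract_fields(sample))
```

Note that a field list placed at the very top of a document may be treated by Docutils as bibliographic metadata, so the sample keeps a paragraph before it, as in the question.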
    
  • 2021-01-05 11:57

    Here's my ElementTree implementation:

    from docutils.core import publish_doctree
    from xml.etree.ElementTree import fromstring
    
    source = """Some text ...
    
    :foo: bar
    
    Some text ...
    """
    
    
    def gen_fields(source):
        dom = publish_doctree(source).asdom()
        tree = fromstring(dom.toxml())
    
        for field in tree.iter(tag='field'):
            name = next(field.iter(tag='field_name'))
            body = next(field.iter(tag='field_body'))
            yield {name.text: ''.join(body.itertext())}
    

    Usage

    >>> next(gen_fields(source))
    {'foo': 'bar'}
    