Preserving original doctype and declaration of an lxml.etree parsed xml

情话喂你 2021-01-01 18:28

I'm using Python's lxml and I'm trying to read an XML document, modify it, and write it back, but the original DOCTYPE and XML declaration disappear. I'm wondering if there…

2 Answers
  • 2021-01-01 19:11

    You can also preserve DOCTYPE and the XML declaration with fromstring():

    import sys
    from lxml import etree
    
    # bytes literal: fromstring() raises ValueError for str input
    # that carries an encoding declaration
    xml = b'''<?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
     <head>
     <title>example</title>
     </head>
     <body>
     <p>This is an example</p>
     </body>
    </html>'''
    
    tree = etree.fromstring(xml).getroottree() # or etree.parse(file)
    # write() emits bytes when an encoding is given, so use the binary stdout
    tree.write(sys.stdout.buffer, xml_declaration=True, encoding=tree.docinfo.encoding)
    

    Output

    <?xml version='1.0' encoding='UTF-8'?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
     <head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
     <title>example</title>
     </head>
     <body>
     <p>This is an example</p>
     </body>
    </html>
    

    Note that both the XML declaration (with the correct encoding) and the DOCTYPE are present. The serializer writes ' instead of " in the XML declaration, which is equally valid XML, and it adds a Content-Type meta element to the <head>.

    For @John Keyes' example input, it produces the same result as etree.tostring() in his answer.
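
To check what the parser kept before serializing, the docinfo object of the tree can be inspected directly. The following is a minimal sketch, assuming a shortened, hypothetical version of the sample document above; it only reads attributes that lxml's DocInfo exposes.

```python
from lxml import etree

# Hypothetical, shortened sample document. It is a bytes literal because
# fromstring() rejects str input that carries an encoding declaration.
xml = b'''<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head><title>example</title></head><body/></html>'''

# docinfo lives on the ElementTree, not on the Element.
docinfo = etree.fromstring(xml).getroottree().docinfo

print(docinfo.xml_version)  # version from the XML declaration
print(docinfo.encoding)     # encoding from the XML declaration
print(docinfo.root_name)    # root element name from the DOCTYPE
print(docinfo.public_id)    # PUBLIC identifier of the DOCTYPE
print(docinfo.system_url)   # SYSTEM identifier of the DOCTYPE
print(docinfo.doctype)      # the DOCTYPE rebuilt as a single string
```

If docinfo.doctype comes back empty, the parsed document had no DOCTYPE, so there is nothing for write() or tostring() to preserve.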

  • 2021-01-01 19:13

    tl;dr

    # adds declaration with version and encoding regardless of
    # which attributes were present in the original declaration
    # expects utf-8 encoding (encode/decode calls)
    # depending on your needs you might want to improve that
    from lxml import etree
    from xml.dom.minidom import parseString
    xml1 = '''\
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE root SYSTEM "example.dtd">
    <root>...</root>
    '''
    xml2 = '''\
    <root>...</root>
    '''
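    
    def has_xml_declaration(xml):
        # minidom leaves .version as None when the input has no XML declaration
        return parseString(xml).version is not None
    
    def process(xml):
        # getroottree() gives the ElementTree, which carries DOCTYPE and docinfo
        t = etree.fromstring(xml.encode()).getroottree()
        if has_xml_declaration(xml):
            print(etree.tostring(t, xml_declaration=True, encoding=t.docinfo.encoding).decode())
        else:
            print(etree.tostring(t).decode())
    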
    process(xml1)
    process(xml2)
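
The snippet above parses the input a second time with minidom just to detect the declaration. As a lighter alternative, here is a rough sketch that looks at the start of the text instead. The helper name starts_with_xml_declaration is hypothetical, and the sketch assumes the document is available as a str.

```python
import re

# Assumption: the document is a str. An optional BOM may precede the
# declaration, and "<?xml" must be followed by whitespace so that processing
# instructions such as <?xml-stylesheet ...?> are not mistaken for it.
_XML_DECL = re.compile(r'\ufeff?<\?xml\s')

def starts_with_xml_declaration(text):
    """Return True if text begins with an XML declaration (hypothetical helper)."""
    return _XML_DECL.match(text) is not None

with_decl = '<?xml version="1.0" encoding="UTF-8"?>\n<root>...</root>\n'
without_decl = '<root>...</root>\n'
stylesheet_only = '<?xml-stylesheet href="style.xsl"?>\n<root/>\n'

print(starts_with_xml_declaration(with_decl))
print(starts_with_xml_declaration(without_decl))
print(starts_with_xml_declaration(stylesheet_only))
```

This does not validate the declaration. It only reports whether one appears at the very start, which is the only place the XML specification allows it.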
    

    The following will include the DOCTYPE and the XML declaration:

    from io import BytesIO
    from lxml import etree
    
    # bytes input lets the parser honor the declared encoding
    tree = etree.parse(BytesIO(b'''<?xml version="1.0" encoding="iso-8859-1"?>
     <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
      <root>
       <a>&tasty;</a>
     </root>
    '''))
    
    docinfo = tree.docinfo
    # tostring() returns bytes when an encoding is given; decode for printing
    print(etree.tostring(tree, xml_declaration=True, encoding=docinfo.encoding).decode(docinfo.encoding))
    

    Note: tostring does not preserve the DOCTYPE if you create an Element (e.g. using fromstring); it only works when you process the XML using parse.

    Update: as J.F. Sebastian pointed out, my assertion about fromstring is not true. Calling getroottree() on the Element returned by fromstring gives an ElementTree that keeps the DOCTYPE.

    Here is some code to highlight the differences between Element and ElementTree serialization:

    from io import BytesIO
    from lxml import etree
    
    # bytes literal: fromstring() raises ValueError for str input
    # that carries an encoding declaration
    xml_str = b'''<?xml version="1.0" encoding="iso-8859-1"?>
     <!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
      <root>
       <a>&tasty;</a>
     </root>
    '''
    
    # get the ElementTree using parse
    parse_tree = etree.parse(BytesIO(xml_str))
    encoding = parse_tree.docinfo.encoding
    result = etree.tostring(parse_tree, xml_declaration=True, encoding=encoding)
    print("%s\nparse ElementTree:\n%s\n" % ('-'*20, result.decode(encoding)))
    
    # get the ElementTree using fromstring
    fromstring_tree = etree.fromstring(xml_str).getroottree()
    encoding = fromstring_tree.docinfo.encoding
    result = etree.tostring(fromstring_tree, xml_declaration=True, encoding=encoding)
    print("%s\nfromstring ElementTree:\n%s\n" % ('-'*20, result.decode(encoding)))
    
    # DOCTYPE is lost, and no access to encoding (tostring falls back to ASCII)
    fromstring_element = etree.fromstring(xml_str)
    result = etree.tostring(fromstring_element, xml_declaration=True)
    print("%s\nfromstring Element:\n%s\n" % ('-'*20, result.decode()))
    

    and the output is:

    --------------------
    parse ElementTree:
    <?xml version='1.0' encoding='iso-8859-1'?>
    <!DOCTYPE root SYSTEM "test" [
    <!ENTITY tasty "eggs">
    ]>
    <root>
       <a>eggs</a>
     </root>
    
    --------------------
    fromstring ElementTree:
    <?xml version='1.0' encoding='iso-8859-1'?>
    <!DOCTYPE root SYSTEM "test" [
    <!ENTITY tasty "eggs">
    ]>
    <root>
       <a>eggs</a>
     </root>
    
    --------------------
    fromstring Element:
    <?xml version='1.0' encoding='ASCII'?>
    <root>
       <a>eggs</a>
     </root>
    