Define default namespace (unprefixed) in lxml

问题

When rendering XHTML with lxml, everything is fine, unless you happen to use Firefox, which seems unable to deal with namespace-prefixed XHTML elements and javascript. While Opera is able to execute the javascript (this applies to both jQuery and MathJax) fine, no matter whether the XHTML namespace has a prefix (h: in my case) or not, in Firefox the scripts will abort with weird errors (this.head is undefined in the case of MathJax).

I know about the register_namespace function, but it does neither accept None nor "" as namespace prefix. I've heard about _namespace_map in the lxml.etree module, but my Python complains that this attribute doesn't exist (version issues?)

Is there any other way removing the namespace prefix for the XHTML namespace? Note that str.replace, as suggested in the answer to another, related question, is not a method I could accept, as it is not aware of XML semantics and might easily screw up the resulting document.

As per request, you'll find two examples ready to use. One with namespace prefixes and one without. The first one will display 0 in Firefox (wrong) and the second one will display 1 (correct). Opera will render both correct. This is obviously a Firefox bug, but this only serves as a rationale for wanting prefixless XHTML with lxml – there are other good reasons as to reduce traffic for mobile clients etc (even h: is quite a lot if you consider tens or hundret of html tags).

回答1:

This XSL transformation removes all prefixes from content, while maintaining namespaces defined in the root node:

import lxml.etree as ET

content = '''\
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html>
<h:html xmlns:h="http://www.w3.org/1999/xhtml" xmlns:ml="http://foo">
  <h:head>
    <h:title>MathJax Test Page</h:title>
    <h:script type="text/javascript"><![CDATA[
      function test() {
        alert(document.getElementsByTagName("p").length);
      };
    ]]></h:script>
  </h:head>
  <h:body onload="test();">
    <h:p>test</h:p>
    <ml:foo></ml:foo>
  </h:body>
</h:html>
'''
dom = ET.fromstring(content)

xslt = '''\
<xsl:stylesheet version="1.0"
    xmlns="http://www.w3.org/1999/xhtml"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" indent="no"/>

<!-- identity transform for everything else -->
<xsl:template match="/|comment()|processing-instruction()|*|@*">
    <xsl:copy>
      <xsl:apply-templates />
    </xsl:copy>
</xsl:template>

<!-- remove NS from XHTML elements -->
<xsl:template match="*[namespace-uri() = 'http://www.w3.org/1999/xhtml']">
    <xsl:element name="{local-name()}">
      <xsl:apply-templates select="@*|node()" />
    </xsl:element>
</xsl:template>

<!-- remove NS from XHTML attributes -->
<xsl:template match="@*[namespace-uri() = 'http://www.w3.org/1999/xhtml']">
    <xsl:attribute name="{local-name()}">
      <xsl:value-of select="." />
    </xsl:attribute>
</xsl:template>
</xsl:stylesheet>
'''

xslt_doc = ET.fromstring(xslt)
transform = ET.XSLT(xslt_doc)
dom = transform(dom)

print(ET.tostring(dom, pretty_print = True, 
                  encoding = 'utf-8'))

yields

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>MathJax Test Page</title>
    <script type="text/javascript">
      function test() {
        alert(document.getElementsByTagName("p").length);
      };
    </script>
  </head>
  <body onload="test();">
    <p>test</p>
    <ml:foo xmlns:ml="http://foo"/>
  </body>
</html>

回答2:

Use ElementMaker and give it an nsmap that maps None to your default namespace.

#!/usr/bin/env python
# dogeml.py

from lxml.builder import ElementMaker
from lxml import etree

E = ElementMaker(
    nsmap={
        None: "http://wow/"    # <--- This is the special sauce
    }
)

doge = E.doge(
    E.such('markup'),
    E.many('very namespaced', syntax="tricks")
)

options = {
    'pretty_print': True,
    'xml_declaration': True,
    'encoding': 'UTF-8',
}

serialized_bytes = etree.tostring(doge, **options)
print(serialized_bytes.decode(options['encoding']))

As you can see in the output from this script, the default namespace is defined, but the tags do not have a prefix.

<?xml version='1.0' encoding='UTF-8'?>
<doge xmlns="http://wow/">
   <such>markup</such>
   <many syntax="tricks">very namespaced</many>
</doge>

I have tested this code with Python 2.7.6, 3.3.5, and 3.4.0, combined with lxml 3.3.1.

回答3:

To expand on @neirbowj's answer, but using ET.Element and ET.SubElement, and rendering a document with a mix of namespaces, where the root happens to be explicitly namespaced and a subelement (channel) is the default namespace:

# I set up but don't use the default namespace:
root = ET.Element('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF', nsmap={None: 'http://purl.org/rss/1.0/'})
# I use the default namespace by including its URL in curly braces:
e = ET.SubElement(root, '{http://purl.org/rss/1.0/}channel')
print(ET.tostring(root, xml_declaration=True, encoding='utf8').decode())

This will print out the following:

<?xml version='1.0' encoding='utf8'?>
<rdf:RDF xmlns="http://purl.org/rss/1.0/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"><channel/></rdf:RDF>

It automatically uses rdf for the RDF namespace. I'm not sure how it figures it out. If I want to specify it I can add it to my nsmap in the root element:

nsmap = {None: 'http://purl.org/rss/1.0/',
         'doge': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'}
root = ET.Element('{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF', nsmap=nsmap)
e = ET.SubElement(root, '{http://purl.org/rss/1.0/}channel')
print(ET.tostring(root, xml_declaration=True, encoding='utf8').decode())

...and I get this:

<?xml version='1.0' encoding='utf8'?>
<doge:RDF xmlns:doge="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://purl.org/rss/1.0/"><channel/></doge:RDF>

来源：https://stackoverflow.com/questions/13591707/define-default-namespace-unprefixed-in-lxml

标签

python

xslt

xhtml

namespaces

lxml