Normalization in DOM parsing with Java - how does it work?

一向 2020-11-22 04:48

I saw the line below in code for a DOM parser at this tutorial.

doc.getDocumentElement().normalize();

Why do we do this normalization?

3 Answers
  • 2020-11-22 05:13

    As an extension to @JBNizet's answer, for more technical users: here is the implementation of normalize() (declared in the org.w3c.dom.Node interface) from com.sun.org.apache.xerces.internal.dom.ParentNode. It gives you an idea of how it actually works.

    public void normalize() {
        // No need to normalize if already normalized.
        if (isNormalized()) {
            return;
        }
        if (needsSyncChildren()) {
            synchronizeChildren();
        }
        ChildNode kid;
        for (kid = firstChild; kid != null; kid = kid.nextSibling) {
             kid.normalize();
        }
        isNormalized(true);
    }
    

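    It walks over the child nodes and calls kid.normalize() on each one, so the whole subtree is visited recursively.
    This mechanism is overridden in org.apache.xerces.dom.ElementImpl, which does the actual merging and removal of Text nodes: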

    public void normalize() {
         // No need to normalize if already normalized.
         if (isNormalized()) {
             return;
         }
         if (needsSyncChildren()) {
             synchronizeChildren();
         }
         ChildNode kid, next;
         for (kid = firstChild; kid != null; kid = next) {
             next = kid.nextSibling;
    
             // If kid is a text node, we need to check for one of two
             // conditions:
             //   1) There is an adjacent text node
             //   2) There is no adjacent text node, but kid is
             //      an empty text node.
             if ( kid.getNodeType() == Node.TEXT_NODE )
             {
                 // If an adjacent text node, merge it with kid
                 if ( next!=null && next.getNodeType() == Node.TEXT_NODE )
                 {
                     ((Text)kid).appendData(next.getNodeValue());
                     removeChild( next );
                     next = kid; // Don't advance; there might be another.
                 }
                 else
                 {
                     // If kid is empty, remove it
                     if ( kid.getNodeValue() == null || kid.getNodeValue().length() == 0 ) {
                         removeChild( kid );
                     }
                 }
             }
    
             // Otherwise it might be an Element, which is handled recursively
             else if (kid.getNodeType() == Node.ELEMENT_NODE) {
                 kid.normalize();
             }
         }
    
         // We must also normalize all of the attributes
         if ( attributes!=null )
         {
             for( int i=0; i<attributes.getLength(); ++i )
             {
                 Node attr = attributes.item(i);
                 attr.normalize();
             }
         }
    
        // changed() will have occurred when the removeChild() was done,
        // so does not have to be reissued.
    
         isNormalized(true);
     } 
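
    To see the recursion from the outside, here is a minimal sketch that uses only the standard org.w3c.dom API. The class and helper names (RecursiveNormalizeDemo, buildTree, countTextNodes) are made up for illustration and are not part of Xerces. It builds a nested element with three separate Text nodes, one of them empty, and calls normalize() on the root only.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical demo class, not part of any library.
class RecursiveNormalizeDemo {

    // Builds <root><child>...</child></root> where child holds the
    // Text nodes "a", "" and "b" as three separate siblings.
    static Element buildTree() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element root = doc.createElement("root");
        Element child = doc.createElement("child");
        doc.appendChild(root);
        root.appendChild(child);
        child.appendChild(doc.createTextNode("a"));
        child.appendChild(doc.createTextNode(""));
        child.appendChild(doc.createTextNode("b"));
        return root;
    }

    // Counts the Text nodes in the subtree rooted at node.
    static int countTextNodes(Node node) {
        int count = node.getNodeType() == Node.TEXT_NODE ? 1 : 0;
        for (Node kid = node.getFirstChild(); kid != null; kid = kid.getNextSibling()) {
            count += countTextNodes(kid);
        }
        return count;
    }

    public static void main(String[] args) {
        Element root = buildTree();
        System.out.println("text nodes before: " + countTextNodes(root)); // 3
        root.normalize(); // called on the root only; the nested element is handled recursively
        System.out.println("text nodes after:  " + countTextNodes(root)); // 1
        System.out.println("text: " + root.getTextContent());             // ab
    }
}
```

    The Text nodes live one level below the node that normalize() was called on, yet they still end up merged into a single "ab" node.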
    

    Hope this saves you some time.

  • 2020-11-22 05:16

    The rest of the sentence is:

    where only structure (e.g., elements, comments, processing instructions, CDATA sections, and entity references) separates Text nodes, i.e., there are neither adjacent Text nodes nor empty Text nodes.

    This basically means that the following XML element

    <foo>hello 
    wor
    ld</foo>
    

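    could be represented like this in a denormalized node (line breaks shown as \n):

    Element foo
        Text node: ""
        Text node: "hello \n"
        Text node: "wor\n"
        Text node: "ld"
    

    When normalized, the empty Text node is dropped and the adjacent ones are merged into one. Whitespace and line breaks inside the text are kept as they are:

    Element foo
        Text node: "hello \nwor\nld"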
    

    The same goes for attributes, whose Text nodes are merged in the same way: <foo bar="Hello world"/>.
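
    The foo example can be reproduced with a few lines of Java. This is a minimal sketch using the standard org.w3c.dom API; the class name NormalizeDemo and the helper buildDenormalizedFoo are made up for illustration. The Text nodes are created by hand, because a parser will usually not hand you such a fragmented tree for this small input.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of any library.
class NormalizeDemo {

    // Builds <foo> with four adjacent Text children, the first one empty.
    static Element buildDenormalizedFoo() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element foo = doc.createElement("foo");
        doc.appendChild(foo);
        foo.appendChild(doc.createTextNode(""));
        foo.appendChild(doc.createTextNode("hello \n"));
        foo.appendChild(doc.createTextNode("wor\n"));
        foo.appendChild(doc.createTextNode("ld"));
        return foo;
    }

    public static void main(String[] args) {
        Element foo = buildDenormalizedFoo();
        System.out.println("children before: " + foo.getChildNodes().getLength()); // 4
        foo.normalize();
        System.out.println("children after:  " + foo.getChildNodes().getLength()); // 1
        // The single remaining Text node holds the concatenated data,
        // line breaks included.
        System.out.println(foo.getFirstChild().getNodeValue().equals("hello \nwor\nld")); // true
    }
}
```

    Note that the line breaks survive: normalize() concatenates and prunes Text nodes, it does not rewrite their content.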

  • 2020-11-22 05:36

    Simply put, normalisation is the reduction of redundancies.
    Examples of redundancies in an XML document (note that the DOM normalize() method itself only merges adjacent Text nodes and removes empty ones):
    a) whitespace outside of the root/document tags (...<document></document>...)
    b) whitespace within a start tag (<...>) and an end tag (</...>)
    c) whitespace between attributes and their values (i.e. spaces between the key name and =")
    d) superfluous namespace declarations
    e) line breaks/whitespace in the text of attributes and tags
    f) comments, etc.
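
    For comparison, a minimal sketch of what a call to the DOM normalize() method does with whitespace and comments. The class name WhitespaceNormalizeDemo and the helper buildDoc are made up for illustration. Whitespace-only Text nodes are not empty, and a Comment between them means they are not adjacent, so the tree should come out unchanged.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of any library.
class WhitespaceNormalizeDemo {

    // Builds <doc> with three children: whitespace Text, Comment, whitespace Text.
    static Element buildDoc() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element root = doc.createElement("doc");
        doc.appendChild(root);
        root.appendChild(doc.createTextNode("  \n"));
        root.appendChild(doc.createComment(" a comment "));
        root.appendChild(doc.createTextNode("  \n"));
        return root;
    }

    public static void main(String[] args) {
        Element root = buildDoc();
        int before = root.getChildNodes().getLength();
        root.normalize();
        int after = root.getChildNodes().getLength();
        // prints "3 children before, 3 after"
        System.out.println(before + " children before, " + after + " after");
    }
}
```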
