How to get node contents from JDOM

I'm writing an application in Java using JDOM (import org.jdom.*;).

My XML is valid, but sometimes it contains HTML tags. For example, something like this:

  <program-title>Anatomy &amp; Physiology</program-title>
  <overview>
       <content>
              For more info click <a href="page.html">here</a>
              <p>Learn more about the human body.  Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&amp;#160; Online studies options are available.</p>
       </content>
  </overview>
  <key-information>
     <category>Health &amp; Human Services</category>

So my problem is with the <p> tags inside the overview.content node.

I was hoping that this code would work:

        Element overview = sds.getChild("overview");
        Element content = overview.getChild("content");

        System.out.println(content.getText());

but it returns blank.

How do I return all the text (nested tags and all) from the overview.content node?

Thanks

content.getText() returns only the text held directly inside the element, so it is really only useful for leaf elements that contain nothing but text.

The trick is to use org.jdom.output.XMLOutputter (here with the compact format):

import java.io.StringWriter;

import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.jdom.output.Format;
import org.jdom.output.XMLOutputter;

public static void main(String[] args) throws Exception {
    SAXBuilder builder = new SAXBuilder();
    String xmlFileName = "a.xml";
    Document doc = builder.build(xmlFileName);

    Element root = doc.getRootElement();
    Element overview = root.getChild("overview");
    Element content = overview.getChild("content");

    XMLOutputter outp = new XMLOutputter();

    // Compact format normalizes whitespace; the alternatives are left commented out.
    outp.setFormat(Format.getCompactFormat());
    //outp.setFormat(Format.getRawFormat());
    //outp.setFormat(Format.getPrettyFormat());
    //outp.getFormat().setTextMode(Format.TextMode.PRESERVE);

    // Serialize the children of <content> (text and elements), not <content> itself.
    StringWriter sw = new StringWriter();
    outp.output(content.getContent(), sw);
    System.out.println(sw.toString());
}

Output

For more info click<a href="page.html">here</a><p>Learn more about the human body. Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&amp;#160; Online studies options are available.</p>

Do explore the other formatting options and adapt the code above to your needs. The Javadoc for org.jdom.output.Format describes them like this:

"Class to encapsulate XMLOutputter format options. Typical users can use the standard format configurations obtained by getRawFormat() (no whitespace changes), getPrettyFormat() (whitespace beautification), and getCompactFormat() (whitespace normalization)."

You could try the getValue() method for the closest approximation, but it simply concatenates all text within the element and its descendants. It won't give you the <p> tag in any form. If that tag appears in your XML as you've shown, it has become part of the XML markup. It would need to be escaped as &lt;p&gt; or embedded in a CDATA section to be treated as text.
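To make the escaping point concrete, here is a minimal sketch. It deliberately uses the JDK's built-in DOM parser instead of JDOM so it runs without extra libraries; getTextContent() plays roughly the role that getValue() plays in JDOM. The class name EscapedMarkupDemo and the helpers escape and rootText are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class EscapedMarkupDemo {

    // Hypothetical helper: escapes the characters that would otherwise be read as markup.
    // The ampersand is replaced first so the entities added afterwards are not escaped again.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Parses the XML string and returns the concatenated text of the root element
    // and all of its descendants (JDK DOM, not JDOM).
    static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String html = "<p>Learn more</p>";
        // Unescaped: <p> is parsed as an element, so only its text survives.
        System.out.println(rootText("<content>" + html + "</content>"));
        // prints: Learn more

        // Escaped: the tags are ordinary characters inside the text.
        System.out.println(rootText("<content>" + escape(html) + "</content>"));
        // prints: <p>Learn more</p>
    }
}
```

Escaping only helps if you control how the XML is produced; for a file that already contains real <p> elements, serializing the child nodes is the way to get the tags back.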

Alternatively, if you know all elements that either may or may not appear in your XML, you could apply an XSLT transformation that turns stuff which isn't intended as markup into plain text.
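A rough sketch of the XSLT idea, using the javax.xml.transform API that ships with the JDK. It assumes the markup to flatten lives under an element named content, and it flattens every child element rather than a known list of tags. The stylesheet, the class name MarkupToTextDemo and the method flatten are all invented for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class MarkupToTextDemo {

    // Hypothetical XSLT 1.0 stylesheet: writes every element below <content>
    // back out as literal text (tag name, attributes, children, closing tag).
    // Text nodes are copied by XSLT's built-in template rule.
    static final String XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>"
      +   "<xsl:apply-templates select='//content/node()'/>"
      + "</xsl:template>"
      + "<xsl:template match='*'>"
      +   "<xsl:text>&lt;</xsl:text><xsl:value-of select='name()'/>"
      +   "<xsl:for-each select='@*'>"
      +     "<xsl:text> </xsl:text><xsl:value-of select='name()'/>"
      +     "<xsl:text>=\"</xsl:text><xsl:value-of select='.'/><xsl:text>\"</xsl:text>"
      +   "</xsl:for-each>"
      +   "<xsl:text>&gt;</xsl:text>"
      +   "<xsl:apply-templates/>"
      +   "<xsl:text>&lt;/</xsl:text><xsl:value-of select='name()'/><xsl:text>&gt;</xsl:text>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Runs the stylesheet over the given XML and returns the plain-text result.
    static String flatten(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<overview><content>For more info click "
                   + "<a href=\"page.html\">here</a></content></overview>";
        System.out.println(flatten(xml));
    }
}
```

Because the output method is text, entities in the original text are written unescaped (A&amp;P comes out as A&P), so the result is a plain string rather than well-formed XML.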

Well, maybe that's what you need:

import java.io.StringReader;
import java.util.List;

import org.custommonkey.xmlunit.XMLTestCase;
import org.custommonkey.xmlunit.XMLUnit;
import org.jdom.Content;
import org.jdom.Document;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;
import org.testng.annotations.Test;
import org.xml.sax.InputSource;

public class HowToGetNodeContentsJDOM extends XMLTestCase
{
    private static final String XML = "<root>\n" + 
            "  <program-title>Anatomy &amp; Physiology</program-title>\n" + 
            "  <overview>\n" + 
            "       <content>\n" + 
            "              For more info click <a href=\"page.html\">here</a>\n" + 
            "              <p>Learn more about the human body.  Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&amp;#160; Online studies options are available.</p>\n" + 
            "       </content>\n" + 
            "  </overview>\n" + 
            "  <key-information>\n" + 
            "     <category>Health &amp; Human Services</category>\n" + 
            "  </key-information>\n" + 
            "</root>";
    private static final String EXPECTED = "For more info click <a href=\"page.html\">here</a>\n" + 
            "<p>Learn more about the human body.  Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&amp;#160; Online studies options are available.</p>";

    @Test
    public void test() throws Exception
    {
        XMLUnit.setIgnoreWhitespace(true);
        Document document = new SAXBuilder().build(new InputSource(new StringReader(XML)));
        List<Content> content = document.getRootElement().getChild("overview").getChild("content").getContent();
        String out = new XMLOutputter().outputString(content);
        assertXMLEqual("<root>" + EXPECTED + "</root>", "<root>" + out + "</root>");
    }
}

Output:

PASSED: test on instance null(HowToGetNodeContentsJDOM)

===============================================
    Default test
    Tests run: 1, Failures: 0, Skips: 0
===============================================

I am using JDom with generics: http://www.junlu.com/list/25/883674.html

Edit: Actually that's not that much different from Prashant Bhate's answer. Maybe you need to tell us what you are missing...

If you're also generating the XML file you should be able to encapsulate your html data in <![CDATA[]]> so that it isn't parsed by the XML parser.
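If you go the CDATA route, something along these lines could do the wrapping on the generating side. This is only a sketch: the class CdataDemo and the helpers cdata and contentText are hypothetical, and the round trip is checked with the JDK's DOM parser rather than JDOM so that it runs standalone.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CdataDemo {

    // Hypothetical helper: wraps raw HTML in a CDATA section.
    // A literal "]]>" inside the data would end the section early,
    // so it is split across two adjacent CDATA sections.
    static String cdata(String raw) {
        return "<![CDATA[" + raw.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    // Parses the XML and returns the text of the root element (JDK DOM, not JDOM).
    static String contentText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String html = "For more info click <a href=\"page.html\">here</a>";
        String xml = "<content>" + cdata(html) + "</content>";
        System.out.println(xml);
        // The parser hands the HTML back untouched, tags included.
        System.out.println(contentText(xml));
    }
}
```

With the HTML wrapped this way, the <content> element has only text children, so a plain content.getText() call in JDOM should return the HTML as a string.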

The problem is that the <content> node doesn't have a text child; it has a <p> child that happens to contain text.

Try this:

Element overview = sds.getChild("overview");
Element content = overview.getChild("content");
Element p = content.getChild("p");
System.out.println(p.getText());

If you want all the immediate child elements of a node, call getChildren() on it (for example content.getChildren()). If you want ALL the descendant elements, you'll have to call it recursively.

Not particularly pretty, but it works fine (using the JDOM API):

public static String getRawText(Element element) {
    if (element.getContent().size() == 0) {
        return "";
    }

    StringBuffer text = new StringBuffer();
    for (int i = 0; i < element.getContent().size(); i++) {
        final Object obj = element.getContent().get(i);
        if (obj instanceof Text) {
            text.append( ((Text) obj).getText() );
        } else if (obj instanceof Element) {
            Element e = (Element) obj;
            text.append( "<" ).append( e.getName() );
            // dump all attributes
            for (Attribute attribute : (List<Attribute>)e.getAttributes()) {
                text.append(" ").append(attribute.getName()).append("=\"").append(attribute.getValue()).append("\"");
            }
            text.append(">");
            text.append( getRawText( e )).append("</").append(e.getName()).append(">");
        }
    }
    return text.toString();
}

Prashant Bhate's solution is nicer though!

If you want to output the content of some JDOM node, just use

System.out.println(new XMLOutputter().outputString(node));

Source: https://stackoverflow.com/questions/7910474/how-to-get-node-contents-from-jdom

Tags: java, xml, xml-parsing, jdom