How to get node contents from JDOM

大憨熊 提交于 2019-11-29 01:14:42

问题


I'm writing an application in java using import org.jdom.*;

My XML is valid,but sometimes it contains HTML tags. For example, something like this:

  <program-title>Anatomy &amp; Physiology</program-title>
  <overview>
       <content>
              For more info click <a href="page.html">here</a>
              <p>Learn more about the human body.  Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&amp;#160; Online studies options are available.</p>
       </content>
  </overview>
  <key-information>
     <category>Health &amp; Human Services</category>

So my problem is with the < p > tags inside the overview.content node.

I was hoping that this code would work :

        Element overview = sds.getChild("overview");
        Element content = overview.getChild("content");

        System.out.println(content.getText());

but it returns blank.

How do I return all the text ( nested tags and all ) from the overview.content node ?

Thanks


回答1:


content.getText() gives immediate text which is only useful fine with the leaf elements with text content.

Trick is to use org.jdom.output.XMLOutputter ( with text mode CompactFormat )

public static void main(String[] args) throws Exception {
    SAXBuilder builder = new SAXBuilder();
    String xmlFileName = "a.xml";
    Document doc = builder.build(xmlFileName);

    Element root = doc.getRootElement();
    Element overview = root.getChild("overview");
    Element content = overview.getChild("content");

    XMLOutputter outp = new XMLOutputter();

    outp.setFormat(Format.getCompactFormat());
    //outp.setFormat(Format.getRawFormat());
    //outp.setFormat(Format.getPrettyFormat());
    //outp.getFormat().setTextMode(Format.TextMode.PRESERVE);

    StringWriter sw = new StringWriter();
    outp.output(content.getContent(), sw);
    StringBuffer sb = sw.getBuffer();
    System.out.println(sb.toString());
}

Output

For more info click<a href="page.html">here</a><p>Learn more about the human body. Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&amp;#160; Online studies options are available.</p>

Do explore other formatting options and modify above code to your need.

"Class to encapsulate XMLOutputter format options. Typical users can use the standard format configurations obtained by getRawFormat() (no whitespace changes), getPrettyFormat() (whitespace beautification), and getCompactFormat() (whitespace normalization). "




回答2:


You could try using method getValue() for the closest approximation, but what this does is concatenate all text within the element and descendants together. This won't give you the <p> tag in any form. If that tag is in your XML like you've shown, it has become part of the XML markup. It'd need to be included as &lt;p&gt; or embedded in a CDATA section to be treated as text.

Alternatively, if you know all elements that either may or may not appear in your XML, you could apply an XSLT transformation that turns stuff which isn't intended as markup into plain text.




回答3:


Well, maybe that's what you need:

import java.io.StringReader;

import org.custommonkey.xmlunit.XMLTestCase;
import org.custommonkey.xmlunit.XMLUnit;
import org.jdom.input.SAXBuilder;
import org.jdom.output.XMLOutputter;
import org.testng.annotations.Test;
import org.xml.sax.InputSource;

public class HowToGetNodeContentsJDOM extends XMLTestCase
{
    private static final String XML = "<root>\n" + 
            "  <program-title>Anatomy &amp; Physiology</program-title>\n" + 
            "  <overview>\n" + 
            "       <content>\n" + 
            "              For more info click <a href=\"page.html\">here</a>\n" + 
            "              <p>Learn more about the human body.  Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&amp;#160; Online studies options are available.</p>\n" + 
            "       </content>\n" + 
            "  </overview>\n" + 
            "  <key-information>\n" + 
            "     <category>Health &amp; Human Services</category>\n" + 
            "  </key-information>\n" + 
            "</root>";
    private static final String EXPECTED = "For more info click <a href=\"page.html\">here</a>\n" + 
            "<p>Learn more about the human body.  Choose from a variety of Physiology (A&amp;P) designed for complementary therapies.&amp;#160; Online studies options are available.</p>";

    @Test
    public void test() throws Exception
    {
        XMLUnit.setIgnoreWhitespace(true);
        Document document = new SAXBuilder().build(new InputSource(new StringReader(XML)));
        List<Content> content = document.getRootElement().getChild("overview").getChild("content").getContent();
        String out = new XMLOutputter().outputString(content);
        assertXMLEqual("<root>" + EXPECTED + "</root>", "<root>" + out + "</root>");
    }
}

Output:

PASSED: test on instance null(HowToGetNodeContentsJDOM)

===============================================
    Default test
    Tests run: 1, Failures: 0, Skips: 0
===============================================

I am using JDom with generics: http://www.junlu.com/list/25/883674.html

Edit: Actually that's not that much different from Prashant Bhate's answer. Maybe you need to tell us what you are missing...




回答4:


If you're also generating the XML file you should be able to encapsulate your html data in <![CDATA[]]> so that it isn't parsed by the XML parser.




回答5:


The problem is that the <content> node doesn't have a text child; it has a <p> child that happens to contain text.

Try this:

Element overview = sds.getChild("overview");
Element content = overview.getChild("content");
Element p = content.getChild("p");
System.out.println(p.getText());

If you want all the immediate child nodes, call p.getChildren(). If you want to get ALL the child nodes, you'll have to call it recursively.




回答6:


Not particularly pretty but works fine (using JDOM API):

public static String getRawText(Element element) {
    if (element.getContent().size() == 0) {
        return "";
    }

    StringBuffer text = new StringBuffer();
    for (int i = 0; i < element.getContent().size(); i++) {
        final Object obj = element.getContent().get(i);
        if (obj instanceof Text) {
            text.append( ((Text) obj).getText() );
        } else if (obj instanceof Element) {
            Element e = (Element) obj;
            text.append( "<" ).append( e.getName() );
            // dump all attributes
            for (Attribute attribute : (List<Attribute>)e.getAttributes()) {
                text.append(" ").append(attribute.getName()).append("=\"").append(attribute.getValue()).append("\"");
            }
            text.append(">");
            text.append( getRawText( e )).append("</").append(e.getName()).append(">");
        }
    }
    return text.toString();
}

Prashant Bhate's solution is nicer though!




回答7:


If you want to output the content of some JSOM node just use

System.out.println(new XMLOutputter().outputString(node))


来源:https://stackoverflow.com/questions/7910474/how-to-get-node-contents-from-jdom

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!