I am trying to get the text between my custom tags with a SAX parser. So far this works for plain text, but when there are special characters or HTML tags inside my custom tags, I only get part of the text. The sample XML looks like this:
<records>
<car name='HSV Maloo' make='Holden' year='2006'>
<ae_definedTermTitleBegin />Australia<ae_definedTermTitleEnd />
<ae_clauseTitleBegin />1.02 <u>Accounting Terms</u>.<ae_clauseTitleEnd />
</car>
<car name='P50' make='Peel' year='1962'>
<ae_definedTermTitleBegin />Isle of Man<ae_definedTermTitleEnd />
<ae_clauseTitleBegin />Smallest Street-Legal Car at 99cm wide and 59 kg in weight<ae_clauseTitleEnd />
</car>
<car name='Royale' make='Bugatti' year='1931'>
<ae_definedTermTitleBegin />France<ae_definedTermTitleEnd />
<ae_clauseTitleBegin />Most Valuable Car at $15 million<ae_clauseTitleEnd />
</car>
</records>
The output that I am getting is
[Australia, Isle of Man, France]
[., Smallest Street-Legal Car at 99cm wide and 59 kg in weight, Most Valuable Car at $15 million]
As you can see, '1.02 Accounting Terms' is missing from the first clause title; all I get is the trailing dot. How do I correct this?
The SAX parser code:
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.helpers.DefaultHandler
import org.xml.sax.*
class SAXXMLParser extends DefaultHandler {
def DefinedTermTitles = []
def ClauseTitles = []
def currentMessage
def countryFlag = false
void startElement(String ns, String localName, String qName, Attributes atts) {
switch (qName) {
case 'ae_clauseTitleBegin':
//messages.add(currentMessage)
countryFlag = true;
break
case 'ae_definedTermTitleBegin':
//messages.add(currentMessage)
countryFlag = true;
break
}
}
void characters(char[] chars, int offset, int length) {
if (countryFlag) {
currentMessage = new String(chars, offset, length)
println(currentMessage)
}
}
void endElement(String ns, String localName, String qName) {
switch (qName) {
case 'ae_clauseTitleEnd':
ClauseTitles.add(currentMessage)
countryFlag = false;
break
case 'ae_definedTermTitleEnd':
DefinedTermTitles.add(currentMessage)
countryFlag = false;
break
}
}
}
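I'm not familiar with Groovy, so here is a solution in Java. I believe the translation is straightforward. The problem is that the parser calls characters() once for each run of text, and the nested <u> element splits your clause title into several runs. Your handler assigns each run to currentMessage, so only the last one (the dot) survives. Append the runs to a StringBuilder instead, and reset it when the End tag is reached: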
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;
public class SaxHandler extends DefaultHandler {
ArrayList<String> DefinedTermTitles = new ArrayList<>();
ArrayList<String> ClauseTitles = new ArrayList<>();
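// True while the parser is between a ...Begin marker and its ...End marker.
boolean countryFlag = false;
// Collects every chunk of text reported between the two markers.
StringBuilder message = new StringBuilder();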
public void startElement(String ns, String localName, String qName, Attributes atts) {
switch (qName) {
case "ae_clauseTitleBegin":
countryFlag = true;
break;
case "ae_definedTermTitleBegin":
countryFlag = true;
break;
}
}
public void characters(char[] chars, int offset, int length) {
if (countryFlag) {
// Append rather than assign: characters() may be called several times per title.
message.append(chars, offset, length);
}
}
public void endElement(String ns, String localName, String qName) {
switch (qName) {
case "ae_clauseTitleEnd":
ClauseTitles.add(message.toString());
countryFlag = false;
message.setLength(0);
break;
case "ae_definedTermTitleEnd":
DefinedTermTitles.add(message.toString());
countryFlag = false;
message.setLength(0);
break;
}
}
public static void main (String argv []) {
SAXParserFactory factory = SAXParserFactory.newInstance();
try {
String path = "INPUT_PATH_HERE";
InputStream xmlInput = new FileInputStream(path + "test.xml");
SAXParser saxParser = factory.newSAXParser();
SaxHandler handler = new SaxHandler();
saxParser.parse(xmlInput, handler);
System.out.println(handler.DefinedTermTitles);
System.out.println(handler.ClauseTitles);
} catch (Exception err) {
err.printStackTrace ();
}
}
}
Output
[Australia, Isle of Man, France]
[1.02 Accounting Terms., Smallest Street-Legal Car at 99cm wide and 59 kg in weight, Most Valuable Car at $15 million]
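To see why the StringBuilder is needed, it can help to log each characters() call separately. The following is a minimal sketch, not part of the answer above: the class name CharactersCallDemo, the method chunks, and the shortened tag names <t>, <b/> and <e/> are all made up for illustration. It records what every individual characters() call delivers for a cut-down version of the first clause title.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the name and the shortened tags are assumptions.
class CharactersCallDemo {

    // Returns the text delivered by each individual characters() call.
    static List<String> chunks(String xml) {
        List<String> calls = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                calls.add(new String(ch, start, length));
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse demo XML", e);
        }
        return calls;
    }

    public static void main(String[] args) {
        // Same shape as the first clause title: text, nested element, text.
        String xml = "<t><b/>1.02 <u>Accounting Terms</u>.<e/></t>";
        List<String> calls = chunks(xml);
        System.out.println("separate calls: " + calls);
        System.out.println("joined: " + String.join("", calls));
    }
}
```

The text before <u>, inside <u> and after </u> arrives in separate calls (a parser is also allowed to split a single run further), and the last call carries only the dot. That is exactly the value the original handler ended up storing; joining the chunks gives back the full title.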
The author of this XML may not have had the best understanding of how XML works. If I were you, I'd rather put some filtering in place to make this sane again (e.g. turn <tagBegin/>X<tagEnd/> into <tag>X</tag>). That said, since you have now asked this question for several different libraries, here is a solution with XmlParser:
def xml = '''\
<records>
<car name='HSV Maloo' make='Holden' year='2006'>
<ae_definedTermTitleBegin />Australia<ae_definedTermTitleEnd />
<ae_clauseTitleBegin />1.02 <u>Accounting Terms</u>.<ae_clauseTitleEnd />
</car>
<car name='P50' make='Peel' year='1962'>
<ae_definedTermTitleBegin />Isle of Man<ae_definedTermTitleEnd />
<ae_clauseTitleBegin />Smallest Street-Legal Car at 99cm wide and 59 kg in weight<ae_clauseTitleEnd />
</car>
<car name='Royale' make='Bugatti' year='1931'>
<ae_definedTermTitleBegin />France<ae_definedTermTitleEnd />
<ae_clauseTitleBegin />Most Valuable Car at $15 million<ae_clauseTitleEnd />
</car>
</records>
'''
def underp = { l ->
l.inject([texts: [:]]) { r, it ->
if (it.respondsTo('name') && it.name().endsWith('Begin')) {
r.texts[(r.last=it.name().replaceFirst(/Begin$/,''))] = ''
} else if (it.respondsTo('name') && it.name().endsWith('End')) {
r.last = null
} else if (r.last) {
r.texts[r.last] += (it instanceof String) ? it : it.text()
}
r
}.texts
}
def root = new XmlParser().parseText(xml)
root.car.each{
println underp(it.children()).inspect()
}
This prints:
['ae_definedTermTitle':'Australia', 'ae_clauseTitle':'1.02 Accounting Terms.']
['ae_definedTermTitle':'Isle of Man', 'ae_clauseTitle':'Smallest Street-Legal Car at 99cm wide and 59 kg in weight']
['ae_definedTermTitle':'France', 'ae_clauseTitle':'Most Valuable Car at $15 million']
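The filtering mentioned at the start of this answer (turning each Begin/End marker pair into a real element) could look roughly like the sketch below. It is written in Java to match the other answer, and everything in it is an assumption made for illustration: the class name MarkerNormalizer, the methods normalize and texts, and above all the premise that every Begin marker has a matching End marker inside the same parent element, since otherwise the rewritten document is not well-formed. A regular expression is a blunt tool for XML, so treat this as a starting point only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; names and approach are assumptions, not an existing API.
class MarkerNormalizer {

    // Rewrites <fooBegin /> to <foo> and <fooEnd /> to </foo>.
    static String normalize(String xml) {
        return xml
                .replaceAll("<(\\w+)Begin\\s*/>", "<$1>")
                .replaceAll("<(\\w+)End\\s*/>", "</$1>");
    }

    // Parses the normalized XML and returns the text content of every <tag> element.
    static List<String> texts(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(normalize(xml))));
            NodeList nodes = doc.getElementsByTagName(tag);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() also includes the text of nested elements such as <u>.
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("normalized XML could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<records><car>"
                + "<ae_definedTermTitleBegin />Australia<ae_definedTermTitleEnd />"
                + "<ae_clauseTitleBegin />1.02 <u>Accounting Terms</u>.<ae_clauseTitleEnd />"
                + "</car></records>";
        System.out.println(texts(xml, "ae_definedTermTitle"));
        System.out.println(texts(xml, "ae_clauseTitle"));
    }
}
```

Once the markers are real elements, any ordinary XML API can read the titles, and nested markup such as <u> no longer needs special handling.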
Source: https://stackoverflow.com/questions/27302758/discard-html-tags-within-custom-tags-while-getting-text-in-xhtml-using-sax-parse