I am parsing XML using DocumentBuilder in Java 1.4.
The XML has this as its first line:
<?xml version="1.0" encoding="GBK"?>
I want to
This one works for various encodings, taking into account both the BOM and the XML declaration. It defaults to UTF-8 if neither applies:
String encoding;
FileReader reader = null;
XMLStreamReader xmlStreamReader = null;
try {
    InputSource is = new InputSource(file.toURI().toASCIIString());
    XMLInputSource xis = new XMLInputSource(is.getPublicId(), is.getSystemId(), null);
    xis.setByteStream(is.getByteStream());
    PropertyManager pm = new PropertyManager(PropertyManager.CONTEXT_READER);
    // Put an XMLErrorReporter into PropertyManager's private "supportedProps" map via reflection.
    for (Field field : PropertyManager.class.getDeclaredFields()) {
        if (field.getName().equals("supportedProps")) {
            field.setAccessible(true);
            ((HashMap<String, Object>) field.get(pm)).put(
                    Constants.XERCES_PROPERTY_PREFIX + Constants.ERROR_REPORTER_PROPERTY,
                    new XMLErrorReporter());
            break;
        }
    }
    // Ask the Xerces entity manager which encoding it detects for the file.
    encoding = new XMLEntityManager(pm).setupCurrentEntity("[xml]".intern(), xis, false, true);
    // Compare with equals(): != on strings compares references, not contents.
    if (!"UTF-8".equals(encoding)) {
        return encoding;
    }
    // From @matthias-heinrich’s answer:
    reader = new FileReader(file);
    xmlStreamReader = XMLInputFactory.newInstance().createXMLStreamReader(reader);
    encoding = xmlStreamReader.getCharacterEncodingScheme();
    if (encoding == null) {
        encoding = "UTF-8";
    }
} catch (RuntimeException e) {
    throw e;
} catch (Exception e) {
    throw new UndeclaredThrowableException(e);
} finally {
    if (xmlStreamReader != null) {
        try {
            xmlStreamReader.close();
        } catch (XMLStreamException e) {
            // ignored: nothing useful can be done if close() fails
        }
    }
    if (reader != null) {
        try {
            reader.close();
        } catch (IOException e) {
            // ignored: nothing useful can be done if close() fails
        }
    }
}
return encoding;
Tested on Java 6 with:
UTF-8 XML file with BOM, with XML declaration ✓
UTF-8 XML file without BOM, with XML declaration ✓
UTF-8 XML file with BOM, without XML declaration ✓
UTF-8 XML file without BOM, without XML declaration ✓
ISO-8859-1 XML file (no BOM), with XML declaration ✓
UTF-16LE XML file with BOM, without XML declaration ✓
UTF-16BE XML file with BOM, without XML declaration ✓
Standing on the shoulders of these giants:
import java.io.*;
import java.lang.reflect.*;
import java.util.*;
import javax.xml.stream.*;
import org.xml.sax.*;
import com.sun.org.apache.xerces.internal.impl.*;
import com.sun.org.apache.xerces.internal.xni.parser.*;
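The code above depends on JDK-internal com.sun.org.apache.xerces.internal classes and on reflection into a private field, which may not be accessible on newer JDKs. As a rough sketch of the same idea using only public APIs (this is not the code above, and the class and method names XmlEncodingSniffer, encodingFromBom, encodingFromDeclaration and detect are made up for illustration), one could check for a BOM by hand and otherwise ask a StAX reader for the declared encoding. It assumes Java 7 or later, that a BOM should win over the declaration, and it does not distinguish UTF-32 BOMs.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of any library.
final class XmlEncodingSniffer {

    private XmlEncodingSniffer() { }

    /** Returns the encoding implied by a leading byte order mark, or null if there is none. */
    static String encodingFromBom(byte[] head) {
        if (head.length >= 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE"; // assumption: a UTF-32LE BOM is not handled separately
        }
        return null;
    }

    /** Returns the encoding named in the XML declaration, or null if none is declared. */
    static String encodingFromDeclaration(byte[] xml) {
        XMLStreamReader reader = null;
        try {
            // Pass raw bytes (not a Reader) so the parser reads the declaration itself.
            reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(xml));
            return reader.getCharacterEncodingScheme();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not read the XML declaration", e);
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (XMLStreamException ignored) {
                    // nothing useful can be done if close() fails
                }
            }
        }
    }

    /** BOM first, then the XML declaration, then UTF-8 as the default. */
    static String detect(byte[] xml) {
        String encoding = encodingFromBom(xml);
        if (encoding == null) {
            encoding = encodingFromDeclaration(xml);
        }
        return encoding != null ? encoding : "UTF-8";
    }

    public static void main(String[] args) {
        // Assumes the runtime supports the GBK charset, as in the question.
        byte[] sample = "<?xml version=\"1.0\" encoding=\"GBK\"?><root/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample));
    }
}
```

For a file, the bytes could come from java.nio.file.Files.readAllBytes; for large files it would be enough to sniff only the first few bytes for the BOM and hand the parser a stream instead.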
Use javax.xml.stream.XMLStreamReader to parse your file; then you can call getEncoding().
One way to do this looks like this:
final XMLStreamReader xmlStreamReader =
        XMLInputFactory.newInstance().createXMLStreamReader(new FileReader(testFile));
// running on MS Windows, fileEncoding is "CP1251"
String fileEncoding = xmlStreamReader.getEncoding();
// the XML declares UTF-8, so encodingFromXMLDeclaration is "UTF-8"
String encodingFromXMLDeclaration = xmlStreamReader.getCharacterEncodingScheme();
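Since the question uses DocumentBuilder, it may be worth noting that the DOM Level 3 methods on org.w3c.dom.Document expose the same information. The sketch below is an illustration rather than part of either answer: the class name DomEncodingDemo and its method are hypothetical, and getXmlEncoding() needs Java 5 or later, so it would not help on the Java 1.4 runtime mentioned in the question. It parses from an InputStream instead of a Reader so the parser sees the raw bytes and handles the declaration itself.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of any library.
final class DomEncodingDemo {

    private DomEncodingDemo() { }

    /** Returns the encoding from the XML declaration as reported by the DOM, or null if none. */
    static String declaredEncoding(byte[] xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xml));
            return doc.getXmlEncoding();
        } catch (Exception e) {
            // Wrap checked parser and I/O exceptions so callers need no throws clause.
            throw new IllegalStateException("Could not parse the document", e);
        }
    }

    public static void main(String[] args) {
        byte[] sample = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(declaredEncoding(sample));
    }
}
```

Document.getInputEncoding() is the companion method that reports the encoding the parser actually used for the input, which plays a role similar to getEncoding() in the StAX snippet above.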