Filtering illegal XML characters in Java

粉色の甜心 2020-12-30 09:19

XML spec defines a subset of Unicode characters which are allowed in XML documents: http://www.w3.org/TR/REC-xml/#charsets.

How do I filter out these characters from

7 answers
  • 2020-12-30 09:21

    You can use a regular expression (regex) to do the work; see an example in the comments here.
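
    As a rough illustration of the regex approach, the sketch below removes everything outside the XML 1.0 Char production with a negated character class. The class and method names (XmlRegexFilter, stripInvalidXml10) are made up for this example and are not taken from the linked comments.

```java
import java.util.regex.Pattern;

final class XmlRegexFilter {
    // Negated class: matches any code point that is NOT in the XML 1.0 Char production
    // (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]).
    private static final Pattern INVALID_XML10 = Pattern.compile(
            "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // Returns a copy of s with every invalid code point removed.
    static String stripInvalidXml10(String s) {
        return INVALID_XML10.matcher(s).replaceAll("");
    }
}
```

    For example, XmlRegexFilter.stripInvalidXml10("a\u0002b") yields "ab", while tabs, line breaks and well-formed surrogate pairs are left alone. The \x{...} syntax needs Java 7 or later.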

  • 2020-12-30 09:33

    Use either StringEscapeUtils.escapeXml10 or StringEscapeUtils.escapeXml11 from Apache Commons (Lang 3 or Text). These functions escape characters like ", &, ', <, > and a few more, and they also filter out invalid characters.

    For those who don't want to filter invalid characters but escape them with a different escaping system, look at my answer here https://stackoverflow.com/a/59475093/3882565.

  • 2020-12-30 09:35

    Loosely based on a comment in the link from Stephen C's answer, and on Wikipedia's summary of the XML 1.1 spec, here's a Java method that uses regular expressions to check whether a string contains only valid characters:

    import java.util.regex.Pattern;

    boolean isAllValidXmlChars(String s) {
      // xml 1.1 spec http://en.wikipedia.org/wiki/Valid_characters_in_XML
      // matches() tests the whole string, so the character class needs a trailing *
      if (!s.matches("[\\u0001-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]*")) {
        // not in valid ranges
        return false;
      }
      // find() looks for a match anywhere in the string
      if (Pattern.compile("[\\u0001-\\u0008\\u000b-\\u000c\\u000E-\\u001F\\u007F-\\u0084\\u0086-\\u009F]").matcher(s).find()) {
        // a control character
        return false;
      }
    
      // "Characters allowed but discouraged"
      // (the ranges below use plain hyphens; an en dash would be read as a literal character)
      if (Pattern.compile(
        "[\\uFDD0-\\uFDEF\\x{1FFFE}-\\x{1FFFF}\\x{2FFFE}-\\x{2FFFF}\\x{3FFFE}-\\x{3FFFF}\\x{4FFFE}-\\x{4FFFF}\\x{5FFFE}-\\x{5FFFF}\\x{6FFFE}-\\x{6FFFF}\\x{7FFFE}-\\x{7FFFF}\\x{8FFFE}-\\x{8FFFF}\\x{9FFFE}-\\x{9FFFF}\\x{AFFFE}-\\x{AFFFF}\\x{BFFFE}-\\x{BFFFF}\\x{CFFFE}-\\x{CFFFF}\\x{DFFFE}-\\x{DFFFF}\\x{EFFFE}-\\x{EFFFF}\\x{FFFFE}-\\x{FFFFF}\\x{10FFFE}-\\x{10FFFF}]"
      ).matcher(s).find()) {
        return false;
      }
    
      return true;
    }
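
    The method above only reports whether a string is clean. A removal counterpart could look like the following sketch, which assumes you want to drop both the characters outside the XML 1.1 Char production and the restricted control characters. Xml11RegexFilter and strip are hypothetical names, and the discouraged ranges are left untouched here.

```java
import java.util.regex.Pattern;

final class Xml11RegexFilter {
    // Outside the XML 1.1 Char production:
    // [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern NOT_CHAR = Pattern.compile(
            "[^\\x01-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    // XML 1.1 RestrictedChar: control characters that may not appear literally.
    // TAB, LF, CR and NEL (#x85) are deliberately absent from this class.
    private static final Pattern RESTRICTED = Pattern.compile(
            "[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F-\\x84\\x86-\\x9F]");

    // Removes illegal code points first, then restricted control characters.
    static String strip(String s) {
        String withoutIllegal = NOT_CHAR.matcher(s).replaceAll("");
        return RESTRICTED.matcher(withoutIllegal).replaceAll("");
    }
}
```

    Two passes keep each pattern readable; a single combined character class would work as well.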
    
  • 2020-12-30 09:36

    This page includes a Java method that strips invalid XML characters by testing whether each character is within spec, though it doesn't check for the discouraged characters.
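
    The linked method is not reproduced here, but a per-character test of this kind can be sketched as follows. It checks each code point against the XML 1.0 Char production; XmlCharFilter, isValidXml10 and stripInvalid are names chosen for this example only.

```java
final class XmlCharFilter {
    // True if cp is allowed by the XML 1.0 Char production.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Copies only the valid code points of in to a new string.
    static String stripInvalid(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // codePointAt joins a surrogate pair into one code point;
            // an unpaired surrogate comes back as-is and fails the range test.
            int cp = in.codePointAt(i);
            if (isValidXml10(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

    Iterating by code point rather than by char keeps characters above U+FFFF intact.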

    Incidentally, escaping the characters is not a solution since the XML 1.0 and 1.1 specs do not allow the invalid characters in escaped form either.
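
    This is easy to confirm with the parser that ships with the JDK. The sketch below (EscapedControlCharDemo and parses are hypothetical names) tries to parse a small XML 1.0 document and reports whether it was accepted.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class EscapedControlCharDemo {
    // Returns true if the JDK's default DOM parser accepts the document.
    static boolean parses(String xml) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Not well-formed, e.g. a reference to a character outside Char.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

    With this helper, parses("<a>&#9;</a>") succeeds because a tab is a legal character, whereas parses("<a>&#2;</a>") fails even though U+0002 only appears as a character reference.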

  • 2020-12-30 09:36

    Using StringEscapeUtils.escapeXml(xml) from commons-lang will escape the characters, not filter them.

  • 2020-12-30 09:46

    Here's a solution that handles both the raw character and its escaped form in the stream, and works with StAX or SAX. It needs extending for the other invalid characters, but you get the idea:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.Reader;
    import java.io.UnsupportedEncodingException;
    import java.io.Writer;
    
    import org.apache.commons.io.IOUtils;
    import org.apache.xerces.util.XMLChar;
    
    public class IgnoreIllegalCharactersXmlReader extends Reader {
    
        private final BufferedReader underlyingReader;
        private StringBuilder buffer = new StringBuilder(4096);
        private boolean eos = false;
    
        public IgnoreIllegalCharactersXmlReader(final InputStream is) throws UnsupportedEncodingException {
            underlyingReader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
        }
    
        private void fillBuffer() throws IOException {
            final String line = underlyingReader.readLine();
            if (line == null) {
                eos = true;
                return;
            }
            buffer.append(line);
            buffer.append('\n');
        }
    
        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            if(buffer.length() == 0 && eos) {
                return -1;
            }
            int satisfied = 0;
            int currentOffset = off;
            while (false == eos && buffer.length() < len) {
                fillBuffer();
            }
            while (satisfied < len && buffer.length() > 0) {
                char ch = buffer.charAt(0);
                final char nextCh = buffer.length() > 1 ? buffer.charAt(1) : '\0';
                if (ch == '&' && nextCh == '#') {
                    final StringBuilder entity = new StringBuilder();
                    // Since we're reading lines it's safe to assume entity is all
                    // on one line so next char will/could be the hex char
                    int index = 0;
                    char entityCh = '\0';
                    // Read whole entity; stop at the end of the buffer if no ';' follows
                    while (entityCh != ';' && index < buffer.length()) {
                        entityCh = buffer.charAt(index++);
                        entity.append(entityCh);
                    }
                    // if it's bad get rid of it and clean it from the buffer and point to next valid char
                    if (entity.toString().equals("&#2;")) {
                        buffer.delete(0, entity.length());
                        continue;
                    }
                }
                if (XMLChar.isValid(ch)) {
                    satisfied++;
                    cbuf[currentOffset++] = ch;
                }
                buffer.deleteCharAt(0);
            }
            return satisfied;
        }
    
        @Override
        public void close() throws IOException {
            underlyingReader.close();
        }
    
        public static void main(final String[] args) {
            // <XML> is a placeholder for the path to the input file
            final File file = new File(
                    <XML>);
            final File outFile = new File(file.getParentFile(), file.getName()
                    .replace(".xml", ".cleaned.xml"));
            Reader r = null;
            Writer w = null;
            try {
                r = new IgnoreIllegalCharactersXmlReader(new FileInputStream(file));
                w = new OutputStreamWriter(new FileOutputStream(outFile),"UTF-8");
                IOUtils.copyLarge(r, w);
                w.flush();
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                IOUtils.closeQuietly(r);
                IOUtils.closeQuietly(w);
            }
        }
    }
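
    If pulling in Xerces and commons-io is not an option, the same streaming idea can be sketched with the JDK alone. The variant below is an assumption-laden simplification: XmlFilterReader and filterToString are hypothetical names, it only drops raw characters (it does not look for references such as &#2;), and it passes surrogates through on the assumption that they come in well-formed pairs.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class XmlFilterReader extends FilterReader {
    XmlFilterReader(Reader in) {
        super(in);
    }

    // XML 1.0 Char test for a single UTF-16 unit; surrogates are let through.
    static boolean isAllowed(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || Character.isSurrogate(c)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    @Override
    public int read() throws IOException {
        int c;
        while ((c = super.read()) != -1) {
            if (isAllowed((char) c)) {
                return c;
            }
        }
        return -1;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(cbuf, off, len);
            if (n == -1) {
                return -1;
            }
            // Compact the chunk in place, keeping only allowed characters.
            int kept = 0;
            for (int i = 0; i < n; i++) {
                char c = cbuf[off + i];
                if (isAllowed(c)) {
                    cbuf[off + kept++] = c;
                }
            }
            if (kept > 0) {
                return kept;
            }
            // The whole chunk was dropped: read again instead of returning 0.
        }
    }

    // Convenience helper for trying the reader on an in-memory string.
    static String filterToString(String input) {
        try (Reader r = new XmlFilterReader(new StringReader(input))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[8];
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

    Such a reader can be wrapped around any other Reader before it is handed to a parser, for example via new InputSource(reader) for SAX.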
    