ASCII to HTML-Entities Escaping in Java

我是研究僧i 提交于 2020-01-01 15:10:35

问题


I found this website with escape codes and I'm just wondering if someone has done this already so I don't have to spend couple of hours building this logic:

 StringBuffer sb = new StringBuffer();
 int n = s.length();
 for (int i = 0; i < n; i++) {
     char c = s.charAt(i);
     switch (c) {
         case '\u25CF': sb.append("&#9679;"); break;
         case '\u25BA': sb.append("&#9658;"); break;

         /*
         ... the rest of the hex chars literals to HTML entities
         */  

         default:  sb.append(c); break;
     }
 }

回答1:


These "codes" is a mere decimal representation of the unicode value of the actual character. It seems to me that something like this would work, unless you want to be very strict about which codes get converted, and which don't.

StringBuilder sb = new StringBuilder();
 int n = s.length();
 for (int i = 0; i < n; i++) {
     char c = s.charAt(i);
     if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
        sb.append("&#");
        sb.append((int)c);
        sb.append(';');
     } else {
        sb.append(c);
     }

 }



回答2:


The other answers don't work correctly for surrogate pairs, e.g. if you have Emojis such as "😀" (see character info). Here's how to do it in Java 8:

StringBuilder sb = new StringBuilder();
s.codePoints().forEach(codePoint -> {
    if (Character.UnicodeBlock.of(codePoint) != Character.UnicodeBlock.BASIC_LATIN) {
        sb.append("&#");
        sb.append(codePoint);
        sb.append(';');
    } else {
        sb.appendCodePoint(codePoint);
    }
});

And for older Java:

StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.length(); ) {
    int c = s.codePointAt(i);
    if (Character.UnicodeBlock.of(c) != Character.UnicodeBlock.BASIC_LATIN) {
        sb.append("&#");
        sb.append(c);
        sb.append(';');
    } else {
        sb.appendCodePoint(c);
    }
    i += Character.charCount(c);
}

A simple way to test if a solution handles surrogate pairs correctly is to use "\uD83D\uDE00" (😀) as the input. If the output is "&#55357;&#56832;", then it's wrong. The correct output is &#128512;.




回答3:


Hmm, what if you did something like this instead:

if (c > 127) {
    sb.append("&#" + (int) c + ";");
} else {
    sb.append(c);
}

Then you just need to determine the range of characters you want HTML escaped. In this case I just specified any character beyond the ASCII table space.



来源:https://stackoverflow.com/questions/5440768/ascii-to-html-entities-escaping-in-java

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!