How to unescape HTML character entities in Java?

耶瑟儿~ 2020-11-21 22:38

Basically I would like to decode a given HTML document and replace all special characters, such as "&nbsp;" -> " " and "&gt;" -> ">".

11 Answers
  • 2020-11-21 23:23

    The following library can also be used for HTML escaping in Java: unbescape.

    HTML can be unescaped this way:

    import org.unbescape.html.HtmlEscape;
    final String unescapedText = HtmlEscape.unescapeHtml(escapedText);
    
  • 2020-11-21 23:25

    A very simple but inefficient solution without any external library is:

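    import java.io.StringReader;
    import javax.swing.text.html.HTMLDocument;
    import javax.swing.text.html.HTMLEditorKit;

    public static String unescapeHtml3( String str ) {
        try {
            // Let Swing's HTML parser decode the entities, then read back the plain text.
            HTMLDocument doc = new HTMLDocument();
            new HTMLEditorKit().read( new StringReader( "<html><body>" + str ), doc, 0 );
            return doc.getText( 1, doc.getLength() );
        } catch( Exception ex ) {
            // Fall back to the input unchanged if parsing fails.
            return str;
        }
    }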
    

    This should be used only if you have a small number of strings to decode.

  • 2020-11-21 23:25

    Consider using the HtmlManipulator Java class. You may need to add some items (not all entities are in the list).

    The Apache Commons StringEscapeUtils as suggested by Kevin Hakanson did not work 100% for me; several entities like &#145 (left single quote) were translated into '222' somehow. I also tried org.jsoup, and had the same problem.
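
    One plausible explanation, offered as an assumption rather than something this answer confirms: decimal 145 falls in the Unicode C1 control range (128 to 159), whereas in windows-1252 byte 145 is the left single quote. A decoder that maps the number straight to a code point therefore yields a control character. The sketch below uses only the JDK; the class name NumericEntityDecoder and its decode method are hypothetical. It decodes decimal references and remaps 128 to 159 through windows-1252. It does not handle hexadecimal or named entities.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntityDecoder {
    // Decimal numeric character references such as &#145; or &#8211; (semicolon optional).
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});?");
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Hypothetical helper: decodes decimal references, remapping 128..159 via windows-1252. */
    public static String decode(String input) {
        Matcher m = DECIMAL_REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int code = Integer.parseInt(m.group(1));
            String replacement;
            if (code >= 128 && code <= 159) {
                // Treat the value as a windows-1252 byte, e.g. 145 -> U+2018.
                replacement = new String(new byte[] {(byte) code}, CP1252);
            } else if (code != 0 && Character.isValidCodePoint(code)) {
                replacement = new String(Character.toChars(code));
            } else {
                // Leave zero and out-of-range values exactly as they were written.
                replacement = m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("&#145;quoted&#146; &#8211; done"));
    }
}
```

    A few bytes in that range have no windows-1252 assignment, so how they decode depends on the JDK's charset implementation; check those separately if they matter for your input.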

  • 2020-11-21 23:32

    The libraries mentioned in other answers would be fine solutions, but if you already happen to be digging through real-world HTML in your project, the Jsoup project has a lot more to offer than just managing "ampersand pound FFFF semicolon" things.

    // textValue: <p>This is a&nbsp;sample. \"Granny\" Smith &#8211;.<\/p>\r\n
    // becomes this: This is a sample. "Granny" Smith –.
    // with one line of code:
    // Jsoup.parse(textValue).getText(); // for older versions of Jsoup
    Jsoup.parse(textValue).text();
    
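    // Another possibility may be the static unescapeEntities method.
    // The second argument (named inAttribute in jsoup's API) selects strict decoding, as used for attribute values.
    boolean strictMode = true;
    String unescapedString = org.jsoup.parser.Parser.unescapeEntities(textValue, strictMode);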
    

    You also get a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. It is open source under the MIT license.

  • 2020-11-21 23:32

    If you want to mimic what the PHP function htmlspecialchars_decode does, use the PHP function get_html_translation_table() to dump the table, and then use Java code like this:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // LinkedHashMap keeps insertion order, so "&amp;" is always decoded last.
    static Map<String,String> html_specialchars_table = new LinkedHashMap<String,String>();
    static {
            html_specialchars_table.put("&lt;","<");
            html_specialchars_table.put("&gt;",">");
            // Keep "&amp;" last so that "&amp;lt;" decodes to "&lt;" rather than "<".
            html_specialchars_table.put("&amp;","&");
    }
    static String htmlspecialchars_decode_ENT_NOQUOTES(String s){
            for (Map.Entry<String,String> entry : html_specialchars_table.entrySet()) {
                    // replace() matches the key literally; replaceAll() would treat it as a regex.
                    s = s.replace(entry.getKey(), entry.getValue());
            }
            return s;
    }
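
    Replacing entities one key at a time makes the result depend on the order in which keys are visited. As an alternative sketch (the class name SpecialCharsDecoder is hypothetical, and it assumes the same three entities as the table above), a single regex pass decodes each match exactly once, so already-decoded text is never scanned again:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpecialCharsDecoder {
    // Same three entities as the ENT_NOQUOTES table; order is irrelevant here.
    private static final Map<String, String> TABLE = new HashMap<String, String>();
    static {
        TABLE.put("&lt;", "<");
        TABLE.put("&gt;", ">");
        TABLE.put("&amp;", "&");
    }

    // Matches exactly the keys of TABLE.
    private static final Pattern ENTITY = Pattern.compile("&(?:lt|gt|amp);");

    public static String decode(String s) {
        Matcher m = ENTITY.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Each entity is looked up once; the replacement is not re-scanned.
            m.appendReplacement(out, Matcher.quoteReplacement(TABLE.get(m.group())));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("1 &lt; 2 &amp;&amp; 3 &gt; 2"));
    }
}
```

    With this approach "&amp;lt;" decodes to "&lt;" and stops there. Entities outside the table, such as "&quot;", pass through unchanged.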
    