How to stop OWASP AntiSamy from transforming special characters into HTML entities

陌路散爱 submitted on 2019-12-09 07:33:25


I use OWASP AntiSamy with the eBay policy file to prevent XSS attacks on my website.

I also use Hibernate Search to index my objects.

When I use this code:

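import org.owasp.validator.html.AntiSamy;
import org.owasp.validator.html.CleanResults;
import org.owasp.validator.html.Policy;

String html = "special word: été";

// use the eBay policy file
Policy policy = Policy.getInstance(xssPolicyFile.getInputStream());

AntiSamy as = new AntiSamy();
CleanResults cr = as.scan(html, policy);

// result is now: "special word: &eacute;t&eacute;"
String result = cr.getCleanHTML();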

As you can see, every "é" character has been transformed into its HTML entity equivalent, "&eacute;".

My page is served as UTF-8, so I don't need this transformation. Moreover, when I index this text with Hibernate Search, it indexes the word with the HTML entities, so I can't find the word "été" in my index.

How can I force AntiSamy not to transform special characters into their HTML entity equivalents?


PS: an issue has been opened :


I ran into the same problem this morning.

I have encapsulated AntiSamy in a class, and I use StringEscapeUtils from Apache Commons Lang to restore the special characters.

 CleanResults cleanResults = antiSamy.scan(taintedHtml);
 String cleanedHtml = cleanResults.getCleanHTML();
 // unescapeHtml is the Commons Lang 2.x name; in Commons Lang 3 it is unescapeHtml4
 return StringEscapeUtils.unescapeHtml(cleanedHtml);

The result is cleaned-up HTML in which the special characters are no longer HTML-escaped.
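One thing to keep in mind with this approach: unescapeHtml decodes every entity, including &lt;, &gt; and &amp;, so markup that AntiSamy deliberately escaped can turn back into live tags. A more conservative variant is to decode only the entities that stand for non-ASCII characters. Below is a minimal sketch of that idea using only the JDK (Java 9+). The class name IntlEntityDecoder and its small table of named entities are hypothetical, made up for illustration; they are not part of AntiSamy or Commons Lang, and the table would need extending for real use.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: decodes only entities that map to non-ASCII characters.
final class IntlEntityDecoder {

    // Illustrative subset of named entities; extend this table as needed.
    private static final Map<String, Integer> NAMED = Map.of(
            "eacute", 0xE9,
            "egrave", 0xE8,
            "agrave", 0xE0,
            "ccedil", 0xE7,
            "uuml", 0xFC);

    // Matches &name; &#123; and &#x7B; style references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String html) {
        Matcher m = ENTITY.matcher(html);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            Integer cp = codePointOf(m.group(1));
            // Leave ASCII entities (&lt;, &amp;, &#60; ...) and unknown names untouched.
            boolean decodable = cp != null && cp > 127 && Character.isValidCodePoint(cp);
            String replacement = decodable ? new String(Character.toChars(cp)) : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the code point for an entity body, or null if it is not recognised.
    private static Integer codePointOf(String body) {
        try {
            if (body.startsWith("#x") || body.startsWith("#X")) {
                return Integer.parseInt(body.substring(2), 16);
            }
            if (body.startsWith("#")) {
                return Integer.parseInt(body.substring(1));
            }
        } catch (NumberFormatException e) {
            return null; // number too large to be a code point
        }
        return NAMED.get(body);
    }
}
```

It would be called on the output of getCleanHTML() in place of unescapeHtml, for example IntlEntityDecoder.decode(cleanedHtml).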

Hope this helps.


As Mohamad said in a comment, AntiSamy has just released a new directive named entityEncodeIntlChars.

here is the detail :

It seems that this directive solves the problem.
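For reference, a sketch of how the directive might look in the policy file. This assumes the policy uses the usual AntiSamy layout with a directives section, and that setting the value to false is what turns the entity encoding of international characters off; check both against the AntiSamy version you are running.

```xml
<directives>
    <!-- existing directives stay as they are -->
    <directive name="entityEncodeIntlChars" value="false"/>
</directives>
```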


After scouring the AntiSamy source code, I found no way of changing this behavior apart from modifying AntiSamy.


Check out this one:

Grab the source and notice that the key classes (AntiSamyDOMScanner, CleanResults) use standard framework objects (like XmlDocument). Compile it and run against the binary you compiled, so that you can step through everything in a debugger and see which of the major classes actually corrupts your data. With that in hand, you'll be able to either change a few properties on the major objects to make it stop, or inject your own post-processing to revert the wrongdoing (say, with a regexp). Later you can expose that as an additional top-level property, say one named NoMess :-)

Chances are that the behavior in this respect differs between languages (there are three in that trunk), but the same tactics will work no matter which one you have to deal with.

