Here's the goal: to replace all standalone ampersands with &amp; but NOT replace those that are already part of an HTML entity such as &nbsp;.
I think I nee
You could always run html_entity_decode() before you run htmlentities()? That works unless you only want to encode ampersands (and even then you can play with the charset parameters). Much easier and faster than a regex.
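A minimal sketch of that decode-then-encode idea. The function name encode_once is made up for illustration, and UTF-8 input is an assumption. Decoding first means entities already in the string are not encoded a second time.

```php
<?php
// Hypothetical helper: decode existing entities, then encode everything once.
// Assumes the input is UTF-8.
function encode_once(string $text): string
{
    // Turn any existing entities (&amp;, &eacute;, ...) back into characters.
    $decoded = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Encode the plain text exactly once.
    return htmlentities($decoded, ENT_QUOTES, 'UTF-8');
}

echo encode_once('Tom & Jerry &amp; friends'), "\n";
```

Running the result through encode_once() again should leave it unchanged, which is the point of decoding first.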
The others are good suggestions, and might be a better way to do it. But I thought I'd try to answer the question as asked--if only to provide a regex example.
The following is the special exploded form allowed in some engines. Oddly, an engine that allows commented regexes usually allows other simplified expressions too, though they are not as generic. I'll put those simplified expressions in parentheses in the comments.
&                  # an ampersand
(   \#             # a '#' character
    [1-9]          # followed by a non-zero digit,
    [0-9]{1,3}     # then 1 to 3 more digits, 2 to 4 in all (\d{1,3} or \p{IsDigit}{1,3})
  | [A-Za-z]       # OR a letter (\p{IsAlpha})
    [0-9A-Za-z]+   # followed by letters or numbers (\p{IsAlnum}+)
)
;                  # all capped with a ';'
You could even throw a bunch of expected entities in there as well, to help out the regex scanner.
&                  # an ampersand
(   amp | apos | gt | lt | nbsp | quot
                   # standard entities
  | bull | hellip | [lr][ds]quo | [mn]dash | permil
                   # some fancier ones
  | \#             # a '#' character
    [1-9]          # followed by a non-zero digit,
    [0-9]{1,3}     # then 1 to 3 more digits, 2 to 4 in all
  | [A-Za-z]       # OR a letter
    [0-9A-Za-z]+   # followed by letters or numbers
)
;                  # all capped with a ';'
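For what it's worth, PHP's PCRE functions accept this commented form through the x modifier. Below is a rough sketch using the first, shorter pattern to list the entities in a string. The helper name find_entities is invented for the example, and the comments avoid slashes and apostrophes so they cannot end the pattern or the PHP string early.

```php
<?php
// Hypothetical helper: return every substring that the commented
// pattern treats as an entity.
function find_entities(string $text): array
{
    $pattern = '/
        &                   # an ampersand
        (?: \#              # a literal hash sign
            [1-9]           # followed by a non-zero digit
            [0-9]{1,3}      # then 1 to 3 more digits
          | [A-Za-z]        # OR a letter
            [0-9A-Za-z]+    # followed by letters or digits
        )
        ;                   # all capped with a semicolon
    /x';
    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

print_r(find_entities('a & b &amp; c &#160; d'));
```

Note that the pattern matches the entities themselves, so to escape only the standalone ampersands it still has to be combined with a lookahead or a callback.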
I had the same problem and was originally using:
$string = htmlspecialchars($string, ENT_QUOTES, "UTF-8", FALSE);
But I needed it to work with PHP4 and a mix of charsets, so I ended up with:
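function htmlspecialchars_custom($string)
{
    // Remove any stray placeholder bytes already in the input.
    $string = str_replace("\x05\x06", "", $string);
    // Swap the "&" of each existing entity for the placeholder, keeping the ";".
    $string = preg_replace("/&([a-z\d]{2,7}|#\d{2,5});/i", "\x05\x06$1;", $string);
    // Escape everything else, including the remaining bare ampersands.
    $string = htmlspecialchars($string, ENT_QUOTES);
    // Restore the protected entities.
    $string = str_replace("\x05\x06", "&", $string);
    return $string;
}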
It is not perfect, but good enough for my needs.
PHP's htmlentities() has a double_encode argument for this.
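A short sketch of what that looks like. double_encode is the fourth parameter of htmlentities() (htmlspecialchars() has it too), and passing false leaves existing entities alone. The wrapper name encode_keep_entities and the UTF-8 charset are assumptions for the example.

```php
<?php
// Hypothetical wrapper around htmlentities() with double_encode disabled.
// Assumes UTF-8 input.
function encode_keep_entities(string $text): string
{
    // The fourth argument (double_encode) set to false skips
    // anything that is already an entity.
    return htmlentities($text, ENT_QUOTES, 'UTF-8', false);
}

echo encode_keep_entities('Fish & chips &amp; peas'), "\n";
```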
If you want to do things like that in regular expressions, then negative assertions come in useful:
$txt = preg_replace('/&(?![a-z0-9#]+;)/i', '&amp;', $txt);
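Here is a small sketch of the negative-lookahead approach wrapped in a function, with a sample input. The name escape_bare_ampersands is made up, and the character class is assumed to include digits so that numeric references such as &#160; are also left alone.

```php
<?php
// Hypothetical helper: escape only ampersands that do not start an entity.
function escape_bare_ampersands(string $text): string
{
    // (?! ... ) is a negative lookahead: the "&" matches only when it is
    // NOT followed by letters, digits or "#" ending in a semicolon.
    return preg_replace('/&(?![a-z0-9#]+;)/i', '&amp;', $text);
}

echo escape_bare_ampersands('AT&T &amp; &#160; &nbsp; & x'), "\n";
```

The lookahead consumes nothing, so only the ampersand itself is replaced. The check is deliberately loose: anything shaped like an entity is skipped, whether or not it is a real entity name.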
Ross led me to a good answer. Here's the code that seems to work fairly well. So far. :-) The goal, again, is to convert HTML to XML, specifically descriptions for RSS feeds. In the brief testing I've done so far (with some fairly quirky data) I've been able to take strings wrapped in CDATA and unwrap them. It passes validation tests. Thanks, Ross.
// decode all entities
$string = html_entity_decode($string, ENT_COMPAT, 'UTF-8');
// entity-encode only &, <, > and double quotes
$string = htmlspecialchars($string, ENT_COMPAT, 'UTF-8');
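To show why this suits XML, here is the same two-step idea as a function with a sample input. The name html_to_xml_text is invented for the example and UTF-8 input is assumed. Named HTML entities such as &eacute; are not predefined in XML, so they are decoded to real characters and only the XML-special characters are escaped afterwards.

```php
<?php
// Hypothetical helper: make an HTML text fragment safe to embed in XML.
// Assumes UTF-8 input.
function html_to_xml_text(string $html): string
{
    // Decode all entities into real UTF-8 characters.
    $decoded = html_entity_decode($html, ENT_COMPAT, 'UTF-8');
    // Escape only &, <, > and double quotes.
    return htmlspecialchars($decoded, ENT_COMPAT, 'UTF-8');
}

echo html_to_xml_text('Caf&eacute; &amp; bar <b>'), "\n";
```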