Is there any function that I can use to process any string to ensure it won't cause XML parsing problems? I have a PHP script outputting an XML file with content obtained from
htmlspecialchars(trim($_POST['content']), ENT_XML1, 'UTF-8');
Should do it.
Using htmlspecialchars() will solve your problem. See the post below.
PHP - Is htmlentities() sufficient for creating xml-safe values?
You're approaching this the wrong way: don't look for a parser that doesn't give you errors. Instead, make sure your XML is well-formed.
How did you get &rsquo; from the user? If they literally typed it in, you are not processing the input correctly: for example, you should escape & to &amp;. If it is you who put the entity there (perhaps in place of an apostrophe), either define it in the DTD (<!ENTITY rsquo "&#x2019;">) or write it using numeric notation (&#x2019;), because almost all of the named entities are part of HTML. XML defines only a few basic ones, as Gumbo pointed out.
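The difference between these cases can be checked directly. The following is a minimal sketch, assuming PHP's dom extension is available; the helper name is_well_formed_xml is made up for this illustration and simply asks DOMDocument::loadXML whether a string parses.

```php
<?php
// Hypothetical helper (not from the thread): true when $xml is well-formed.
function is_well_formed_xml(string $xml): bool
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

// HTML-only named entity with no DTD: the XML parser does not know it.
$named = '<content>It&rsquo;s</content>';
// Numeric character reference: needs no declaration.
$numeric = '<content>It&#x2019;s</content>';
// The same named entity, declared in an internal DTD subset.
$declared = '<!DOCTYPE content [<!ENTITY rsquo "&#x2019;">]>'
          . '<content>It&rsquo;s</content>';
// A bare ampersand is not well-formed; it has to be written as &amp;.
$bareAmp = '<content>Tom & Jerry</content>';
$escapedAmp = '<content>Tom &amp; Jerry</content>';

foreach (compact('named', 'numeric', 'declared', 'bareAmp', 'escapedAmp') as $label => $xml) {
    echo $label, ': ', is_well_formed_xml($xml) ? 'well-formed' : 'rejected', "\n";
}
```

Only the undeclared named entity and the bare ampersand should be rejected.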
EDIT based on additions to the question:
]]> <°)))><
, you have a problem.&
which should be interpreted like &). If you use htmlspecialchars() with ENT_QUOTES, it should be OK, but see how Drupal does it.
html_entity_decode($string, ENT_QUOTES, 'UTF-8')
I had a similar problem: the data I needed to add to the XML had already been run through htmlentities() by my code (it was not stored in the database like this).
I used:
$doc = new DOMDocument('1.0','utf-8');
$element = $doc->createElement("content");
$element->appendChild($doc->createElement('string', htmlspecialchars(html_entity_decode($string, ENT_QUOTES, 'UTF-8'), ENT_XML1, 'UTF-8')));
$doc->appendChild($element);
Or, if the data has not already been through htmlentities(), just the following should work:
$doc = new DOMDocument('1.0','utf-8');
$element = $doc->createElement("content");
$element->appendChild($doc->createElement('string', htmlspecialchars($string, ENT_XML1, 'UTF-8')));
$doc->appendChild($element);
Basically, using htmlspecialchars with ENT_XML1 should turn user-supplied data into XML-safe data (and works fine for me):
htmlspecialchars($string, ENT_XML1, 'UTF-8');
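As a rough illustration of the two variants above (the sample strings are invented for this sketch): text that already contains HTML entities is decoded first and then escaped for XML, while raw text is escaped directly. Note that ENT_XML1 on its own leaves quotes untouched, which is fine for element content; for attribute values it would need to be combined with ENT_QUOTES.

```php
<?php
// Case 1: the string has already been through htmlentities().
$encoded = 'Tom &amp; Jerry&rsquo;s &lt;b&gt;';

// Decode the HTML entities back to plain UTF-8 characters first...
$raw = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');
// ...then escape only what XML itself requires.
$xmlSafe = htmlspecialchars($raw, ENT_XML1, 'UTF-8');
echo $xmlSafe, "\n"; // &rsquo; is now a literal U+2019 character

// Case 2: raw user input, escaped directly.
$input = "a < b & \"c\" 'd'";

// ENT_XML1 alone: <, > and & are escaped, quotes are left as they are.
$forElementText = htmlspecialchars($input, ENT_XML1, 'UTF-8');
// ENT_QUOTES | ENT_XML1: quotes are escaped too (&quot; and &apos;).
$forAttribute = htmlspecialchars($input, ENT_QUOTES | ENT_XML1, 'UTF-8');

echo $forElementText, "\n";
echo $forAttribute, "\n";
```

Skipping the decode step in case 1 would escape the ampersands of the existing entities a second time, since htmlspecialchars() double-encodes by default.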
The problem is that your htmlentities function is doing what it should: generating HTML entities from characters. You're then inserting these into an XML document which doesn't have the HTML entities defined (things like &rsquo; are HTML-specific).
The easiest way to handle this is to keep all input raw (i.e. don't process it with htmlentities), then generate your XML using PHP's XML functions.
This will ensure that all text is properly encoded, and your XML is well-formed.
Example:
$user_input = "...<>&'";
$doc = new DOMDocument('1.0','utf-8');
$element = $doc->createElement("content");
$element->appendChild($doc->createTextNode($user_input));
$doc->appendChild($element);
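To see what the example above produces, the document can be serialized and parsed back. This is only a sketch: the round trip through loadXML() is added here for illustration and is not part of the original answer.

```php
<?php
$user_input = "...<>&'";

$doc = new DOMDocument('1.0', 'utf-8');
$element = $doc->createElement('content');
// createTextNode() stores the raw string; escaping happens on serialization.
$element->appendChild($doc->createTextNode($user_input));
$doc->appendChild($element);

// Serialize: <, > and & in the text node come out as &lt;, &gt; and &amp;.
$xml = $doc->saveXML();
echo $xml;

// Parse the result back and read the text to confirm nothing was lost.
$check = new DOMDocument();
$check->loadXML($xml);
$roundTripped = $check->documentElement->textContent;

var_dump($roundTripped === $user_input);
```

Because the DOM API does the escaping, the input never needs to pass through htmlentities() or htmlspecialchars() in this approach.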