Question
I have this function to ensure every img tag has an absolute URL:
function absoluteSrc($html, $encoding = 'utf-8')
{
    $dom = new DOMDocument();
    // Workaround to use proper encoding
    $prehtml = "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset={$encoding}\"></head><body>";
    $posthtml = "</body></html>";
    if ($dom->loadHTML($prehtml . trim($html) . $posthtml)) {
        foreach ($dom->getElementsByTagName('img') as $img) {
            if ($img instanceof DOMElement) {
                $src = $img->getAttribute('src');
                if (strpos($src, 'http://') !== 0) {
                    $img->setAttribute('src', 'http://my.server/' . $src);
                }
            }
        }
        $html = $dom->saveHTML();
        // Remove remains of workaround / DOMDocument additions
        $cut_start = strpos($html, '<body>') + 6;
        $cut_length = -1 * (1 + strlen($posthtml));
        $html = substr($html, $cut_start, $cut_length);
    }
    return $html;
}
It works fine, but it returns the entities decoded as Unicode characters:
$html = <<< EOHTML
<p><img src="images/lorem.jpg" alt="lorem" align="left">
Lorem ipsum dolor sit amet consectetuer Nullam felis laoreet
Cum magna. Suscipit sed vel tincidunt urna.<br>
Vel consequat pretium Curabitur faucibus justo adipiscing elit.
<img src="others/ipsum.png" alt="ipsum" align="right"></p>
<center>&copy; Dr&nbsp;Jekyll &amp; Mr&nbsp;Hyde</center>
EOHTML;
echo absoluteSrc($html);
Outputs:
<p><img src="http://my.server/images/lorem.jpg" alt="lorem" align="left">
Lorem ipsum dolor sit amet consectetuer Nullam felis laoreet
Cum magna. Suscipit sed vel tincidunt urna.<br>
Vel consequat pretium Curabitur faucibus justo adipiscing elit.
<img src="http://my.server/others/ipsum.png" alt="ipsum" align="right"></p>
<center>© Dr Jekyll & Mr Hyde</center>
As you can see in the last line:
- &copy; is translated to © (U+00A9),
- &nbsp; to a non-breaking space (U+00A0),
- &amp; to &
I would like them to remain the same as in the input string.
Answer 1:
I'd like to know the answer to this as well.
I ended up converting &..; entities to **ENTITY-...-ENTITY**
before parsing and converting back after it is done.
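A minimal sketch of that placeholder round trip, in case it helps. The function names (protectEntities, restoreEntities, absoluteSrcKeepingEntities) and the $base parameter are made up for illustration and are not from the answer. It assumes the input never contains a literal **ENTITY-...-ENTITY** token of its own, and it serializes the body children one by one with saveHTML($node) instead of cutting the string with substr().

```php
<?php
// Hypothetical helper: swap "&name;" / "&#123;" / "&#xA9;" for a plain ASCII
// token that the HTML parser treats as ordinary text.
function protectEntities(string $html): string
{
    return preg_replace('/&(#?[a-zA-Z0-9]+);/', '**ENTITY-${1}-ENTITY**', $html);
}

// Hypothetical helper: turn the tokens back into the original entities.
function restoreEntities(string $html): string
{
    return preg_replace('/\*\*ENTITY-(#?[a-zA-Z0-9]+)-ENTITY\*\*/', '&${1};', $html);
}

// Same job as absoluteSrc() in the question, with the entities hidden from
// DOMDocument while it parses and serializes. $base is an assumed parameter.
function absoluteSrcKeepingEntities(
    string $html,
    string $base = 'http://my.server/',
    string $encoding = 'utf-8'
): string {
    $dom = new DOMDocument();
    // Encoding workaround taken from the question
    $wrapped = '<html><head><meta http-equiv="Content-Type" content="text/html; charset='
        . $encoding . '"></head><body>'
        . protectEntities(trim($html))
        . '</body></html>';
    if (!$dom->loadHTML($wrapped)) {
        return $html; // parsing failed, hand back the input untouched
    }
    foreach ($dom->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if (strpos($src, 'http://') !== 0) {
            $img->setAttribute('src', $base . $src);
        }
    }
    // Serialize only what is inside <body>
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return restoreEntities($out);
}
```

Calling absoluteSrcKeepingEntities($html) with the string from the question should then give back the markup with the img URLs rewritten and the entities as they were typed. Entities inside attribute values go through the same round trip.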
Answer 2:
The following code seems to work
    // Body of a class method; its header and the getInnerHTML() helper are not shown here.
    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->loadHTML($this->htmlentities2stringcode(rawurldecode($content)));
    $dom->preserveWhiteSpace = true;
    // Strip the wrapper markup added by DOMDocument, then restore the entities
    $innerHTML = str_replace("<html></html><html><body>", "",
        str_replace("</body></html>", "",
            str_replace("+", "%2B", str_replace("<p></p>", "", $this->getInnerHTML($dom)))));
    return $this->stringcode2htmlentities($innerHTML);
}
// ----------------------------------------------------------
function htmlentities2stringcode($string) {
    // Converts HTML entities such as &copy; into the pseudo string version ^copy^ etc.
    $from = array_keys($this->getHTMLEntityStringCodeArray());
    $to = array_values($this->getHTMLEntityStringCodeArray());
    return str_replace($from, $to, $string);
}
// ----------------------------------------------------------
function stringcode2htmlentities($string) {
    // Converts pseudo strings such as ^copy^ back to the original HTML entity &copy; etc.
    $from = array_values($this->getHTMLEntityStringCodeArray());
    $to = array_keys($this->getHTMLEntityStringCodeArray());
    return str_replace($from, $to, $string);
}
// -------------------------------------------------------------
function getHTMLEntityStringCodeArray() {
    // Maps each named entity to an ASCII placeholder that DOMDocument will not decode
    return array('&Alpha;'=>'^Alpha^',
        '&Beta;'=>'^Beta^',
        '&Chi;'=>'^Chi^',
        '&Dagger;'=>'^Dagger^',
        '&Delta;'=>'^Delta^',
        '&Epsilon;'=>'^Epsilon^',
        '&Eta;'=>'^Eta^',
        '&Gamma;'=>'^Gamma^',
        '&Iota;'=>'^Iota^',
        '&Kappa;'=>'^Kappa^',
        '&Lambda;'=>'^Lambda^',
        '&Mu;'=>'^Mu^',
        '&Nu;'=>'^Nu^',
        '&OElig;'=>'^OElig^',
        '&Omega;'=>'^Omega^',
        '&Omicron;'=>'^Omicron^',
        '&Phi;'=>'^Phi^',
        '&Pi;'=>'^Pi^',
        '&Prime;'=>'^Prime^',
        '&Psi;'=>'^Psi^',
        '&Rho;'=>'^Rho^',
        '&Scaron;'=>'^Scaron^',
        '&Sigma;'=>'^Sigma^',
        '&Tau;'=>'^Tau^',
        '&Theta;'=>'^Theta^',
        '&Upsilon;'=>'^Upsilon^',
        '&Xi;'=>'^Xi^',
        '&Yuml;'=>'^Yuml^',
        '&Zeta;'=>'^Zeta^',
        '&alefsym;'=>'^alefsym^',
        '&alpha;'=>'^alpha^',
        '&and;'=>'^and^',
        '&ang;'=>'^ang^',
        '&asymp;'=>'^asymp^',
        '&bdquo;'=>'^bdquo^',
        '&beta;'=>'^beta^',
        '&bull;'=>'^bull^',
        '&cap;'=>'^cap^',
        '&chi;'=>'^chi^',
        '&circ;'=>'^circ^',
        '&clubs;'=>'^clubs^',
        '&cong;'=>'^cong^',
        '&crarr;'=>'^crarr^',
        '&cup;'=>'^cup^',
        '&dArr;'=>'^dArr^',
        '&dagger;'=>'^dagger^',
        '&darr;'=>'^darr^',
        '&delta;'=>'^delta^',
        '&diams;'=>'^diams^',
        '&empty;'=>'^empty^',
        '&emsp;'=>'^emsp^',
        '&ensp;'=>'^ensp^',
        '&epsilon;'=>'^epsilon^',
        '&equiv;'=>'^equiv^',
        '&eta;'=>'^eta^',
        '&euro;'=>'^euro^',
        '&exist;'=>'^exist^',
        '&fnof;'=>'^fnof^',
        '&forall;'=>'^forall^',
        '&frasl;'=>'^frasl^',
        '&gamma;'=>'^gamma^',
        '&ge;'=>'^ge^',
        '&hArr;'=>'^hArr^',
        '&harr;'=>'^harr^',
        '&hearts;'=>'^hearts^',
        '&hellip;'=>'^hellip^',
        '&image;'=>'^image^',
        '&infin;'=>'^infin^',
        '&int;'=>'^int^',
        '&iota;'=>'^iota^',
        '&isin;'=>'^isin^',
        '&kappa;'=>'^kappa^',
        '&lArr;'=>'^lArr^',
        '&lambda;'=>'^lambda^',
        '&lang;'=>'^lang^',
        '&larr;'=>'^larr^',
        '&lceil;'=>'^lceil^',
        '&ldquo;'=>'^ldquo^',
        '&le;'=>'^le^',
        '&lfloor;'=>'^lfloor^',
        '&lowast;'=>'^lowast^',
        '&loz;'=>'^loz^',
        '&lrm;'=>'^lrm^',
        '&lsaquo;'=>'^lsaquo^',
        '&lsquo;'=>'^lsquo^',
        '&mdash;'=>'^mdash^',
        '&minus;'=>'^minus^',
        '&mu;'=>'^mu^',
        '&nabla;'=>'^nabla^',
        '&ndash;'=>'^ndash^',
        '&ne;'=>'^ne^',
        '&ni;'=>'^ni^',
        '&notin;'=>'^notin^',
        '&nsub;'=>'^nsub^',
        '&nu;'=>'^nu^',
        '&oelig;'=>'^oelig^',
        '&oline;'=>'^oline^',
        '&omega;'=>'^omega^',
        '&omicron;'=>'^omicron^',
        '&oplus;'=>'^oplus^',
        '&or;'=>'^or^',
        '&otimes;'=>'^otimes^',
        '&part;'=>'^part^',
        '&permil;'=>'^permil^',
        '&perp;'=>'^perp^',
        '&phi;'=>'^phi^',
        '&pi;'=>'^pi^',
        '&piv;'=>'^piv^',
        '&prime;'=>'^prime^',
        '&prod;'=>'^prod^',
        '&prop;'=>'^prop^',
        '&psi;'=>'^psi^',
        '&rArr;'=>'^rArr^',
        '&radic;'=>'^radic^',
        '&rang;'=>'^rang^',
        '&rarr;'=>'^rarr^',
        '&rceil;'=>'^rceil^',
        '&rdquo;'=>'^rdquo^',
        '&real;'=>'^real^',
        '&rfloor;'=>'^rfloor^',
        '&rho;'=>'^rho^',
        '&rlm;'=>'^rlm^',
        '&rsaquo;'=>'^rsaquo^',
        '&rsquo;'=>'^rsquo^',
        '&sbquo;'=>'^sbquo^',
        '&scaron;'=>'^scaron^',
        '&sdot;'=>'^sdot^',
        '&sigma;'=>'^sigma^',
        '&sigmaf;'=>'^sigmaf^',
        '&sim;'=>'^sim^',
        '&spades;'=>'^spades^',
        '&sub;'=>'^sub^',
        '&sube;'=>'^sube^',
        '&sum;'=>'^sum^',
        '&sup;'=>'^sup^',
        '&supe;'=>'^supe^',
        '&tau;'=>'^tau^',
        '&there4;'=>'^there4^',
        '&theta;'=>'^theta^',
        '&thetasym;'=>'^thetasym^',
        '&thinsp;'=>'^thinsp^',
        '&tilde;'=>'^tilde^',
        '&trade;'=>'^trade^',
        '&uArr;'=>'^uArr^',
        '&uarr;'=>'^uarr^',
        '&upsih;'=>'^upsih^',
        '&upsilon;'=>'^upsilon^',
        '&weierp;'=>'^weierp^',
        '&xi;'=>'^xi^',
        '&yuml;'=>'^yuml^',
        '&zeta;'=>'^zeta^',
        '&zwj;'=>'^zwj^',
        '&zwnj;'=>'^zwnj^');
}
Answer 3:
An alternative solution is to use DOMDocument->saveHTMLFile() (which doesn't convert HTML entities) and read the contents of the saved file back into a string.
It's not super pretty, but it has the advantage of not having to manually find-and-replace entity codes yourself (twice) as per some other solutions proffered here.
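A rough sketch of that round trip through a temporary file. The helper name saveHtmlViaTempFile is made up for illustration. Whether the entities actually come back as written is the claim made above and has not been verified here; it may depend on the libxml version and on whether the document declares a charset, so check the output on your own setup.

```php
<?php
// Hypothetical helper: serialize with saveHTMLFile() and read the file back.
function saveHtmlViaTempFile(DOMDocument $dom): string
{
    $path = tempnam(sys_get_temp_dir(), 'dom');
    if ($path === false) {
        throw new RuntimeException('Could not create a temporary file');
    }
    try {
        // saveHTMLFile() returns the number of bytes written, or false on failure
        if ($dom->saveHTMLFile($path) === false) {
            throw new RuntimeException('saveHTMLFile() failed');
        }
        $html = file_get_contents($path);
        if ($html === false) {
            throw new RuntimeException('Could not read back ' . $path);
        }
        return $html;
    } finally {
        // Always remove the temporary file, even when an exception is thrown
        unlink($path);
    }
}
```

The result is the whole serialized document, so the wrapper markup still has to be stripped afterwards, as in the question's function.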
Source: https://stackoverflow.com/questions/3730933/is-there-a-way-to-keep-entities-intact-while-parsing-html-with-domdocument