I have some plain text and html. I need to create a PHP method that will return the same html, but with before any instances of th
This is going to be tricky.
Whilst you could do it with simple regex hacking, ignoring anything inside a tag, something like the naïve:
preg_replace(
    '/My(<[^>]*>)*\s+(<[^>]*>)*name(<[^>]*>)*\s+(<[^>]*>)*is(<[^>]*>)*\s+(<[^>]*>)*Josh/',
    '<span class="marked">$0</span>', $html
)
that's not at all reliable, partly because HTML can't be parsed with regex: it's valid to put > in an attribute value, and other non-element constructs like comments will be mis-parsed. Even with a more rigorous expression to match tags, something horribly unwieldy like <[^>\s]*(\s+([^>\s]+(\s*=\s*([^"'\s>][^\s>]*|"[^"]*"|'[^']*')\s*))?)*\s*\/?> , you'd still have many of the same problems, especially if the input HTML is not guaranteed valid.
This could even be a security issue: if the HTML you are processing is untrusted, it could fool your parser into turning text content into attributes, resulting in script injection.
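A small illustration of the attribute-value problem, as a sketch only. The pattern is the simple "anything between angle brackets" tag matcher, and the sample anchor tag is made up for the demonstration.

```javascript
// Naive "tag" pattern: '<', then anything that is not '>', then '>'.
const naiveTag = /<[^>]*>/g;

// Made-up sample: a '>' inside a quoted attribute value is legal HTML.
const html = '<a title="a > b" href="#">link</a>';

// The pattern ends the tag at the first '>', so the rest of the
// attribute list is treated as if it were text content.
const seenAsText = html.replace(naiveTag, '');

console.log(JSON.stringify(seenAsText)); // " b\" href=\"#\">link"
```

Anything that then searches or rewrites that "text" is operating on attribute debris, which is how text and attributes can get confused.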
But even ignoring that, you wouldn't be able to ensure proper element nesting. So you might turn:
<em>My name is <strong>Josh</strong>!!!</em>
into the misnested and invalid:
<span class="marked"><em>My name is <strong>Josh</strong></span>!!!</em>
or:
My
<table><tr><td>name is</td></tr></table>
Josh
where those elements can't be wrapped with a span. If you're unlucky, the browser fixups to ‘correct’ your invalid output could end up leaving half the page ‘marked’, or messing up the page layout.
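To make the nesting problem concrete, here is a rough JavaScript port of the naive pattern (using `$&`, JavaScript's spelling of "the whole match"). The `isWellNested` helper is hypothetical and only good enough for this simple sample. The exact output differs from the hand-written example above, but it shows the same kind of misnesting.

```javascript
// Optional run of tags, as in the naive pattern (non-capturing here).
const tags = '(?:<[^>]*>)*';
const naive = new RegExp(['My', 'name', 'is', 'Josh'].join(tags + '\\s+' + tags));

const input = '<em>My name is <strong>Josh</strong>!!!</em>';
const output = input.replace(naive, '<span class="marked">$&</span>');

// Hypothetical helper: rough open/close balance check for simple markup only
// (it would itself be fooled by '>' inside attribute values).
function isWellNested(markup) {
  const stack = [];
  for (const [, close, name] of markup.matchAll(/<(\/?)([a-z]+)[^>]*>/gi)) {
    if (!close) stack.push(name);
    else if (stack.pop() !== name) return false;
  }
  return stack.length === 0;
}

console.log(output);
// <em><span class="marked">My name is <strong>Josh</span></strong>!!!</em>
console.log(isWellNested(input), isWellNested(output)); // true false
```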
So you would have to do this on a parsed-DOM level rather than with string hacking. You could parse the whole string in using PHP, process it and re-serialise, but if it's acceptable from an accessibility point of view, it would probably be easier to do it at the browser end in JavaScript, where the content is already parsed into DOM nodes.
It's still going to be pretty hard. This question handles it where the text will all be inside the same text node, but that's a much simpler case.
What you would effectively have to do would be:
for each Element that may contain a <span>:
    for each child node in the element:
        generate the text content of this node and all following siblings
        match the target string/regex against the whole text
        if there is no match:
            break the outer loop - on to the next element.
        if the current node is an element node and the index of the match is not 0:
            break the inner loop - on to the next sibling node
        if the current node is a text node and the index of the match is > the length of the Text node data:
            break the inner loop - on to the next sibling node
        // now we have to find the position of the end of the match
        n is the length of the match string
        iterate through the remaining text node data and sibling text content:
            compare the length of the text content with n
            less?:
                subtract length from n and continue
            same?:
                we've got a match on a node boundary
                split the first text node if necessary
                insert a new span into the document
                move all the nodes from the first text node to this boundary inside the span
                break to outer loop, next element
            greater?:
                we've got a match ending inside the node.
                is the node a text node?:
                    then we can split the text node
                    also split the first text node if necessary
                    insert a new span into the document
                    move all contained nodes inside the span
                    break to outer loop, next element
                no, an element?:
                    oh dear! We can't insert a span here
Ouch.
Here's an alternative suggestion which is slightly less nasty, if it's acceptable to wrap every text node that is part of a match separately. So:
<p>Oh, my</p> name <div><div>is</div></div> Josh
would leave you with the output:
<p>Oh, <span class="marked">my</span></p>
<span class="marked"> name </span>
<div><div><span class="marked">is</span></div></div>
<span class="marked"> Josh</span>
which might look OK, depending on how you're styling the matches. It would also solve the misnesting problem of matches partially inside elements.
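The split shown above can be reproduced without a DOM, which may make the idea easier to follow. This is only a sketch: `pieces` stands in for the data of the four text nodes in that example, and `markedParts` is a hypothetical name, not part of any library.

```javascript
// Text of each text node, in document order (stand-in for real DOM nodes).
const pieces = ['Oh, my', ' name ', 'is', ' Josh'];

// For each piece, return the substring that falls inside the first match
// ('' where the piece is not part of the match).
function markedParts(pieces, regexp) {
  const match = regexp.exec(pieces.join(''));
  if (!match) return pieces.map(() => '');
  const start = match.index;
  const end = start + match[0].length;
  let offset = 0; // position of the current piece within the joined text
  return pieces.map((piece) => {
    // Clamp the match range to this piece's own coordinates.
    const from = Math.max(start - offset, 0);
    const to = Math.min(end - offset, piece.length);
    offset += piece.length;
    return from < to ? piece.slice(from, to) : '';
  });
}

console.log(markedParts(pieces, /my\s+name\s+is\s+Josh/i));
// [ 'my', ' name ', 'is', ' Josh' ]
```

Each non-empty entry is the part of that text node that would get its own span.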
ETA: Oh sod the pseudocode, I've more-or-less written the code now anyway, might as well finish it. Here's a JavaScript version of the latter approach:
markTextInElement(document.body, /My\s+name\s+is\s+Josh/gi);

function markTextInElement(element, regexp) {
    var nodes= [];
    collectTextNodes(nodes, element);
    var datas= nodes.map(function(node) { return node.data; });
    var text= datas.join('');

    // Get list of [startnodei, startindex, endnodei, endindex] matches.
    // The regexp must have the g flag, otherwise exec() keeps returning
    // the first match and this loop never ends.
    //
    var matches= [], match;
    while (match= regexp.exec(text)) {
        var p0= getPositionInStrings(datas, match.index, false);
        var p1= getPositionInStrings(datas, match.index+match[0].length, true);
        matches.push([p0[0], p0[1], p1[0], p1[1]]);
    }

    // Get list of nodes for each match, split at the edges of the
    // text. Reverse-iterate to avoid the splitting changing nodes we
    // have yet to process.
    //
    for (var i= matches.length; i-->0;) {
        var ni0= matches[i][0], ix0= matches[i][1], ni1= matches[i][2], ix1= matches[i][3];
        var mnodes= nodes.slice(ni0, ni1+1);
        if (ix1<nodes[ni1].length)
            nodes[ni1].splitText(ix1);
        if (ix0>0)
            mnodes[0]= nodes[ni0].splitText(ix0);

        // Replace each text node in the sublist with a wrapped version
        //
        mnodes.forEach(function(node) {
            var span= document.createElement('span');
            span.className= 'marked';
            node.parentNode.replaceChild(span, node);
            span.appendChild(node);
        });
    }
}
function collectTextNodes(texts, element) {
    // Text directly inside these elements is skipped, since a <span>
    // can't be inserted there.
    var textok= [
        'applet', 'col', 'colgroup', 'dl', 'iframe', 'map', 'object', 'ol',
        'optgroup', 'option', 'script', 'select', 'style', 'table',
        'tbody', 'textarea', 'tfoot', 'thead', 'tr', 'ul'
    ].indexOf(element.tagName.toLowerCase())===-1;
    for (var i= 0; i<element.childNodes.length; i++) {
        var child= element.childNodes[i];
        if (child.nodeType===3 && textok)
            texts.push(child);
        if (child.nodeType===1)
            collectTextNodes(texts, child);
    }
}
function getPositionInStrings(strs, index, toend) {
    var ix= 0;
    for (var i= 0; i<strs.length; i++) {
        var n= index-ix, l= strs[i].length;
        if (toend? l>=n : l>n)
            return [i, n];
        ix+= l;
    }
    return [i, 0];
}
// We've used a few ECMAScript Fifth Edition Array features.
// Make them work in browsers that don't support them natively.
//
if (!('indexOf' in Array.prototype)) {
    Array.prototype.indexOf= function(find, i /*opt*/) {
        if (i===undefined) i= 0;
        if (i<0) i+= this.length;
        if (i<0) i= 0;
        for (var n= this.length; i<n; i++)
            if (i in this && this[i]===find)
                return i;
        return -1;
    };
}
if (!('forEach' in Array.prototype)) {
    Array.prototype.forEach= function(action, that /*opt*/) {
        for (var i= 0, n= this.length; i<n; i++)
            if (i in this)
                action.call(that, this[i], i, this);
    };
}
if (!('map' in Array.prototype)) {
    Array.prototype.map= function(mapper, that /*opt*/) {
        var other= new Array(this.length);
        for (var i= 0, n= this.length; i<n; i++)
            if (i in this)
                other[i]= mapper.call(that, this[i], i, this);
        return other;
    };
}
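One pitfall with an `exec()` loop like the one in `markTextInElement`: it relies on the regexp carrying the `g` flag, and it would also spin forever on a pattern that can match the empty string. The sketch below shows one way to guard against both. `findAllMatches` is a hypothetical helper, written in modern JavaScript rather than the ES5-era style above, and it only returns offsets into the joined text.

```javascript
// Hypothetical helper: collect [start, end] offsets of every match in text.
function findAllMatches(text, regexp) {
  // Work on a copy that is guaranteed to be global; without 'g',
  // exec() restarts from index 0 on every call.
  const flags = regexp.flags.includes('g') ? regexp.flags : regexp.flags + 'g';
  const re = new RegExp(regexp.source, flags);
  const found = [];
  let match;
  while ((match = re.exec(text)) !== null) {
    if (match[0].length === 0) {
      re.lastIndex++; // step past an empty match instead of looping on it
      continue;
    }
    found.push([match.index, match.index + match[0].length]);
  }
  return found;
}

console.log(findAllMatches('my name is Josh, My  name is josh', /my\s+name\s+is\s+josh/i));
// [ [ 0, 15 ], [ 17, 33 ] ]
```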
XSL is the right tool for this kind of job. You can do something like this,
<?php
$oldXml= <<<EOT
<html>
<head>
<title>My Name Is Josh!!!</title>
</head>
<body>
<h1>my name is <b>josh</b></h1>
<div>
<a href="http://www.names.com">my name</a> is josh
</div>
<u>my</u> <i>name</i> <b>is</b> <span style="font-family: Tahoma;">Josh</span>.
</body>
</html>
EOT;
$temp = <<<EOT
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
    <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
    <xsl:strip-space elements="*"/>
    <xsl:template match="*">
        <xsl:copy><xsl:copy-of select="@*"/><xsl:apply-templates/></xsl:copy>
    </xsl:template>
    <xsl:template match="text()">
        <span class="marked">
            <xsl:value-of select="current()"/>
        </span>
    </xsl:template>
</xsl:stylesheet>
EOT;
$xml = new DOMDocument;
$xml->loadXML($oldXml);
$xsl = new DOMDocument;
$xsl->loadXML($temp);
$proc = new XSLTProcessor;
$proc->importStyleSheet($xsl); // attach the xsl rules
$newXml = $proc->transformToXML($xml);
echo $newXml;
The HTML must be well-formed XHTML for this to work.
You'll need to get deep into the dark woods of Regex for this, but I'm not sure what value doing so would have if you want to apply the same class to every element. If you're hell-bent on every element having a new span, then this page might help: http://haacked.com/archive/2004/10/25/usingregularexpressionstomatchhtml.aspx
Really the more logical thing to do would be to just apply class="marked" to the body element unless you've got a good reason for adding a duplicate class to everything on the page.
Taken from http://www.php.net/manual/en/function.preg-quote.php
$textbody = "This book is very difficult to find.";
$word = "very";
// Pass the delimiter to preg_quote() so any "/" in $word is escaped too.
$textbody = preg_replace("/" . preg_quote($word, "/") . "/",
    "<i>" . $word . "</i>",
    $textbody);
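The same quoting concern applies if the matching is done in the browser instead. A rough JavaScript counterpart might look like the sketch below. Both `escapeRegExp` and `phraseToRegExp` are hypothetical helper names, and letting the words be separated by any whitespace is an assumption about what the search should tolerate.

```javascript
// Hypothetical helper: escape characters that are special in a JavaScript regex.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: match the words of a phrase, separated by any
// run of whitespace, case-insensitively and globally.
function phraseToRegExp(phrase) {
  const words = phrase.trim().split(/\s+/).map(escapeRegExp);
  return new RegExp(words.join('\\s+'), 'gi');
}

const re = phraseToRegExp('My name is Josh (really?)');
console.log(re.source); // My\s+name\s+is\s+Josh\s+\(really\?\)
console.log(re.test('my  name is josh (really?)')); // true
```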
Here I post precisely what you want.
$string='<html>
<head>
<title>My Name Is Josh!!!</title>
</head>
<body>
<h1>my name is <b>josh</b></h1>
<div>
<a href="http://www.names.com">my name</a> is josh
</div>
<u>my</u> <i>name</i> <b>is</b> <span style="font-family: Tahoma;">Josh</span>.
</body>
';
// Capture only the text between the tags, so the angle brackets are not
// duplicated and no clean-up of "<<" and ">>" is needed afterwards.
$string=preg_replace('/>(.+)</','><span class="marked">$1</span><',$string);
echo $string;