So I have this code:
function getTagContent($string, $tagname) {
$pattern = "/<$tagname.*?>(.*)<\/$tagname>/";
preg_match($pattern, $string
Try DOM instead:
$url = "http://www.freakonomics.com/2008/09/24/wall-street-jokes-please/";
$doc = new DOMDocument();
$loaded = $doc->loadHTMLFile($url); // returns true on success, false on failure
$items = $doc->getElementsByTagName('title');
for ($i = 0; $i < $items->length; $i++)
{
echo $items->item($i)->nodeValue . "\n";
}
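If you already have the HTML in a string (as getTagContent does), the same DOM idea can be wrapped in a small function. This is only a rough sketch: getTagContentDom is a hypothetical name, the sample HTML is made up, and it assumes the DOM extension is enabled.

```php
<?php
// Hypothetical helper (not from the original post): DOM-based lookup on an
// HTML string, returning the text of the first matching element or null.
function getTagContentDom(string $html, string $tagname): ?string
{
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous error mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $items = $doc->getElementsByTagName($tagname);
    if ($items->length === 0) {
        return null;
    }

    // trim() drops the line breaks around a title that spans several lines.
    return trim($items->item(0)->nodeValue);
}

// Made-up sample input with the title spread over three lines.
$html = "<html><head><title>\n  Sample Page Title\n</title></head><body></body></html>";
echo getTagContentDom($html, 'title'), "\n";
```

Because the parser works on the element tree, it does not matter whether the tag is upper-case, has attributes, or spans several lines.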
Probably because the title is spread over multiple lines. You need to add the s modifier
so that the dot also matches line breaks.
$pattern = "/<$tagname.*?>(.*)<\/$tagname>/s";
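To see the difference, here is a quick comparison of the same pattern with and without s. The input string is made up for illustration; only the trailing modifier changes between the two calls.

```php
<?php
// Made-up sample input: the title element spans three lines.
$html = "<head>\n<title>\nSample Page Title\n</title>\n</head>";

// Without s, the dot stops at every line break, so nothing matches.
$withoutS = preg_match('/<title.*?>(.*)<\/title>/', $html, $m1);

// With s (PCRE_DOTALL), the dot also matches line breaks.
$withS = preg_match('/<title.*?>(.*)<\/title>/s', $html, $m2);

var_dump($withoutS);     // int(0)
var_dump($withS);        // int(1)
var_dump(trim($m2[1]));  // string(17) "Sample Page Title"
```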
The 'title' tag is not on the same line as its closing tag, so your preg_match doesn't find it.
In Perl, you can add the /s switch so that the dot also matches newlines and the pattern can span several lines; PHP's preg_match accepts the same s modifier after the closing delimiter.
But this is just one of the reasons why parsing XML and variants with regexp is a bad idea.
Write your PHP function getTagContent like this:
function getTagContent($string, $tagname) {
$pattern = '/<'.$tagname.'[^>]*>(.*?)<\/'.$tagname.'>/is';
preg_match($pattern, $string, $matches);
print_r($matches);
}
It is important to use the non-greedy match-all .*? for the text between the start and end of the tag. Equally important are the flags s for DOTALL (the dot matches newlines as well) and i for case-insensitive comparison.
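A small experiment shows what each piece buys you. The input is a made-up string with two upper-case b elements; the patterns follow the one in the function above with the tag name filled in.

```php
<?php
// Made-up sample: two elements of the same kind, upper-case tags, one line break.
$html = "<B>first</B> and\n<B>second</B>";

// Greedy (.*) runs on to the LAST closing tag.
preg_match('/<b[^>]*>(.*)<\/b>/is', $html, $greedy);

// Non-greedy (.*?) stops at the FIRST closing tag.
preg_match('/<b[^>]*>(.*?)<\/b>/is', $html, $lazy);

// Without i, the lower-case pattern does not match the upper-case tags at all.
$caseSensitive = preg_match('/<b[^>]*>(.*?)<\/b>/s', $html, $unused);

var_dump($greedy[1]);     // captures: first</B> and\n<B>second
var_dump($lazy[1]);       // captures: first
var_dump($caseSensitive); // int(0)
```

If you need every occurrence rather than just the first, preg_match_all takes the same pattern.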