The Better Solution
A regex is the wrong tool here. HTML is not a regular language, and cannot be accurately parsed using regular expressions. Use a DOM parser instead. Not only is it much easier, it's more accurate and reliable, and won't break when the format of the markup changes in the future.
This is how you would get the contents of a <span> tag using PHP's built-in DOMDocument class:
$dom = new DOMDocument;
$dom->loadHTML($yourHTMLString);
$result = $dom->getElementsByTagName('span')->item(0)->nodeValue;
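One caveat: item(0) returns null when the markup contains no <span>, so reading nodeValue from it directly will fail. Below is a minimal sketch of a guarded version. The sample string (including the "views" wording) is made up for illustration and is not from the question, and silencing libxml warnings is an optional choice rather than a requirement.

```php
<?php
// Hypothetical sample markup; substitute your own HTML string.
$yourHTMLString = '<div><span class="count">414,817 views</span></div>';

$dom = new DOMDocument();
// Collect libxml parse warnings internally instead of emitting them
// as PHP warnings when the markup is imperfect.
libxml_use_internal_errors(true);
$dom->loadHTML($yourHTMLString);
libxml_clear_errors();

// item(0) returns null when no <span> exists, so check before reading.
$span = $dom->getElementsByTagName('span')->item(0);
$result = $span !== null ? $span->nodeValue : null;

var_dump($result);
```

If $result comes back as null, the document had no <span> at all, which is usually worth handling explicitly rather than letting a later string function receive null.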
If there are multiple <span> tags, and you want to get the node values from all of them, you could simply use a foreach loop, like so:
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('span') as $tag) {
    echo $tag->nodeValue . PHP_EOL;
}
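If you would rather keep the values than print them, the same loop can fill an array. This is only a sketch; the two-span sample markup and the $values name are assumptions made for the example.

```php
<?php
// Hypothetical markup with several <span> elements.
$html = '<p><span>414,817 views</span> <span>12 likes</span></p>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Collect the text content of every <span>, in document order.
$values = [];
foreach ($dom->getElementsByTagName('span') as $tag) {
    $values[] = $tag->nodeValue;
}

print_r($values);
```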
And finally, to extract just the number from the node value, you have several options:
// Split on space, and get first part
echo explode(' ', $result, 2)[0];
// Replace everything that is not a digit or comma
echo preg_replace('/[^\d,]/', '', $result);
// Get everything before the first space
echo strstr($result, ' ', true);
// Remove everything after the first space
echo strtok($result, ' ');
All these statements will output 414,817. There's a whole host of string functions available for you to use, and you can choose whichever solution suits your requirements.
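To compare the options side by side, here is a small sketch that runs all four against one hypothetical node value ("414,817 views" is an assumed example, not the real data). The final line is an extra step that the options above do not cover: turning the string into an integer.

```php
<?php
// Hypothetical node value; the real text depends on your markup.
$result = '414,817 views';

$viaExplode = explode(' ', $result, 2)[0];            // part before the first space
$viaRegex   = preg_replace('/[^\d,]/', '', $result);  // keep only digits and commas
$viaStrstr  = strstr($result, ' ', true);             // text before the first space
$viaStrtok  = strtok($result, ' ');                   // first space-delimited token

var_dump($viaExplode, $viaRegex, $viaStrstr, $viaStrtok);

// For arithmetic, strip the thousands separator and cast to int.
$number = (int) str_replace(',', '', $viaExplode);
var_dump($number);
```

The options only agree while the text has the shape "number, space, rest". If there is no space, strstr() returns false while explode() and strtok() return the whole string, and preg_replace() keeps digits and commas from anywhere in the text, not just the leading number.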
The Regex-Based Solution
If you absolutely must use preg_match(), then you can use the following:
if (preg_match('#<span[^<>]*>([\d,]+).*?</span>#', $html, $matches)) {
    echo $matches[1];
}
[^<>]* means "match any number of characters except angle brackets", ensuring that we don't accidentally break out of the tag we're in.
.*? (note the ?) means "match any number of characters, but as few as possible". This avoids matching from the first <span> to the last </span> in the markup (if there are multiple <span>s).
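The difference the ? makes is easiest to see with two spans in one string. The sketch below assumes the pattern is meant to run from an opening <span ...> to its closing </span>; the sample markup and the $lazy and $greedy names are made up for the example.

```php
<?php
// Hypothetical markup with two <span> elements on one line.
$html = '<span>414,817 views</span> <span>12 likes</span>';

// Lazy .*? stops at the first closing tag it can reach.
preg_match('#<span[^<>]*>([\d,]+).*?</span>#', $html, $lazy);

// Greedy .* runs on to the last closing tag in the string.
preg_match('#<span[^<>]*>([\d,]+).*</span>#', $html, $greedy);

var_dump($lazy[0], $greedy[0]);
```

Here the lazy match covers only the first span, while the greedy match swallows both. The captured number in group 1 is the same either way, so the laziness matters mainly when you use the whole match or add more groups after the number.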
I make absolutely no guarantees that the regex will always work, but it should be enough for those who want to finish up a one-off job. In such cases, it's probably better to go with a regex that works on sane things than weep about things not being universally perfect :)