I'm trying to extract a specific piece of information from different HTML pages. Basically, the information is a 10-digit number, which may appear in different forms such as:
000
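<?php
// Optional leading +, one digit, 5-12 digits or separators, then a non-zero digit.
// The hyphen sits last in the class so it is read as a literal, not a range.
preg_match_all("/\+?[0-9][\d()\s+-]{5,12}[1-9]/", $string, $matches);
print_r($matches);
?>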
Consider other delimiters besides hyphens, not to mention parentheses.
(?:1\s*?[-.]?\s*)?(?:\(\s*\d{3}\s*\)|\d{3})\s*?[-.]?\s*\d{3}\s*?[-.]?\s*\d{4}\b
Okay, maybe that's more comprehensive than you need, but really this can get as complicated as you like. You can expand it to look for international phone numbers, extensions, and so forth, but that might not be worth it for you.
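As a rough sketch of how that pattern could be applied to a page, the snippet below runs it through preg_match_all and reduces each hit to bare digits. The sample markup and the variable names ($html, $text, $numbers) are made up for illustration, and stripping tags first is an assumption about the input rather than a requirement.

```php
<?php
// Hypothetical sample input; in practice this would be the fetched page.
$html = '<p>Call (520) 555-5542 or 1-520-555-0199.</p><p>Fax: 520.555.0100</p>';

// Remove markup so that only the visible text is searched.
$text = strip_tags($html);

// The pattern from above, wrapped in / delimiters.
$pattern = '/(?:1\s*?[-.]?\s*)?(?:\(\s*\d{3}\s*\)|\d{3})\s*?[-.]?\s*\d{3}\s*?[-.]?\s*\d{4}\b/';

preg_match_all($pattern, $text, $matches);

// $matches[0] holds the full matches; drop every non-digit to normalize them.
$numbers = array_map(function ($m) {
    return preg_replace('/\D/', '', $m);
}, $matches[0]);

print_r($numbers);
?>
```

With this input the three numbers come back as 5205555542, 15205550199 and 5205550100. Note that the pattern has no leading word boundary, so it can also start matching in the middle of a longer run of digits.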
\b[0-9]{3}\s*[-]?\s*[0-9]{3}\s*[-]?\s*[0-9]{4}\b
Edit: added word boundaries.
This will match on all three examples you listed.
(\d{3}\s*-?\s*\d{3}\s*-?\s*\d{4})
Here's a good starting point:
<?php
// all on one line...
$regex = '/^(?:1(?:[. -])?)?(?:\((?=\d{3}\)))?([2-9]\d{2})(?:(?<=\(\d{3})\))? ?(?:(?<=\d{3})[.-])?([2-9]\d{2})[. -]?(\d{4})(?: (?i:ext)\.? ?(\d{1,5}))?$/';
// or broken up
$regex = '/^(?:1(?:[. -])?)?(?:\((?=\d{3}\)))?([2-9]\d{2})'
.'(?:(?<=\(\d{3})\))? ?(?:(?<=\d{3})[.-])?([2-9]\d{2})'
.'[. -]?(\d{4})(?: (?i:ext)\.? ?(\d{1,5}))?$/';
?>
Note the non-capturing subpatterns (which look like (?:stuff)). Only the area code, exchange, line number, and extension are captured, as $1 through $4, which makes formatting easy:
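<?php
// Note: this always appends " ext. ", even when no extension was matched.
$formatted = preg_replace($regex, '($1) $2-$3 ext. $4', $phoneNumber);
// or, provided you use the $matches argument in preg_match
$formatted = "($matches[1]) $matches[2]-$matches[3]";
// Group 4 is absent from $matches when the number has no extension.
if (!empty($matches[4])) $formatted .= " ext. $matches[4]";
?>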
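Putting the validation and formatting steps together might look something like the sketch below. The function name formatPhone() is hypothetical, and returning null for input that does not match is just one possible convention.

```php
<?php
// Hypothetical helper; the name and the null-on-failure convention are
// illustrative assumptions, not part of the original answer.
function formatPhone(string $phoneNumber): ?string
{
    $regex = '/^(?:1(?:[. -])?)?(?:\((?=\d{3}\)))?([2-9]\d{2})'
        . '(?:(?<=\(\d{3})\))? ?(?:(?<=\d{3})[.-])?([2-9]\d{2})'
        . '[. -]?(\d{4})(?: (?i:ext)\.? ?(\d{1,5}))?$/';

    if (!preg_match($regex, $phoneNumber, $matches)) {
        return null; // not a recognizable number
    }

    $formatted = "($matches[1]) $matches[2]-$matches[3]";

    // Group 4 is only present in $matches when an extension was found.
    if (isset($matches[4])) {
        $formatted .= " ext. $matches[4]";
    }

    return $formatted;
}

echo formatPhone('520.555.5542'), "\n";         // (520) 555-5542
echo formatPhone('5205555542 Ext. 7712'), "\n"; // (520) 555-5542 ext. 7712
var_dump(formatPhone('520) 555-5542'));         // NULL
?>
```

Using preg_match here rather than preg_replace avoids the dangling " ext. " suffix on numbers that have no extension.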
And some example results for you:
520-555-5542 :: MATCH
520.555.5542 :: MATCH
5205555542 :: MATCH
520 555 5542 :: MATCH
520) 555-5542 :: FAIL
(520 555-5542 :: FAIL
(520)555-5542 :: MATCH
(520) 555-5542 :: MATCH
(520) 555 5542 :: MATCH
520-555.5542 :: MATCH
520 555-0555 :: MATCH
(520)5555542 :: MATCH
520.555-4523 :: MATCH
19991114444 :: FAIL
19995554444 :: MATCH
514 555 1231 :: MATCH
1 555 555 5555 :: MATCH
1.555.555.5555 :: MATCH
1-555-555-5555 :: MATCH
520-555-5542 ext.123 :: MATCH
520.555.5542 EXT 123 :: MATCH
5205555542 Ext. 7712 :: MATCH
520 555 5542 ext 5 :: MATCH
520) 555-5542 :: FAIL
(520 555-5542 :: FAIL
(520)555-5542 ext .4 :: FAIL
(512) 555-1234 ext. 123 :: MATCH
1(555)555-5555 :: MATCH
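A list like the one above can be reproduced with a small loop. This is only a sketch: the $samples array holds a handful of the inputs shown, and the $results variable is introduced here for illustration.

```php
<?php
$regex = '/^(?:1(?:[. -])?)?(?:\((?=\d{3}\)))?([2-9]\d{2})'
    . '(?:(?<=\(\d{3})\))? ?(?:(?<=\d{3})[.-])?([2-9]\d{2})'
    . '[. -]?(\d{4})(?: (?i:ext)\.? ?(\d{1,5}))?$/';

// A few inputs taken from the list above.
$samples = [
    '520-555-5542',
    '520) 555-5542',
    '(520)555-5542 ext .4',
    '19991114444',
    '1(555)555-5555',
    '520 555 5542 ext 5',
];

$results = [];
foreach ($samples as $sample) {
    // preg_match returns 1 when the pattern matches and 0 when it does not.
    $results[$sample] = preg_match($regex, $sample) === 1 ? 'MATCH' : 'FAIL';
    echo $sample, ' :: ', $results[$sample], "\n";
}
?>
```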
You'll probably get a lot of false positives if you allow spaces and dashes like you're suggesting.
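To illustrate the concern, here is a small sketch that runs the permissive word-boundary pattern from earlier in the thread over some made-up text. The sample string is hypothetical; only its first item is meant to be a phone number.

```php
<?php
// Permissive pattern: three digits, three digits, four digits,
// with optional whitespace and hyphens in between.
$pattern = '/\b[0-9]{3}\s*[-]?\s*[0-9]{3}\s*[-]?\s*[0-9]{4}\b/';

// Hypothetical page text containing one phone number and two look-alikes.
$text = 'Phone: 520-555-5542. Order no. 1234567890. Sizes: 100 200 3000 mm.';

preg_match_all($pattern, $text, $matches);
print_r($matches[0]);
?>
```

All three items are reported, so the order number and the list of sizes come back alongside the real phone number. Stricter rules, such as requiring the first digit of the area code and exchange to be 2-9 as the anchored pattern above does, would reject some of these but not all.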