PHP regex to extract data from HTML table

春和景丽 2020-12-30 18:07

I'm trying to write a regex for extracting some data from a table.

The code I've got now is:

quote1
5 Answers
  • 2020-12-30 18:46

    If you really want to use regexes (which might be OK if you are really sure your string will always be formatted like that), what about something like this for your case:

    $str = <<<A
    <table>
       <tr>
         <td>quote1</td>
         <td>have you trying it off and on again ?</td>
       </tr>
       <tr>
         <td>quote65</td>
         <td>You wouldn't steal a helmet of a policeman</td>
       </tr>
    </table>
    A;
    
    $matches = array();
    preg_match_all('#<tr>\s+?<td>(.*?)</td>\s+?<td>(.*?)</td>\s+?</tr>#', $str, $matches);
    
    var_dump($matches);
    

    A few words about the regex :

    • <tr>
    • then one or more whitespace characters (\s+?)
    • then <td>
    • then what you want to capture
    • then </td>
    • and the same again
    • and finally, </tr>

    And I use :

    • ? after a quantifier (as in .*? and \s+?) to match in non-greedy mode
    • preg_match_all to get all the matches
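
    To see what the non-greedy ? changes, here is a minimal sketch. The one-line $row string and the variable names $greedy and $lazy are illustrative assumptions, not part of the original data:

    ```php
    <?php
    // Hypothetical one-line row, used only to contrast greedy and non-greedy matching.
    $row = '<td>quote1</td><td>some text</td>';

    // Greedy: .* runs on to the LAST </td> in the string.
    preg_match('#<td>(.*)</td>#', $row, $greedy);

    // Non-greedy: .*? stops at the FIRST </td>.
    preg_match('#<td>(.*?)</td>#', $row, $lazy);

    echo $greedy[1], "\n"; // quote1</td><td>some text
    echo $lazy[1], "\n";   // quote1
    ```

    Without the ?, two cells on the same line would be swallowed into a single capture.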

    You then get the results you want in $matches[1] and $matches[2] (not $matches[0], which holds the full matches); here's the output of the var_dump I used (I've removed entry 0 to make it shorter):

    array
      0 => 
        ...
      1 => 
        array
          0 => string 'quote1' (length=6)
          1 => string 'quote65' (length=7)
      2 => 
        array
          0 => string 'have you trying it off and on again ?' (length=37)
          1 => string 'You wouldn't steal a helmet of a policeman' (length=42)
    

    You then just need to manipulate this array, with some string concatenation or the like; for instance:

    $num = count($matches[1]);
    for ($i=0 ; $i<$num ; $i++) {
        echo $matches[1][$i] . ':' . $matches[2][$i] . '<br />';
    }
    

    And you get :

    quote1:have you trying it off and on again ?
    quote65:You wouldn't steal a helmet of a policeman
    

    Note: you should add some sanity checks. preg_match_all returns the number of full matches, or false on error, so check that it did not return false and that the count is at least 1.
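
    Those checks could look roughly like the sketch below. It reuses the regex from above on a one-row version of the table; the names $pattern, $count and $lines, and the wording of the messages, are assumptions made for illustration:

    ```php
    <?php
    // One-row version of the sample table, built line by line.
    $str = implode("\n", [
        '<table>',
        '  <tr>',
        '    <td>quote1</td>',
        '    <td>have you trying it off and on again ?</td>',
        '  </tr>',
        '</table>',
    ]);

    $pattern = '#<tr>\s+?<td>(.*?)</td>\s+?<td>(.*?)</td>\s+?</tr>#';
    $count = preg_match_all($pattern, $str, $matches);

    $lines = [];
    if ($count === false) {
        // false means the regex engine itself failed (e.g. a backtrack limit was hit).
        $lines[] = 'regex error code: ' . preg_last_error();
    } elseif ($count === 0) {
        // The pattern ran fine but found no rows in the expected shape.
        $lines[] = 'no rows matched';
    } else {
        for ($i = 0; $i < $count; $i++) {
            $lines[] = $matches[1][$i] . ':' . $matches[2][$i];
        }
    }
    echo implode("\n", $lines), "\n";
    ```

    Comparing with === keeps the error case (false) apart from the legitimate "zero matches" case (0).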

    As a side note : using regex to parse HTML is generally not a really good idea ; if you can use a real parser, it should be way safer...

  • 2020-12-30 18:46

    Don't use regex; use an HTML parser, such as the PHP Simple HTML DOM Parser.

  • 2020-12-30 18:51

    As usual, extracting text from HTML and other non-regular languages should be done with a parser - regexes can cause problems here. But if you're certain of your data's structure, you could use

    %<td>((?s).*?)</td>\s*<td>((?s).*?)</td>%
    

    to find the two pieces of text. \1:\2 would then be the replacement.

    If the text cannot span more than one line, you'd be safer dropping the (?s) bits...
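
    In PHP, that replacement might be applied with preg_replace, as in this sketch. The single-row $html string is an assumed sample, and the strip_tags/trim cleanup is an extra step added here, since the pattern leaves the surrounding <tr> tags in place:

    ```php
    <?php
    // Assumed sample input: a single table row.
    $html = "<tr>\n<td>quote1</td>\n<td>have you trying it off and on again ?</td>\n</tr>";

    $pattern = '%<td>((?s).*?)</td>\s*<td>((?s).*?)</td>%';

    // '$1:$2' is the form the PHP manual prefers for the \1:\2 replacement.
    $result = preg_replace($pattern, '$1:$2', $html);

    // The <tr> tags are not covered by the pattern, so remove them separately.
    $line = trim(strip_tags($result));

    echo $line, "\n"; // quote1:have you trying it off and on again ?
    ```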

  • 2020-12-30 19:01

    Tim's regex probably works, but you may want to consider using the DOM functionality of PHP instead of regex, as it may be more reliable in dealing with minor changes in the markup.

    See the DOMDocument::loadHTML method.
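
    A minimal sketch of the DOM approach, assuming the two-row table from the first answer and PHP's bundled DOM extension; the variable names are illustrative:

    ```php
    <?php
    // Sample table, same data as in the first answer.
    $html = '<table>'
          . '<tr><td>quote1</td><td>have you trying it off and on again ?</td></tr>'
          . '<tr><td>quote65</td><td>You wouldn\'t steal a helmet of a policeman</td></tr>'
          . '</table>';

    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $lines = [];
    foreach ($doc->getElementsByTagName('tr') as $row) {
        $cells = $row->getElementsByTagName('td');
        // Skip rows that do not have the two expected cells.
        if ($cells->length >= 2) {
            $lines[] = trim($cells->item(0)->textContent) . ':' . trim($cells->item(1)->textContent);
        }
    }
    echo implode("\n", $lines), "\n";
    ```

    Because the parser works on the element tree, extra whitespace or attributes on the <tr> and <td> tags do not affect the result.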

  • 2020-12-30 19:01

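    Extract the content of each <td>:

        // The s modifier lets . match newlines; [^>]* skips any attributes on the tag.
        preg_match_all('%<td\b[^>]*>(.*?)</td>%s', $response, $matches);
        var_dump($matches); // $matches[1] holds the cell contents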
    