HTML table to array PHP

广开言路 2021-01-15 00:23

I have a school calendar online, but I want to use it in my own application. Unfortunately I can't get it working with PHP and regex.

The problem is that the table

2 answers
  • 2021-01-15 01:07

    Please use an HTML parser to extract the values. PHP Simple HTML DOM Parser is worth a shot: http://simplehtmldom.sourceforge.net/
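
    As a rough illustration of the parser approach, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath instead of the library linked above (a swapped-in technique, chosen only so the example needs no extra download). The sample markup is hypothetical and much simpler than the real timetable page.

```php
<?php
// Hypothetical sample markup; the real timetable page is not reproduced here.
$html = "<table><tr><td class='heading'>1</td><td class='value'>3D<br/>009<br/>Hk</td></tr>"
      . "<tr><td class='heading'>2</td><td class='empty'></td></tr></table>";

$doc = new DOMDocument();
// Collect libxml warnings internally instead of printing them;
// real-world HTML is rarely valid.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$cells = array();
foreach ($xpath->query('//tr') as $tr) {
    $row = array();
    // 'td' is evaluated relative to the current row
    foreach ($xpath->query('td', $tr) as $td) {
        $row[] = trim($td->textContent);
    }
    $cells[] = $row;
}

print_r($cells);
```

    Note that textContent drops the br tags, so the lines inside a cell run together, and a parser on its own does not expand cells that carry a rowspan.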

  • 2021-01-15 01:14

    Good luck with this one, it's going to be tricky. Just "using an HTML parser" isn't going to avoid the major problem, which is that the table uses rowspans. It is always good advice to use an HTML parser for large amounts of HTML, but if you can break that HTML down into smaller, reliable chunks, then other techniques are usually lighter and faster (though obviously more prone to breaking on subtle, unexpected differences in the HTML).

    Normalise the table

    If it were me I'd start with something that can detect where your table starts and ends (as I wouldn't want to parse the entire page even when using a HTML Parser if I don't need to):

    $table = $start = $end = false;
    /// 'Vrijdag' should be unique enough, but will fail if it appears elsewhere
    $pos = strpos($html, 'Vrijdag');
    /// find your start and end based on reliable tags
    if ( $pos !== false ) {
      $start = stripos($html, '<tr>', $pos);
      if ( $start !== false ) {
        $end = stripos($html, '</table>', $start);
      }
    }
    
    if ( $start !== false && $end !== false ) {
      /// we can now grab our table $html;
      $table = substr($html, $start, $end - $start);
    }
    

    Then due to the haphazard way the cells are spanned vertically (but seem to be uniform horizontally) I would choose a 'day' column and work downwards.

    if ( $table ) {
      /// break apart based on rows
      $rows = preg_split('#</tr>#i', $table);
      ///
      foreach ( $rows as $key => $row ) {
        $rows[$key] = preg_split('#</td>#i', $row);
      }
    }
    

    The above should give you something like:

    array (
      '0' => array (
        '0' => "<td class='heading'>1",
        '1' => "<td rowspan='1' class='empty'>",
        '2' => "<td rowspan='5' class='value'>3D<br/>009<br/>Hk<br/><br/><br/>",
        ...
      ),
      '1' => array (
        '0' => "<td class='heading'>2",
        '1' => "<td rowspan='2' class='empty'>",
        '2' => "<td rowspan='3' class='value'>Hk<br/>",
        ...
      ),
    )
    

    Now that you have that, you can scan across each row, and where you preg_match a rowspan, you'd have to create a copy of that cell's information into the row below (in the right place) so as to actually create a complete table structure (without rowspans).

    /// can't use foreach here because we want to modify the array within the loop
    $lof = count($rows);
    for ( $rkey=0; $rkey<$lof; $rkey++ ) {
      /// pull out the row
      $row = $rows[$rkey];
      foreach ( $row as $ckey => $cell ) {
        if ( preg_match('/ rowspan=.([0-9]+)./', $cell, $regs) ) {
          $rowspan = (int) $regs[1];
          /// only copy down when there actually is a next row to copy into
          if ( $rowspan > 1 && isset($rows[$rkey+1]) ) {
            /// there was a gotcha here: I realised afterwards I was constructing
            /// a replacement pattern that looked like '$14$2', which meant
            /// the system tried to find a group at offset 14. To get around this
            /// problem, PHP allows the group reference numbers to be wrapped in {},
            /// so we now get the values of '$1' and '$2' inserted around a literal number
            $newcell = preg_replace('/( rowspan=.)[0-9]+(.)/', '${1}'.($rowspan-1).'${2}', $cell);
            /// array_splice takes (array, offset, length, replacement); a length
            /// of 0 inserts the new cell without removing anything
            array_splice( $rows[$rkey+1], $ckey, 0, $newcell );
          }
        }
      }
    }
    

    The above should normalise the table so that the rowspans are no longer a problem.

    (Please note the above is theoretical code: I typed it by hand and have not tested it yet, which I will do shortly.)
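
    To make the effect of the normalisation concrete, here is a small self-contained sketch run on a hypothetical two-row fragment (the cell values are made up, not taken from the real timetable). It uses the same split-and-copy idea, plus PREG_SPLIT_NO_EMPTY to drop the empty remnants the splits leave behind, which is a small deviation from the code above.

```php
<?php
// Hypothetical fragment: the second cell of the first row spans two rows,
// so the second row is one cell short in the raw markup.
$table = "<tr><td>1</td><td rowspan='2'>Hk</td><td>En</td></tr>"
       . "<tr><td>2</td><td>Wi</td></tr>";

// Split into rows, then cells, dropping empty trailing pieces.
$rows = array();
foreach (preg_split('#</tr>#i', $table, -1, PREG_SPLIT_NO_EMPTY) as $row) {
    $rows[] = preg_split('#</td>#i', $row, -1, PREG_SPLIT_NO_EMPTY);
}

// Copy every spanning cell into the row below, with its rowspan reduced by one.
$count = count($rows);
for ($r = 0; $r < $count - 1; $r++) {
    foreach ($rows[$r] as $c => $cell) {
        if (preg_match('/ rowspan=.([0-9]+)./', $cell, $m) && (int) $m[1] > 1) {
            $copy = preg_replace(
                '/( rowspan=.)[0-9]+(.)/',
                '${1}' . ((int) $m[1] - 1) . '${2}',
                $cell
            );
            // length 0: insert at column $c without removing anything
            array_splice($rows[$r + 1], $c, 0, array($copy));
        }
    }
}

// Strip the leftover tags so only the cell text remains.
foreach ($rows as $r => $row) {
    $rows[$r] = array_map(function ($cell) {
        return trim(strip_tags($cell));
    }, $row);
}

print_r($rows);
```

    After this runs, both rows have three cells and 'Hk' appears in the second column of each, so the columns line up again.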

    After testing

    There were a few small bugs in the above, which I have now fixed, namely getting the arguments of certain PHP functions the wrong way round. After sorting those out it seems to work:

    /// grab the html
    $html = file_get_contents('http://www.cibap.nl/beheer/modules/roosters/create_rooster.php?element=CR13A&soort=klas&week=37&jaar=2012');
    
    /// start with nothing
    $table = $start = $end = false;
    /// 'Vrijdag' should be unique enough, but will fail if it appears elsewhere
    $pos = strpos($html, 'Vrijdag');
    
    /// find your start and end based on reliable tags
    if ( $pos !== false ) {
      $start = stripos($html, '<tr>', $pos);
      if ( $start !== false ) {
        $end = stripos($html, '</table>', $start);
      }
    }
    
    /// make sure we have a start and end
    if ( $start !== false && $end !== false ) {
      /// we can now grab our table $html;
      $table = substr($html, $start, $end - $start);
      /// convert brs (<br>, <br/> or <br />) to something that won't be removed by strip_tags
      $table = preg_replace('#<br\s*/?>#i', "\n", $table);
    }
    
    if ( $table ) {
      /// break apart based on rows (a close tr is quite reliable to find)
      $rows = preg_split('#</tr>#i', $table);
      /// break apart the cells (a close td is quite reliable to find)
      foreach ( $rows as $key => $row ) {
        $rows[$key] = preg_split('#</td>#i', $row);
      }
    }
    else {
      /// create so we avoid errors
      $rows = array();
    }
    
    /// changed this here from a foreach to a for because it seems
    /// foreach was working from a copy of $rows and so any modifications
    /// we made to $rows while the loop was happening were ignored.
    $lof = count($rows);
    for ( $rkey=0; $rkey<$lof; $rkey++ ) {
      /// pull out the row
      $row = $rows[$rkey];
      /// step each cell in the row
      foreach ( $row as $ckey => $cell ) {
        /// pull out our rowspan value
        if ( preg_match('/ rowspan=.([0-9]+)./', $cell, $regs) ) {
          /// if rowspan is greater than one (i.e. spread across multirows)
          $rowspan = (int) $regs[1];
          /// only copy down when there actually is a next row to copy into
          if ( $rowspan > 1 && isset($rows[$rkey+1]) ) {
            /// then copy this cell into the next row down, but decrease its rowspan
            /// so that when we find it in the next row we know how many more times
            /// it should span down.
            $newcell = preg_replace('/( rowspan=.)([0-9]+)(.)/', '${1}'.($rowspan-1).'${3}', $cell);
            array_splice( $rows[$rkey+1], $ckey, 0, $newcell );
          }
        }
      }
    }
    
    /// now finally step through the normalised table, get rid of the unwanted
    /// tags that remain and, at the same time, split our values into something more useful
    foreach ( $rows as $rkey => $row ) {
      foreach ( $row as $ckey => $cell ) {
        $rows[$rkey][$ckey] = preg_split('/\n+/',trim(strip_tags( $cell )));
      }
    }
    
    /// <xmp> is obsolete, so use <pre> and escape the output instead
    echo '<pre>';
    echo htmlspecialchars(print_r($rows, true));
    echo '</pre>';
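
    Once the table is normalised, picking a single day and working downwards is a plain column lookup. The sketch below assumes hypothetical data in the shape the script produces ($rows[row][column] is a list of text lines) and assumes that column 0 holds the lesson hour and column 2 holds the day of interest; the real column positions would need checking against the actual page.

```php
<?php
// Hypothetical normalised result; values are made up for illustration.
$rows = array(
    array(array('1'), array(''), array('3D', '009', 'Hk')),
    array(array('2'), array(''), array('3D', '009', 'Hk')),
    array(array('')), // trailing remnant left by the split on </tr>
);

// Assumed layout: column 0 = lesson hour, column 2 = the chosen day.
$dayColumn = 2;
$day = array();
foreach ($rows as $row) {
    if (!isset($row[$dayColumn])) {
        continue; // skip short or empty rows
    }
    $hour = $row[0][0];
    // drop empty lines and reindex what is left
    $day[$hour] = array_values(array_filter($row[$dayColumn], 'strlen'));
}

print_r($day);
```

    The result is keyed by lesson hour, which is usually a more convenient shape for an application than the raw grid.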
    