HTML table to array PHP

I have a school calendar online, but I want to have it in my own application. Unfortunately I can't get it working with PHP and regex.

The problem is that the table

Normalise the table

If it were me, I'd start with something that detects where your table starts and ends (I wouldn't want to parse the entire page, even with an HTML parser, if I don't need to):

$table = $start = $end = false;
/// 'Vrijdag' should be unique enough, but will fail if it appears elsewhere
$pos = strpos($html, 'Vrijdag');
/// find your start and end based on reliable tags
if ( $pos !== false ) {
  $start = stripos($html, '<tr>', $pos);
  if ( $start !== false ) {
    $end = stripos($html, '</table>', $start);
  }
}

if ( $start !== false && $end !== false ) {
  /// we can now grab our table $html;
  $table = substr($html, $start, $end - $start);
}
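
To see what the extraction step returns, here is a minimal sketch run against a made-up page. The sample markup is an assumption for illustration only; the real timetable HTML will differ.

```php
<?php
// Hypothetical sample page, not the real timetable markup.
$html = '<html><body><table><tr><th>Maandag</th><th>Vrijdag</th></tr>'
      . '<tr><td>1</td><td>Hk</td></tr></table><p>footer</p></body></html>';

$table = $start = $end = false;

// Anchor on the last day heading, then take everything from the next
// <tr> up to (but not including) the closing </table>.
$pos = strpos($html, 'Vrijdag');
if ($pos !== false) {
    $start = stripos($html, '<tr>', $pos);
    if ($start !== false) {
        $end = stripos($html, '</table>', $start);
    }
}

if ($start !== false && $end !== false) {
    $table = substr($html, $start, $end - $start);
}

echo $table, "\n";
```

Only the body rows come back; the heading row and everything after the table are left out, which keeps the later splitting steps small.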

Then, because the cells span rows haphazardly (but seem to be uniform horizontally), I would choose a 'day' column and work downwards.

if ( $table ) {
  /// break apart based on rows
  $rows = preg_split('#</tr>#i', $table);
  /// break each row apart into cells
  foreach ( $rows as $key => $row ) {
    $rows[$key] = preg_split('#</td>#i', $row);
  }
}

The above should give you something like:

array (
  '0' => array (
    '0' => "<td class='heading'>1",
    '1' => "<td rowspan='1' class='empty'>",
    '2' => "<td rowspan='5' class='value'>3D<br/>009<br/>Hk<br/><br/><br/>",
    ...
  ),
  '1' => array (
    '0' => "<td class='heading'>2",
    '1' => "<td rowspan='2' class='empty'>",
    '2' => "<td rowspan='3' class='value'>Hk<br/>",
    ...
  ),
)

Now that you have that, you can scan across each row, and where you preg_match a rowspan, you'd have to create a copy of that cell's information into the row below (in the right place) so as to actually create a complete table structure (without rowspans).

/// can't use foreach here because we want to modify the array within the loop
$lof = count($rows);
for ( $rkey=0; $rkey<$lof; $rkey++ ) {
  /// pull out the row
  $row = $rows[$rkey];
  foreach ( $row as $ckey => $cell ) {
    if ( preg_match('/ rowspan=.([0-9]+)./', $cell, $regs) ) {
      $rowspan = (int) $regs[1];
      if ( $rowspan > 1 ) {
        /// there was a gotcha here: I realised afterwards I was constructing
        /// a replacement pattern that looked like '$14$2', which meant
        /// the system tried to find a group at offset 14. To get around this
        /// problem, PHP allows the group reference numbers to be wrapped with {},
        /// so '$1' and '$2' are now inserted around a literal number
        $newcell = preg_replace('/( rowspan=.)[0-9]+(.)/', '${1}'.($rowspan-1).'${2}', $cell);
        /// a length of 0 inserts the cell without removing anything
        array_splice( $rows[$rkey+1], $ckey, 0, $newcell );
      }
    }
  }
}

The above should normalise the table so that the rowspans are no longer a problem.
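
The replacement-string gotcha mentioned in the comments is easy to reproduce in isolation. A small sketch, using a made-up cell value:

```php
<?php
// Hypothetical cell, shaped like the ones produced by the split above.
$cell = "<td rowspan='5' class='value'>Hk";
$rowspan = 5;

// Ambiguous: the replacement string ends up as '$14$2', so the digits
// after the dollar sign are read as a reference to group 14, not group 1.
$broken = preg_replace('/( rowspan=.)[0-9]+(.)/', '$1' . ($rowspan - 1) . '$2', $cell);

// Braces delimit the group number, giving '${1}4${2}'.
$fixed = preg_replace('/( rowspan=.)[0-9]+(.)/', '${1}' . ($rowspan - 1) . '${2}', $cell);

echo $fixed, "\n";
```

Note the single quotes around `'${1}'`: in a double-quoted PHP string the `${...}` form would be treated as variable interpolation instead of reaching preg_replace untouched.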

(Please note the above is theoretical code: I typed it by hand and have yet to test it, which I will do shortly.)

After testing

There were a few little bugs in the above, which I have since updated, namely getting the arguments for certain PHP functions the wrong way round. After sorting those out it seems to work:

/// grab the html
$html = file_get_contents('http://www.cibap.nl/beheer/modules/roosters/create_rooster.php?element=CR13A&soort=klas&week=37&jaar=2012');

/// start with nothing
$table = $start = $end = false;
/// 'Vrijdag' should be unique enough, but will fail if it appears elsewhere
$pos = strpos($html, 'Vrijdag');

/// find your start and end based on reliable tags
if ( $pos !== false ) {
  $start = stripos($html, '<tr>', $pos);
  if ( $start !== false ) {
    $end = stripos($html, '</table>', $start);
  }
}

/// make sure we have a start and end
if ( $start !== false && $end !== false ) {
  /// we can now grab our table $html;
  $table = substr($html, $start, $end - $start);
  /// convert brs to something that won't be removed by strip_tags
  $table = preg_replace('#<br ?/>#i', "\n", $table);
}

if ( $table ) {
  /// break apart based on rows (a close tr is quite reliable to find)
  $rows = preg_split('#</tr>#i', $table);
  /// break apart the cells (a close td is quite reliable to find)
  foreach ( $rows as $key => $row ) {
    $rows[$key] = preg_split('#</td>#i', $row);
  }
}
else {
  /// create so we avoid errors
  $rows = array();
}

/// changed this here from a foreach to a for because it seems
/// foreach was working from a copy of $rows and so any modifications
/// we made to $rows while the loop was happening were ignored.
$lof = count($rows);
for ( $rkey=0; $rkey<$lof; $rkey++ ) {
  /// pull out the row
  $row = $rows[$rkey];
  /// step each cell in the row
  foreach ( $row as $ckey => $cell ) {
    /// pull out our rowspan value
    if ( preg_match('/ rowspan=.([0-9]+)./', $cell, $regs) ) {
      /// if rowspan is greater than one (i.e. spread across multirows)
      $rowspan = (int) $regs[1];
      if ( $rowspan > 1 ) {
        /// then copy this cell into the next row down, but decrease its rowspan
        /// so that when we find it in the next row we know how many more times
        /// it should span down.
        $newcell = preg_replace('/( rowspan=.)([0-9]+)(.)/', '${1}'.($rowspan-1).'${3}', $cell);
        array_splice( $rows[$rkey+1], $ckey, 0, $newcell );
      }
    }
  }
}

/// now finally step through the normalised table and get rid of the unwanted
/// tags that remain; at the same time split our values into something more useful
foreach ( $rows as $rkey => $row ) {
  foreach ( $row as $ckey => $cell ) {
    $rows[$rkey][$ckey] = preg_split('/\n+/',trim(strip_tags( $cell )));
  }
}

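/// <xmp> is obsolete HTML, so use <pre> and escape the output instead
echo '<pre>';
echo htmlspecialchars(print_r($rows, true));
echo '</pre>';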
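
If you want to try the approach without fetching the live page, the same steps can be wrapped in a function and run against a tiny fixture. This is only a sketch: the name `normalise_rowspans` and the two-row fixture are made up for illustration, and unlike the code above it passes `PREG_SPLIT_NO_EMPTY` so the empty trailing pieces from the splits are dropped.

```php
<?php
// Hypothetical helper wrapping the split / copy-down / strip steps.
function normalise_rowspans(string $table): array
{
    // Keep line breaks as newlines so strip_tags() does not lose them.
    $table = preg_replace('#<br ?/>#i', "\n", $table);

    // Split into rows, then each row into cells.
    $rows = preg_split('#</tr>#i', $table, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($rows as $key => $row) {
        $rows[$key] = preg_split('#</td>#i', $row, -1, PREG_SPLIT_NO_EMPTY);
    }

    // Copy every cell with rowspan > 1 into the row below, one step at a time.
    $count = count($rows);
    for ($r = 0; $r < $count; $r++) {
        foreach ($rows[$r] as $c => $cell) {
            if ($r + 1 < $count
                && preg_match('/ rowspan=.([0-9]+)./', $cell, $m)
                && (int) $m[1] > 1
            ) {
                $copy = preg_replace(
                    '/( rowspan=.)[0-9]+(.)/',
                    '${1}' . ((int) $m[1] - 1) . '${2}',
                    $cell
                );
                array_splice($rows[$r + 1], $c, 0, [$copy]);
            }
        }
    }

    // Strip the leftover tags and split each cell into its lines.
    foreach ($rows as $r => $row) {
        foreach ($row as $c => $cell) {
            $rows[$r][$c] = preg_split('/\n+/', trim(strip_tags($cell)));
        }
    }
    return $rows;
}

// Made-up fixture: the second cell of row one spans two rows.
$table = "<tr><td>1</td><td rowspan='2'>3D<br/>Hk</td></tr>"
       . "<tr><td>2</td></tr>";

$rows = normalise_rowspans($table);
print_r($rows[1]);
```

After the call, the second row carries its own copy of the spanned cell, so every row can be read column by column without looking at the rows above it.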
