Extract data from HTML table row column

Question

How can I extract data from an HTML table in PHP? The data is in this format:

Table 1

<tr><td class="body" valign="top"><a href="example"><b>DATA</b></a></td><td class="body" valign="top">Data_Text</td></tr>

Table 2

<tr><th><div id="Data">Data</div></th><td>Data_Text_1</td><td>Data_Text_2</td></tr>

Table 3

<tr><td width="120"><a href="example" target="_blank">DATA</a></td><td>Data_Text</td></tr>

I want to get DATA and Data_Text (or Data_Text_1 and Data_Text_2) from each of the three tables.
I've tried:

$html = file_get_contents($link);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$nodes  = $xpath->query('//td[]');
$nodes2 = $xpath->query('//td[]');

But it doesn't return any data.

I'll offer a bounty on this question the day after tomorrow.

Answer 1:

Using simple_html_dom.php (the PHP Simple HTML DOM Parser), iterate over the rows:

<?php

include 'simple_html_dom.php';

$html = file_get_html('thetable.html');

// Print the text content of every table row, one row per line.
$rows = $html->find('tr');
foreach ($rows as $row) {
    echo $row->plaintext . PHP_EOL;
}

?>

Or select the 'td' cells directly:

<?php

include 'simple_html_dom.php';

$html = file_get_html('thetable.html');

// Print the text content of every table cell, one cell per line.
$cells = $html->find('td');
foreach ($cells as $cell) {
    echo $cell->plaintext . PHP_EOL;
}

?>

Answer 2:

Given an HTML document called xpathTables.html like this:

<html>
  <body>
    <table>
      <tbody>
        <tr><td class="body" valign="top"><a href="example"><b>DATA</b></a></td><td class="body" valign="top">Data_Text</td></tr>
      </tbody> 
    </table>

    <table>
      <tbody>
        <tr><th><div id="Data">Data</div></th><td>Data_Text_1</td><td>Data_Text_2</td></tr>
      </tbody>
    </table>

    <table>
      <tbody>
        <tr><td width="120"><a href="example" target="_blank">DATA</a></td><td>Data_Text</td></tr>
      </tbody>
    </table>
  </body>
</html>

And this PHP script:

<?php

$link = "xpathTables.html";

$html = file_get_contents($link);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$tables = $doc->getElementsByTagName('table');

$nodes  = $xpath->query('.//tbody/tr/td/a/b', $tables->item(0));
var_dump($nodes->item(0)->nodeValue);
$nodes  = $xpath->query('.//tbody/tr/td[@class="body"]', $tables->item(0));
var_dump($nodes->item(1)->nodeValue);

$nodes  = $xpath->query('.//tbody/tr/th/div[@id="Data"]', $tables->item(1));
var_dump($nodes->item(0)->nodeValue);
// Both td cells of the second table come from a single query.
$nodes  = $xpath->query('.//tbody/tr/td', $tables->item(1));
var_dump($nodes->item(0)->nodeValue);
var_dump($nodes->item(1)->nodeValue);

$nodes  = $xpath->query('.//tbody/tr/td/a', $tables->item(2));
var_dump($nodes->item(0)->nodeValue);
$nodes  = $xpath->query('.//tbody/tr/td', $tables->item(2));
var_dump($nodes->item(1)->nodeValue);

You get this output:

string(4) "DATA"
string(9) "Data_Text"
string(4) "Data"
string(11) "Data_Text_1"
string(11) "Data_Text_2"
string(4) "DATA"
string(9) "Data_Text"

I didn't fully understand your question, so I made this example to show all the text nodes your tables contain. If you are only interested in some of those nodes, pick the XPath queries that select them.

I included the table and tbody tags just to make the example more HTML-like.
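If the goal is simply "label plus values" for every row, the per-table queries above can be folded into one loop. The following is a minimal sketch, not the answer's original method: the helper name extractRows and the inline sample markup are assumptions made for illustration, and it relies only on the built-in DOMDocument and DOMXPath classes.

```php
<?php

// Hypothetical helper: returns one array per <tr>, holding the trimmed
// text of each <th>/<td> cell of that row in document order.
function extractRows(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    // '//table//tr' matches rows whether or not a <tbody> is present.
    foreach ($xpath->query('//table//tr') as $tr) {
        $cells = [];
        // Relative query: only the th/td children of the current row.
        foreach ($xpath->query('./th | ./td', $tr) as $cell) {
            $cells[] = trim($cell->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

// Sample markup: the three rows from the question, each in its own table.
$html = '<table><tr><td class="body" valign="top"><a href="example"><b>DATA</b></a></td>'
      . '<td class="body" valign="top">Data_Text</td></tr></table>'
      . '<table><tr><th><div id="Data">Data</div></th><td>Data_Text_1</td><td>Data_Text_2</td></tr></table>'
      . '<table><tr><td width="120"><a href="example" target="_blank">DATA</a></td><td>Data_Text</td></tr></table>';

foreach (extractRows($html) as $cells) {
    $label = array_shift($cells); // treat the first cell as the label
    echo $label, ': ', implode(', ', $cells), PHP_EOL;
}
```

Treating the first cell of each row as the label is a convention chosen for this sketch; it happens to fit all three tables in the question, but other layouts may need a different rule.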

Answer 3:

Use this single XPath expression:

/*/table/tr//text()[normalize-space()]

This selects any text node that does not consist solely of white-space characters and that is a descendant of any tr element that is a child of a table element that is a child of the top element of the document.

XSLT-based verification:

 <xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "/*/table/tr//text()[normalize-space()]"/>

. . . . . . .
  <xsl:for-each select=
    "/*/table/tr//text()[normalize-space()]">
    "<xsl:copy-of select="."/>"
  </xsl:for-each>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied against the following XML document:

<html>
 <table>
    <tr>
        <td class="body" valign="top">
            <a href="example">
                <b>DATA</b>
            </a>
        </td>
        <td class="body" valign="top">Data_Text</td>
    </tr>
 </table>

 <table>
    <tr>
        <th>
            <div id="Data">Data</div>
        </th>
        <td>Data_Text_1</td>
        <td>Data_Text_2</td>
    </tr>
 </table>

 <table>
    <tr>
        <td width="120">
            <a href="example" target="_blank">DATA</a>
        </td>
        <td>Data_Text</td>
    </tr>
 </table>
</html>

the XPath expression is evaluated and the selected text nodes are output twice: first as the direct result of the evaluation, where they appear concatenated, and then one node per line, each surrounded by quotes:

DATAData_TextDataData_Text_1Data_Text_2DATAData_Text

. . . . . . .

"DATA"

"Data_Text"

"Data"

"Data_Text_1"

"Data_Text_2"

"DATA"

"Data_Text"
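The same idea can be tried from PHP with DOMXPath. This is a rough sketch under a few assumptions: the helper name tableTexts and the sample string are invented for illustration, and the path is loosened from /*/table/tr to //table//tr because DOMDocument::loadHTML places the tables under a body element, and real pages often add tbody as well.

```php
<?php

// Hypothetical helper: returns the trimmed value of every non-blank
// text node that sits somewhere inside a table row.
function tableTexts(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $texts = [];
    // [normalize-space()] drops text nodes that contain only white space.
    foreach ($xpath->query('//table//tr//text()[normalize-space()]') as $node) {
        $texts[] = trim($node->nodeValue);
    }
    return $texts;
}

// Sample markup: the second table from the document above, indentation included.
$sample = <<<HTML
<html>
 <table>
    <tr>
        <th>
            <div id="Data">Data</div>
        </th>
        <td>Data_Text_1</td>
        <td>Data_Text_2</td>
    </tr>
 </table>
</html>
HTML;

print_r(tableTexts($sample));
```

For this sample the helper yields the three strings Data, Data_Text_1 and Data_Text_2; the indentation between the tags is filtered out by the normalize-space() predicate.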

Source: https://stackoverflow.com/questions/10369350/extract-data-from-html-table-row-column

Tags

php

regex

dom

xpath