问题
I have a code that reads an HTML file from my local web server localhost
and then converts it to XHTML
with tidy
. Then i load that XHTML
into my DOM
. the code looks like this
<?php
function getXHTML($html)
{
$options = array("output-html" => true,"quote-nbsp" => true, "drop-proprietary-attributes" => true,"drop-font-tags" => true,"drop-empty-paras" => true,"hide-comments" => true);
$tidy=new tidy();
$xhtml=$tidy->repairString($html,$options);
echo $xhtml;
return $xhtml;
}
$content = file_get_contents("http://localhost/filename.htm");
$page = new DOMDocument();
$xpath=new DOMXPath($page);
$content = getXHTML($content); // this is a tidy function to return XHTML
$page->loadHTML($content);
$totalPath = "//body/table[3]/tbody/tr[1]/td[4]";
$total = $xpath->query($totalPath);
echo $total->length; // this shows zero
?>
the contents of filename.htm
looks like this
<!-- saved from url=(0041)http://www.rtu.ac.in/results/reformat.php -->
<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<link rel="SHORTCUT ICON" href="http://www.rtu.ac.in/favicon.ico">
<link href="./Result - Rajasthan Technical University6_files/styleresults.css" rel="stylesheet" type="text/css">
<title>Result - Rajasthan Technical University</title>
</head>
<body>
<table width="773" cellpadding="5" cellspacing="0" align="center">
<tbody><tr height="60">
<td width="16%" height="60" valign="top"><font color="brown" size="+2"><img src="./Result - Rajasthan Technical University6_files/logo.jpg" width="100" height="102" border="0" align="right"> </font></td>
<td width="72%" height="60" align="center" valign="top"><p><font color="brown" size="+2"><strong>RAJASTHAN TECHNICAL UNIVERSITY </strong></font></p><font color="brown" size="+2">
<p><font size="+1"><strong>B.Tech -IVth SEMESTER -2010(Main) 16.5.2011</strong></font></p><font size="+1"> </font></font></td>
<td width="12%" height="80"><strong>www.rtu.ac.in</strong> </td>
</tr>
</tbody></table>
<br>
<br>
<table width="783" align="center" cellpadding="5" cellspacing="0" class="table">
<tbody>
<tr>
<td width="34%" align="center" valign="top" rowspan="2"><strong>Subject(s) Name </strong> </td>
<td width="10%" align="center" valign="top" colspan="1" rowspan="2"> <strong>Subject(s) Code </strong> </td>
<td align="center" valign="top" colspan="3" rowspan="1"><strong>Marks Obtained </strong> </td>
</tr>
<tr>
<td width="20%" align="center"><strong>Internal</strong> </td>
<td width="18%" align="center"><strong>Theory</strong> </td>
<td width="18%" align="center"> </td>
</tr>
<tr>
<td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-1</strong> </td>
<td width="10%" align="center" style=" border-bottom: 0px none transparent;">4551</td>
<td width="20%" align="center" style=" border-bottom: 0px none transparent;"> 16</td>
<td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 50</td>
<td width="18%" align="center" style=" border-bottom: 0px none transparent;"> </td>
</tr>
<tr>
<td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-2</strong> </td>
<td width="10%" align="center" style=" border-bottom: 0px none transparent;"> 4552</td>
<td width="20%" align="center" style=" border-bottom: 0px none transparent;"> 17</td>
<td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 61</td>
<td width="18%" align="center" style=" border-bottom: 0px none transparent;"> </td>
</tr>
<tr>
<td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-3</strong> </td>
<td width="10%" align="center" style=" border-bottom: 0px none transparent;">4553</td>
<td width="20%" align="center" style=" border-bottom: 0px none transparent;"> 19</td>
<td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 49</td>
<td width="18%" align="center" style=" border-bottom: 0px none transparent;"> </td>
</tr>
<tr>
<td align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-4</strong> </td>
<td align="center" style=" border-bottom: 0px none transparent;">4554</td>
<td align="center" style=" border-bottom: 0px none transparent;"> 14</td>
<td align="center" style=" border-bottom: 0px none transparent;"> 68</td>
<td align="center" style=" border-bottom: 0px none transparent;"> </td>
</tr>
<tr>
<td align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-5</strong> </td>
<td align="center" style=" border-bottom: 0px none transparent;">4555</td>
<td align="center" style=" border-bottom: 0px none transparent;"> 14</td>
<td align="center" style=" border-bottom: 0px none transparent;"> 36</td>
<td align="center" style=" border-bottom: 0px none transparent;"> </td>
</tr>
<tr>
<td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>SUBJECT-6</strong> </td>
<td width="10%" align="center" style=" border-bottom: 0px none transparent;">4556</td>
<td width="20%" align="center" style=" border-bottom: 0px none transparent;"> 19</td>
<td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 48</td>
<td width="18%" align="center" style=" border-bottom: 0px none transparent;"> </td>
</tr><tr>
<td align="center" style=" border-bottom: 0px none transparent;"> </td>
<td align="center" style=" border-bottom: 0px none transparent;"> </td>
<td align="center" style=" border-bottom: 0px none transparent;"> </td>
<td align="center" style=" border-bottom: 0px none transparent;"> <strong>Internal</strong> </td>
<td width="18%" align="center" style=" border-bottom: 0px none transparent;"><strong>Practical</strong> </td>
</tr>
<tr>
<td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-1</strong> </td>
<td width="10%" align="center" style=" border-bottom: 0px none transparent;">4174</td>
<td width="20%" align="center" style=" border-bottom: 0px none transparent;"> </td>
<td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 29</td>
<td width="18%" align="center" style=" border-bottom: 0px none transparent;">48</td>
</tr>
<tr>
<td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-2</strong> </td>
<td width="10%" align="center" style=" border-bottom: 0px none transparent;">4175</td>
<td width="20%" align="center" style=" border-bottom: 0px none transparent;"> </td>
<td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 16</td>
<td width="18%" align="center" style=" border-bottom: 0px none transparent;">26</td>
</tr>
<tr>
<td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-3</strong> </td>
<td width="10%" align="center" style=" border-bottom: 0px none transparent;">4171</td>
<td width="20%" align="center" style=" border-bottom: 0px none transparent;"> </td>
<td width="18%" align="center" style=" border-bottom: 0px none transparent;"> 15</td>
<td width="18%" align="center" style=" border-bottom: 0px none transparent;">27</td>
</tr>
<tr>
<td align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-4</strong> </td>
<td align="center" style=" border-bottom: 0px none transparent;">4172</td>
<td align="center" style=" border-bottom: 0px none transparent;"> </td>
<td align="center" style=" border-bottom: 0px none transparent;"> 17</td>
<td align="center" style=" border-bottom: 0px none transparent;">29</td>
</tr>
<tr>
<td align="center" style=" border-bottom: 0px none transparent;"><strong>PSUBJECT-5</strong> </td>
<td align="center" style=" border-bottom: 0px none transparent;">4173</td>
<td align="center" style=" border-bottom: 0px none transparent;"> </td>
<td align="center" style=" border-bottom: 0px none transparent;"> 29</td>
<td align="center" style=" border-bottom: 0px none transparent;">46</td>
</tr>
<tr>
<td width="34%" align="center" style=" border-bottom: 0px none transparent;"><strong>Disipline (Deca)</strong> </td>
<td width="10%" align="center" style=" border-bottom: 0px none transparent;">4176</td>
<td width="20%" align="center" style=" border-bottom: 0px none transparent;"> </td>
<td width="18%" align="center" style=" border-bottom: 0px none transparent;"> </td>
<td width="18%" align="center" style=" border-bottom: 0px none transparent;">46</td>
</tr>
<tr><td> </td><td> </td><td> </td><td> </td><td> </td></tr></tbody>
</table>
<br><table width="783" align="center" cellpadding="5" cellspacing="0" class="table">
<tbody><tr>
<td width="18%" align="center" valign="top"><strong>Practical Marks </strong> </td>
<td width="18%" align="center" valign="top">328</td>
<td width="19%" align="center" valign="top"><strong>Theory Marks </strong> </td>
<td width="19%" align="center" valign="top">411</td>
</tr>
<tr>
<td width="18%" align="center"><strong>Institute Code </strong> </td>
<td width="18%" align="center"> 1229 </td>
<td width="19%" align="center"><strong>DECCA </strong> </td>
<td width="19%" align="center">4176</td>
</tr>
<tr>
<td width="18%" align="center"><strong>Division </strong> </td>
<td width="18%" align="center"> PASS </td>
<td width="19%" align="center"><strong>Grand Total </strong> </td>
<td width="19%" align="center">739</td>
</tr>
</tbody></table>
<!-- Reformatter by Shashank Kumar Jain (CS, IIIrd Year, 2010-11) -->
<div id="csscan-wrapper" style="display: none; "><h2 id="csscan-header">element</h2><table id="csscan-table"><tbody><tr><th colspan="2" id="csscan-header-font" class="csscan-header">Font</th></tr><tr id="csscan-row-font-family"><td id="csscan-property-font-family" class="csscan-property">font-family</td><td id="csscan-value-font-family" class="csscan-value"></td></tr><tr id="csscan-row-font-size"><td id="csscan-property-font-size" class="csscan-property">font-size</td><td id="csscan-value-font-size" class="csscan-value"></td></tr><tr id="csscan-row-font-style"><td id="csscan-property-font-style" class="csscan-property">font-style</td><td id="csscan-value-font-style" class="csscan-value"></td></tr><tr id="csscan-row-font-variant"><td id="csscan-property-font-variant" class="csscan-property">font-variant</td><td id="csscan-value-font-variant" class="csscan-value"></td></tr><tr id="csscan-row-font-weight"><td id="csscan-property-font-weight" class="csscan-property">font-weight</td><td id="csscan-value-font-weight" class="csscan-value"></td></tr><tr id="csscan-row-letter-spacing"><td id="csscan-property-letter-spacing" class="csscan-property">letter-spacing</td><td id="csscan-value-letter-spacing" class="csscan-value"></td></tr><tr id="csscan-row-line-height"><td id="csscan-property-line-height" class="csscan-property">line-height</td><td id="csscan-value-line-height" class="csscan-value"></td></tr><tr id="csscan-row-text-decoration"><td id="csscan-property-text-decoration" class="csscan-property">text-decoration</td><td id="csscan-value-text-decoration" class="csscan-value"></td></tr><tr id="csscan-row-text-align"><td id="csscan-property-text-align" class="csscan-property">text-align</td><td id="csscan-value-text-align" class="csscan-value"></td></tr><tr id="csscan-row-text-indent"><td id="csscan-property-text-indent" class="csscan-property">text-indent</td><td id="csscan-value-text-indent" class="csscan-value"></td></tr><tr id="csscan-row-text-transform"><td id="csscan-property-text-transform" class="csscan-property">text-transform</td><td id="csscan-value-text-transform" class="csscan-value"></td></tr><tr id="csscan-row-white-space"><td id="csscan-property-white-space" class="csscan-property">white-space</td><td id="csscan-value-white-space" class="csscan-value"></td></tr><tr id="csscan-row-word-spacing"><td id="csscan-property-word-spacing" class="csscan-property">word-spacing</td><td id="csscan-value-word-spacing" class="csscan-value"></td></tr><tr id="csscan-row-color"><td id="csscan-property-color" class="csscan-property">color</td><td id="csscan-value-color" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-background" class="csscan-header">Background</th></tr><tr id="csscan-row-background-attachment"><td id="csscan-property-background-attachment" class="csscan-property">bg-attachment</td><td id="csscan-value-background-attachment" class="csscan-value"></td></tr><tr id="csscan-row-background-color"><td id="csscan-property-background-color" class="csscan-property">bg-color</td><td id="csscan-value-background-color" class="csscan-value"></td></tr><tr id="csscan-row-background-image"><td id="csscan-property-background-image" class="csscan-property">bg-image</td><td id="csscan-value-background-image" class="csscan-value"></td></tr><tr id="csscan-row-background-position"><td id="csscan-property-background-position" class="csscan-property">bg-position</td><td id="csscan-value-background-position" class="csscan-value"></td></tr><tr id="csscan-row-background-repeat"><td id="csscan-property-background-repeat" class="csscan-property">bg-repeat</td><td id="csscan-value-background-repeat" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-size" class="csscan-header">Box</th></tr><tr id="csscan-row-width"><td id="csscan-property-width" class="csscan-property">width</td><td id="csscan-value-width" class="csscan-value"></td></tr><tr id="csscan-row-height"><td id="csscan-property-height" class="csscan-property">height</td><td id="csscan-value-height" class="csscan-value"></td></tr><tr id="csscan-row-border-top"><td id="csscan-property-border-top" class="csscan-property">border-top</td><td id="csscan-value-border-top" class="csscan-value"></td></tr><tr id="csscan-row-border-right"><td id="csscan-property-border-right" class="csscan-property">border-right</td><td id="csscan-value-border-right" class="csscan-value"></td></tr><tr id="csscan-row-border-bottom"><td id="csscan-property-border-bottom" class="csscan-property">border-bottom</td><td id="csscan-value-border-bottom" class="csscan-value"></td></tr><tr id="csscan-row-border-left"><td id="csscan-property-border-left" class="csscan-property">border-left</td><td id="csscan-value-border-left" class="csscan-value"></td></tr><tr id="csscan-row-margin"><td id="csscan-property-margin" class="csscan-property">margin</td><td id="csscan-value-margin" class="csscan-value"></td></tr><tr id="csscan-row-padding"><td id="csscan-property-padding" class="csscan-property">padding</td><td id="csscan-value-padding" class="csscan-value"></td></tr><tr id="csscan-row-max-height"><td id="csscan-property-max-height" class="csscan-property">max-height</td><td id="csscan-value-max-height" class="csscan-value"></td></tr><tr id="csscan-row-min-height"><td id="csscan-property-min-height" class="csscan-property">min-height</td><td id="csscan-value-min-height" class="csscan-value"></td></tr><tr id="csscan-row-max-width"><td id="csscan-property-max-width" class="csscan-property">max-width</td><td id="csscan-value-max-width" class="csscan-value"></td></tr><tr id="csscan-row-min-width"><td id="csscan-property-min-width" class="csscan-property">min-width</td><td id="csscan-value-min-width" class="csscan-value"></td></tr><tr id="csscan-row-outline-color"><td id="csscan-property-outline-color" class="csscan-property">outline-color</td><td id="csscan-value-outline-color" class="csscan-value"></td></tr><tr id="csscan-row-outline-style"><td id="csscan-property-outline-style" class="csscan-property">outline-style</td><td id="csscan-value-outline-style" class="csscan-value"></td></tr><tr id="csscan-row-outline-width"><td id="csscan-property-outline-width" class="csscan-property">outline-width</td><td id="csscan-value-outline-width" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-position" class="csscan-header">Positioning</th></tr><tr id="csscan-row-position"><td id="csscan-property-position" class="csscan-property">position</td><td id="csscan-value-position" class="csscan-value"></td></tr><tr id="csscan-row-top"><td id="csscan-property-top" class="csscan-property">top</td><td id="csscan-value-top" class="csscan-value"></td></tr><tr id="csscan-row-bottom"><td id="csscan-property-bottom" class="csscan-property">bottom</td><td id="csscan-value-bottom" class="csscan-value"></td></tr><tr id="csscan-row-right"><td id="csscan-property-right" class="csscan-property">right</td><td id="csscan-value-right" class="csscan-value"></td></tr><tr id="csscan-row-left"><td id="csscan-property-left" class="csscan-property">left</td><td id="csscan-value-left" class="csscan-value"></td></tr><tr id="csscan-row-float"><td id="csscan-property-float" class="csscan-property">float</td><td id="csscan-value-float" class="csscan-value"></td></tr><tr id="csscan-row-display"><td id="csscan-property-display" class="csscan-property">display</td><td id="csscan-value-display" class="csscan-value"></td></tr><tr id="csscan-row-clear"><td id="csscan-property-clear" class="csscan-property">clear</td><td id="csscan-value-clear" class="csscan-value"></td></tr><tr id="csscan-row-z-index"><td id="csscan-property-z-index" class="csscan-property">z-index</td><td id="csscan-value-z-index" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-list" class="csscan-header">List</th></tr><tr id="csscan-row-list-style-image"><td id="csscan-property-list-style-image" class="csscan-property">list-style-image</td><td id="csscan-value-list-style-image" class="csscan-value"></td></tr><tr id="csscan-row-list-style-type"><td id="csscan-property-list-style-type" class="csscan-property">list-style-type</td><td id="csscan-value-list-style-type" class="csscan-value"></td></tr><tr id="csscan-row-list-style-position"><td id="csscan-property-list-style-position" class="csscan-property">list-style-position</td><td id="csscan-value-list-style-position" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-table" class="csscan-header">Table</th></tr><tr id="csscan-row-vertical-align"><td id="csscan-property-vertical-align" class="csscan-property">vertical-align</td><td id="csscan-value-vertical-align" class="csscan-value"></td></tr><tr id="csscan-row-border-collapse"><td id="csscan-property-border-collapse" class="csscan-property">border-collapse</td><td id="csscan-value-border-collapse" class="csscan-value"></td></tr><tr id="csscan-row-border-spacing"><td id="csscan-property-border-spacing" class="csscan-property">border-spacing</td><td id="csscan-value-border-spacing" class="csscan-value"></td></tr><tr id="csscan-row-caption-side"><td id="csscan-property-caption-side" class="csscan-property">caption-side</td><td id="csscan-value-caption-side" class="csscan-value"></td></tr><tr id="csscan-row-empty-cells"><td id="csscan-property-empty-cells" class="csscan-property">empty-cells</td><td id="csscan-value-empty-cells" class="csscan-value"></td></tr><tr id="csscan-row-table-layout"><td id="csscan-property-table-layout" class="csscan-property">table-layout</td><td id="csscan-value-table-layout" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-effects" class="csscan-header">Effects</th></tr><tr id="csscan-row-text-shadow"><td id="csscan-property-text-shadow" class="csscan-property">text-shadow</td><td id="csscan-value-text-shadow" class="csscan-value"></td></tr><tr id="csscan-row--webkit-box-shadow"><td id="csscan-property--webkit-box-shadow" class="csscan-property">-webkit-box-shadow</td><td id="csscan-value--webkit-box-shadow" class="csscan-value"></td></tr><tr id="csscan-row-border-radius"><td id="csscan-property-border-radius" class="csscan-property">border-radius</td><td id="csscan-value-border-radius" class="csscan-value"></td></tr><tr><th colspan="2" id="csscan-header-other" class="csscan-header">Other</th></tr><tr id="csscan-row-overflow"><td id="csscan-property-overflow" class="csscan-property">overflow</td><td id="csscan-value-overflow" class="csscan-value"></td></tr><tr id="csscan-row-cursor"><td id="csscan-property-cursor" class="csscan-property">cursor</td><td id="csscan-value-cursor" class="csscan-value"></td></tr><tr id="csscan-row-visibility"><td id="csscan-property-visibility" class="csscan-property">visibility</td><td id="csscan-value-visibility" class="csscan-value"></td></tr></tbody></table></div></body></html>
the XPath
above is correct as i have checked it with FirePath
. can anyone tell me what i am doing wrong?
回答1:
Try to use loadHTML($string) instead of loadXML
. From manual:
The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load.
Update 1
loadHTML
creates the same DOM tree in memory as loadXML
does. It only uses less strict parser. Here is example code with XPath:
<?php
$content = file_get_contents("1.html");
$page = new DOMDocument();
$page->loadHTML($content); // this will ignore most errors in formating
echo $page->saveHTML();
echo "=====\n";
$xpath = new DOMXPath($page); // use any "XML" parsing function
foreach ($xpath->query("//li[not(@id='3')]") as $elem) {
echo "[".trim($elem->textContent)."]\n";
}
Content of 1.html
file is:
<li id="1">item 1
<li id="2">item 2
<li id="3">item 3
<li id="4">item 4
Output will be:
<!DOCTYPE html PUBLIC "...">
<html><body>
<li id="1">item 1
</li>
<li id="2">item 2
</li>
<li id="3">item 3
</li>
<li id="4">item 4
</li>
</body></html>
=====
[item 1]
[item 2]
[item 4]
Update 2
You just missed initializing for $xpath
variable. I've also removed getXHTML
call, because it's not necessary:
$content = file_get_contents("2.html");
$page = new DOMDocument();
//$content=getXHTML($content); // no need this if you're using loadHTML
$page->loadHTML($content);
$totalPath = "//body/table[3]/tbody/tr[1]/td[4]";
$xpath = new DOMXPath($page); // creating $xpath object
$total = $xpath->query($totalPath);
echo "[",$total->length,"]";
回答2:
How much have you played with the PHP Tidy options? If the error you get refers to entities (specifically
) I wonder if setting numeric-entities "on" or playing with the value for preserve-entities would help.
Plan B: Try this. XPath worked even with poorly formed html files.
<?php
$oldSetting = libxml_use_internal_errors( true );
libxml_clear_errors();
$html = new DOMDocument();
$html->loadHtmlFile(
'myHtmlFile.html');
$xpath = new DOMXPath( $html );
$test = $xpath->query( "//div[@id='mydiv']" );
$div = $test->item(0);
echo $div->getAttribute('style');
libxml_clear_errors();
libxml_use_internal_errors( $oldSetting );
?>
回答3:
the answer to the above question somewhat tricky. my original code looked something like
$xpath=new DOMXPath($page);
..
...
...
$page->loadHTML($content);
..
...
$totalPath = "//body/table[3]/tbody/tr[1]/td[4]";
$total = $xpath->query($totalPath);
...
...
what happens above is that $xpath
is created on an empty document because the html
is still not loaded in the Dom
. so when xpath ran any query it ran the query on an empty document.
now i changed the order of the 2 statements
...
...
$page->loadHTML($content);
$xpath=new DOMXPath($page);
...
...
$totalPath = "//body/table[3]/tbody/tr[1]/td[4]";
$total = $xpath->query($totalPath);
now it works because $xpath
is created on a nonempty document
来源:https://stackoverflow.com/questions/6221979/xpath-not-working-on-the-html