How to extract the innermost table from an HTML file with the Html Agility Pack?

Question

I am parsing tabular information from an HTML file with the Html Agility Pack.

This already works for simple cases.

The problem arises when the table I want to extract is the innermost one, or when I don't know how deeply it is nested. There can be any number of nested tables, and from them I want to extract the table whose columns are named NAME and ADDRESS.

Example:

<table>
    <table>
           <tr><td>PHONE NO.</td><td>OTHER INFO.</td></tr>
           <tr><td>
              <table>
                 <tr><td>AMOUNT</td></tr>
                 <tr><td>50000</td></tr>
                 <tr><td>80000</td></tr>
              </table>
           </td></tr>
           <tr><td>
              <table>
                 <tr><td>
                     <table>
                         <tr><td>
                              <table>
                                 <tr><td> NAME </td><td>ADDRESS</td></tr>
                                 <tr><td> ABC  </td><td> kfks   </td></tr>
                                 <tr><td> BCD  </td><td> fdsa   </td></tr>
                              </table>
                         </td></tr>
                     </table>
                 </td></tr>
              </table>
           </td></tr>
        </table>
</table>

There are many tables, but I want to extract only the one with the column names NAME and ADDRESS. How should I do this?

Answer 1:

var table = doc.DocumentNode.SelectSingleNode("//table [not(descendant::table) and tr[1]/td[normalize-space()='ADDRESS'] ]");

Answer 2:

Load the document as an HtmlDocument. Then use an XPath query to find a table that contains no other tables and that has a td in its first row containing "NAME". Note that the string comparison is case-sensitive.

The XPath implementation is the standard .NET one from System.Xml.XPath, so any documentation about using XPath with XmlDocument will be applicable.

HtmlDocument doc = new HtmlDocument();
doc.Load("file.html");
HtmlNode el = (HtmlNode) doc.DocumentNode.SelectSingleNode("//table[not(descendant::table) and tr[1]/td['NAME' = normalize-space()]]");
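
Since the XPath engine is the standard .NET one, the same expression can be tried against an XmlDocument without any third-party package. The following is only a sketch under that assumption: the class name, method name, and sample markup are hypothetical, and the sample is well-formed XML, which real-world HTML often is not.

```csharp
using System;
using System.Xml;

public static class XPathWithXmlDocumentDemo
{
    // Hypothetical, well-formed stand-in for the nested tables in the question.
    public const string SampleXml =
        "<table><tr><td>" +
        "<table><tr><td>AMOUNT</td></tr><tr><td>50000</td></tr></table>" +
        "</td></tr><tr><td>" +
        "<table><tr><td> NAME </td><td>ADDRESS</td></tr>" +
        "<tr><td> ABC </td><td> kfks </td></tr></table>" +
        "</td></tr></table>";

    // Same query as in the answer: a table with no nested table
    // whose first row has a cell that normalizes to "NAME".
    public const string Query =
        "//table[not(descendant::table) and tr[1]/td['NAME' = normalize-space()]]";

    public static XmlNode FindNameTable(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        // SelectSingleNode returns null when nothing matches.
        return doc.SelectSingleNode(Query);
    }

    public static void Main()
    {
        XmlNode table = FindNameTable(SampleXml);
        Console.WriteLine(table == null ? "no match" : table.OuterXml);
    }
}
```

The outer table is rejected by not(descendant::table), and the AMOUNT table is rejected because none of its first-row cells normalizes to "NAME".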

If the position of the "NAME" column were fixed, you could test that cell directly, for example 'NAME' = normalize-space(tr[1]/td[1]) when it is always the first column.

To find a table based on several column names, without the innermost-table condition:

HtmlNode el = (HtmlNode) doc.DocumentNode.SelectSingleNode("//table[tr[1]/td['NAME' = normalize-space()] and tr[1]/td['ADDRESS' = normalize-space()]]");
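
The innermost-table condition and the two column-name conditions can also be combined into one query, after which the data rows can be read from the selected table. The following is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced; the class name, the ExtractRows helper, and the sample markup are hypothetical and not part of the original answers.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class InnermostTableDemo
{
    // Innermost table whose first row has both a NAME and an ADDRESS cell.
    private const string Query =
        "//table[not(descendant::table)" +
        " and tr[1]/td['NAME' = normalize-space()]" +
        " and tr[1]/td['ADDRESS' = normalize-space()]]";

    // Returns the trimmed cell texts of every row after the header row.
    public static List<string[]> ExtractRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var result = new List<string[]>();

        HtmlNode table = doc.DocumentNode.SelectSingleNode(Query);
        if (table == null) return result; // no matching table

        // SelectNodes may return null rather than an empty collection.
        HtmlNodeCollection rows = table.SelectNodes("tr[position() > 1]");
        if (rows == null) return result;

        foreach (HtmlNode row in rows)
        {
            HtmlNodeCollection cells = row.SelectNodes("td");
            if (cells == null) continue;

            var values = new string[cells.Count];
            for (int i = 0; i < cells.Count; i++)
                values[i] = HtmlEntity.DeEntitize(cells[i].InnerText).Trim();
            result.Add(values);
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical sample: one outer table wrapping the table of interest.
        string html =
            "<table><tr><td><table>" +
            "<tr><td> NAME </td><td>ADDRESS</td></tr>" +
            "<tr><td> ABC </td><td> kfks </td></tr>" +
            "<tr><td> BCD </td><td> fdsa </td></tr>" +
            "</table></td></tr></table>";

        foreach (string[] row in ExtractRows(html))
            Console.WriteLine(string.Join(" | ", row));
    }
}
```

The relative paths tr[1]/td and tr[position() > 1] assume that rows are direct children of the table. If the markup wraps rows in tbody, the paths would need to account for that.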

Source: https://stackoverflow.com/questions/2550512/how-to-extract-innermost-table-from-html-file-with-the-help-of-the-html-agility

Tags

.net

winforms

xpath

html-agility-pack