Need help scraping webpage — getting specific content…

社会主义新天地 submitted on 2019-12-25 08:49:33

Question


I have a table whose number of columns can change depending on the configuration of the scraped page (which I have no control over). I want to get only the information from one specific column, identified by the column's heading.

Here is a simplified table:

<table>
<tbody>
<tr class='header'>
    <td>Image</td>
    <td>Name</td>
    <td>Time</td>
</tr>
<tr>
    <td><img src='someimage.png' /></td>
    <td>Name 1</td>
    <td>13:02</td>
</tr>
<tr>
    <td><img src='someimage.png' /></td>
    <td>Name 2</td>
    <td>13:43</td>
</tr>
<tr>
    <td><img src='someimage.png' /></td>
    <td>Name 3</td>
    <td>14:53</td>
</tr>
</tbody>
</table>

I want to extract only the names (column 2) from the table. However, as previously stated, the column order cannot be known in advance. The Image column might not be there, for example, in which case the column I want would be the first one.

I was wondering if there's any way to do this with DomDocument/DomXPath. Perhaps search for the string "Name" in the first tr, find out which column index it is, and then use that index to get the info. A less elegant solution would be to check whether the first column has an img tag, in which case the image column is first, so we can throw that away and use the next one.

I've been looking at this for about an hour and a half, but I'm not familiar with DomDocument functions and manipulation, and I'm having a lot of trouble with this one.


Answer 1:


Simple HTML DOM Parser may be useful here; its manual covers the details. Basically, you could use something like:

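include 'simple_html_dom.php'; // Simple HTML DOM Parser library file

$url = "file url";
$html = file_get_html($url);
// All cells of the row marked class='header'
$header = $html->find('tr.header td');
$num = null; // stays null if no matching heading is found
$i = 0;
foreach ($header as $element){
 // plaintext is the cell's text with tags stripped
 if (trim($element->plaintext) == 'Image') { $num = $i; }
 $i++;
}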

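This gives the zero-based index ($num) of the Image column. The same loop, comparing against 'Name' instead, gives the index of the column you want to extract. You can add further code to build on this.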
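Since the question asks about DomDocument/DomXPath specifically, here is a minimal sketch of the same idea using PHP's built-in DOM extension instead of Simple HTML DOM Parser. It assumes the header row always carries class='header' as in the sample table; the function name columnValues and the inline sample markup are hypothetical and only there for illustration.

```php
<?php
// Sample markup from the question; in practice this would be the fetched page.
$html = <<<'HTML'
<table>
<tbody>
<tr class='header'>
    <td>Image</td>
    <td>Name</td>
    <td>Time</td>
</tr>
<tr>
    <td><img src='someimage.png' /></td>
    <td>Name 1</td>
    <td>13:02</td>
</tr>
<tr>
    <td><img src='someimage.png' /></td>
    <td>Name 2</td>
    <td>13:43</td>
</tr>
</tbody>
</table>
HTML;

// Hypothetical helper: returns the text of every cell in the column
// whose header cell equals $heading, or an empty array if there is none.
function columnValues(string $html, string $heading): array
{
    libxml_use_internal_errors(true); // scraped HTML is rarely fully valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Step 1: find the 1-based position of the heading in the header row.
    $position = null;
    $i = 1;
    foreach ($xpath->query("//tr[@class='header']/td") as $cell) {
        if (trim($cell->textContent) === $heading) {
            $position = $i;
            break;
        }
        $i++;
    }
    if ($position === null) {
        return [];
    }

    // Step 2: take the cell at that position from every non-header row.
    $query = sprintf("//tr[not(@class='header')]/td[%d]", $position);
    $values = [];
    foreach ($xpath->query($query) as $cell) {
        $values[] = trim($cell->textContent);
    }
    return $values;
}

$names = columnValues($html, 'Name');
print_r($names);
```

Because the position is looked up from the header row on every call, the same code should still pick the right cells if the Image column is missing and Name becomes the first column.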

PS: An easy way to collect all image sources:

$imageUrl = []; // collected src attribute values
$images = $html->find('tr td img');
foreach ($images as $image){
 $imageUrl[] = $image->src;
}


Source: https://stackoverflow.com/questions/6862581/need-help-scraping-webpage-getting-specific-content
