Question
(Edit: I still haven't found a way to solve this problem. The $crawler object seems ridiculous to work with. I just want to parse it for the text of a specific <td>; how hard is that? I can't serialize() the entire crawler object and turn the page's source code into a string either, or else I could just parse that string the hard way. Please help. I feel I've described the problem well below.)
Below I'm using Symfony, Goutte, and DomCrawler to scrape a web page. I've been trying to figure it out from other questions with no success, so now I'm just going to post all my code to make this as straightforward as possible.
I am able to get the page and the first bit of data I'm looking for. The first is a URL that is printed from JavaScript and lies within the onclick attribute of an a tag. The attribute is a long string, so I use preg_match to sift through it and get exactly what I need.
The next bit of data I need is some text within a <td> tag. The thing is, this web page has 10-20 different <table> tags, and there are no id="" or class="" attributes, so it's hard to isolate. So what I'm trying to do is search for the words "Event Title", then go to the next sibling <td> tag and extract the innerHTML of that, which will be the actual title.
The problem is that for the second part I can't seem to parse properly through the $crawler object. I don't understand: I ran a preg_match earlier on a serialize() version of the $crawler object, but for the bottom half I can't seem to parse through it properly.
$crawler = $client->request('GET', 'https://movies.randomjunk.com/events/EventServlet?ab=mov&eventId=154367');
$aurl = 'http://movies.randomjunk.com/r.htm?e=154367'; // event url beginning string
$gas = $overview->filter('a[onclick*="' . $aurl . '"]');
$string1 = serialize($gas->filter('a')->attr('onclick')); //TEST
$string1M = preg_match("/(?<=\')(.*?)(?=\')/", $string1, $finalURL);
$aString = $finalURL[0];
echo "<br><br>" . $aString . "<br><br>";
// IT WORKS UP TO HERE
// $title = $crawler->filterXPath('//td[. = "Event Title"]/following-sibling::td[1]')->each(funtion (Crawler $crawler, $i) {
// return $node->text();
// }); // No clue why, but this doesn't work.
$html = $overview->getNode(0)->ownerDocument->saveHTML();
$re = "/>Event\sTitle.*?<\\/td>.*?<td>\\K.*?(?=<\\/td>)/s";
$str = serialize($html);
print_r($str);
preg_match_all($re, $str, $matches);
$gas2 = $matches[0];
echo "<pre>";
print_r($gas2);
echo "</pre>";
My preg_match_all call just returns an empty array. I think it's a problem with searching the $crawler object, since it's made up of many nodes. I've been trying to convert it all to HTML and then run a preg_match on that, but it just refuses to work. I've added a few print_r statements, and they just return the whole web page.
Here's an example of some of the HTML inside the crawler object:
{lots of other html and tables}
<table>
<tr>
<td>Title</td>
<td>The Harsh Face of Mother Nature</td>
<td>The Harsh Face of Mother Nature</td>
</tr>
.
.
</table>
{lots of other html and tables}
And the goal is to parse through the entire page/$crawler object and get the title "The Harsh Face of Mother Nature".
I know this must be possible, but the only answer anyone wants to provide is a link to the domcrawler page which I've read about a thousand times at this point. Please help.
Answer 1:
Given the html fragment above I was able to come up with the XPath of:
//table/tr/td[.='Title']/following-sibling::td[1]
You can test the XPath against the HTML fragment you provided in any online XPath tester.
$html = '<table><tr><td>Event Title</td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr><tr><td>Event Title</td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr><tr><td>Event Title</td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr></table>';
$crawler = new Symfony\Component\DomCrawler\Crawler($html);
$query = "//table/tr/td[.='Event Title']/following-sibling::td[1]";
$crawler->filterXPath($query)->each(function($crawler, $i) {
echo $crawler->text() . PHP_EOL;
});
Which outputs:
The Harsh Face of Mother Nature
The Harsh Face of Mother Nature
The Harsh Face of Mother Nature
Update: Tested successfully with:
$html = '<html><table><tr><td>Event Title</td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr></table><table><tr><td>Event Title</td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr></table><table><tr><td>Event Title</td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr></table></html>';
Update: After being provided with sample html from the website I was able to get things to parse with the following XPath:
//td[normalize-space(text()) = 'Event Title']/following-sibling::td[1]
The real issue was the leading and trailing white space that was around "Event Title".
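To make the whitespace issue concrete, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath instead of DomCrawler, so it runs without Composer. The sample markup is an assumption: it only imitates a label cell padded with whitespace, since the real page's HTML isn't shown here. The same XPath strings should behave the same way when passed to filterXPath().

```php
<?php
// Hypothetical sample markup: the label cell is padded with whitespace,
// as the real page reportedly was.
$html = '<table><tr><td>
    Event Title
  </td><td>The Harsh Face of Mother Nature</td><td>The Harsh Face of Mother Nature</td></tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Exact comparison: the cell's string value still contains the padding,
// so it is not equal to 'Event Title'.
$exact = $xpath->query("//td[. = 'Event Title']/following-sibling::td[1]");

// normalize-space() strips leading/trailing whitespace and collapses runs
// of whitespace before the comparison is made.
$normalized = $xpath->query(
    "//td[normalize-space(text()) = 'Event Title']/following-sibling::td[1]"
);

echo 'exact matches: ', $exact->length, PHP_EOL;
echo 'normalized matches: ', $normalized->length, PHP_EOL;
foreach ($normalized as $cell) {
    echo trim($cell->textContent), PHP_EOL;
}
```

If the exact query finds nothing while the normalized one finds the title cell, stray whitespace in the label cell is the likely culprit on the real page as well.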
Answer 2:
All right, what you can do is add a class to your <td>:
<td class="mytitle">The Harsh Face of Mother Nature</td>
You can then use it to filter your crawler and get all your titles in an array, like this:
$titles = $crawler->filter('td.mytitle')->extract(array('_text'));
Here td.mytitle is a CSS selector: it selects every td with the mytitle class, and _text refers to the text inside each node.
This is easier and performs better than a regex.
I haven't tested this code, but it should work. You can get more help and information about the crawler here:
http://symfony.com/fr/doc/current/components/dom_crawler.html
Source: https://stackoverflow.com/questions/29282785/web-scrape-symfony2-impossible-challenge-crawler-parsing