Make a JavaScript-aware Crawler

Question

I want to write a script that crawls a website and returns the locations of all the banners shown on each page.

The banners usually come from known domains. However, they are not in the HTML as a plain image or .swf file; most of the time, JavaScript is used to display the banner.

So if a .swf file or an image file is loaded from a banner domain, the script should return that URL.

Is that possible? And roughly how could I do it?

Ideally it would also return the landing page of the ad. How could I do that?

Answer 1:

You could use Selenium to open the pages in a real browser and then access the DOM.
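As a rough sketch of the Selenium approach (in Python rather than PHP, and assuming the Python Selenium bindings and a Chrome driver are installed), the browser can be asked which resources it actually loaded, and that list can then be filtered by domain and file type. The names `BANNER_DOMAINS`, `is_banner_url`, and `find_banner_urls` and the example domains are hypothetical, not part of any library.

```python
from urllib.parse import urlparse

# Hypothetical list of ad-serving domains; replace with the domains you track.
BANNER_DOMAINS = {"ads.example.com", "banners.example.net"}
MEDIA_EXTENSIONS = (".swf", ".gif", ".jpg", ".jpeg", ".png")


def is_banner_url(url, banner_domains=BANNER_DOMAINS):
    """Return True if url is an image or .swf file served from a banner domain."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    # Match the domain itself or any of its subdomains.
    on_banner_domain = any(
        host == d or host.endswith("." + d) for d in banner_domains
    )
    return on_banner_domain and parsed.path.lower().endswith(MEDIA_EXTENSIONS)


def find_banner_urls(page_url, banner_domains=BANNER_DOMAINS):
    """Load page_url in a real browser and return the banner URLs it fetched."""
    # Imported here so the filter above can be used without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Chrome()
    try:
        driver.get(page_url)
        # Ask the browser for every resource it loaded, including ones
        # requested by JavaScript after the initial HTML was parsed.
        loaded = driver.execute_script(
            "return performance.getEntriesByType('resource')"
            ".map(function (e) { return e.name; });"
        )
    finally:
        driver.quit()
    return [u for u in loaded if is_banner_url(u, banner_domains)]
```

Calling `find_banner_urls("http://example.org/")` would open a browser window, load the page, and return the matching URLs. One caveat: this only lists resources loaded by the top-level page, so banners rendered inside an iframe would probably need the same check run inside that frame.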

PhantomJS might also be worth a look: it's a headless browser built on WebKit (the engine behind Safari and, originally, Chrome).

However, none of those solutions are pure PHP. If that's a requirement, you'll probably have to write your own JavaScript engine in PHP (which is not something I'd ask my worst enemy to do ;))

Answer 2:

To get the output of the JavaScript you will need a JavaScript engine (such as Google's V8 engine). V8 is written in C++, but there are some resources that explain how to embed it in PHP.

With that said, you have to study the output "by hand" and determine exactly what can be scraped and how to identify it. Once you've identified some common syntax for the advertisement banners, you can write a script to extract the banner and the landing page it references.
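To illustrate what such an extraction script might look like, here is a minimal Python sketch. It assumes the "common syntax" turned out to be a banner image wrapped in a link (`<a href=...><img src=...></a>`), and that you already have the HTML as rendered after the JavaScript ran. The class and function names and the example domain are hypothetical; real ad markup will likely need different rules.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical banner domain; replace with the domains you track.
BANNER_DOMAINS = {"ads.example.com"}


class BannerLinkParser(HTMLParser):
    """Collect (banner_url, landing_url) pairs from linked banner images."""

    def __init__(self, banner_domains):
        super().__init__()
        self.banner_domains = banner_domains
        self.current_href = None  # href of the <a> we are inside, if any
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self.current_href = attrs.get("href")
        elif tag in ("img", "embed"):
            src = attrs.get("src") or ""
            host = (urlparse(src).hostname or "").lower()
            # Keep only media served from a banner domain or its subdomains.
            if any(host == d or host.endswith("." + d)
                   for d in self.banner_domains):
                self.pairs.append((src, self.current_href))

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None


def extract_banners(rendered_html, banner_domains=BANNER_DOMAINS):
    """Return a list of (banner_url, landing_url or None) tuples."""
    parser = BannerLinkParser(banner_domains)
    parser.feed(rendered_html)
    parser.close()
    return parser.pairs
```

A banner that is not wrapped in a link is returned with `None` as its landing URL. The `href` on an ad is often a click-tracking URL on the ad server rather than the final landing page, so reaching the real landing page would probably mean following its redirects.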

None of this is easy work, but if you have an example of an ad you'd like to collect then I can give you more advice.

Source: https://stackoverflow.com/questions/8326301/make-a-javascript-aware-crawler

Tags

php

web-crawler

ads