Question
I want to crawl some specific values (e.g. news text) from a website (which is not my own).
file_get_contents() is not working, probably blocked by php.ini.
So I tried to do it with curl. The problem is that all I get is the redirection text from Cloudflare.
My crawler should do something like:
go to page -> wait the 5secs cloudflare redirect -> curl the page.
Any ideas how to crawl the page after the Cloudflare waiting time? (in PHP)
Edit: I have tried a lot of things, but the problem is still the same.
More specifically: it only crawls the Cloudflare redirect page. (I'm getting a page which redirects to the host, with Cloudflare in front. When I run curl on localhost, the redirect resolves against localhost, so the redirect obviously does not work.)
Is there no way to start saving the returned data after 5 seconds of "curling"?
Answer 1:
"go to page -> wait the 5secs cloudflare redirect -> curl the page."
The 5 second interstitial page actually requires that JavaScript and cookies are enabled before a visitor can pass the check, which probably won't work if you're using a crawler or bot to access the site.
Answer 2:
First, you should check how a normal browser behaves on this site: which redirects it follows and which cookies it receives.
Then you need to set up a curl script that collects all cookies in a "cookie jar" and automatically follows redirects.
Then you should run some tests.
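The cookie-jar and redirect setup described above could look roughly like the following. This is only a minimal sketch: the helper names build_curl_options() and fetch_page() are made up for illustration, and the cookie file path and user-agent string are placeholders you would choose yourself. It does not execute JavaScript, so on its own it may still only return the interstitial page.

```php
<?php
// Hypothetical helper: builds the curl options for a cookie-aware,
// redirect-following request. Names and values are illustrative only.
function build_curl_options(string $cookieJar, string $userAgent): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,       // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,       // follow HTTP redirects automatically
        CURLOPT_MAXREDIRS      => 10,         // avoid endless redirect loops
        CURLOPT_COOKIEJAR      => $cookieJar, // write received cookies to this file
        CURLOPT_COOKIEFILE     => $cookieJar, // send cookies from this file
        CURLOPT_USERAGENT      => $userAgent, // placeholder browser-like user agent
        CURLOPT_TIMEOUT        => 30,         // overall timeout in seconds
    ];
}

// Hypothetical helper: fetches a page with the options above.
function fetch_page(string $url, string $cookieJar, string $userAgent): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, build_curl_options($cookieJar, $userAgent));
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('curl failed: ' . curl_error($ch));
    }
    return $body;
}
```

Calling fetch_page() with the target URL, a writable cookie file (for example one under sys_get_temp_dir()), and a browser-like user-agent string returns the response body as a string. Keeping the option list in its own function makes it easy to inspect or adjust while testing.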
Hope this helps.
Note:
Cloudflare has good infrastructure to block people like you. They could present a CAPTCHA challenge or something similar.
Also, a good system administrator will sooner or later find out what you are doing and block your IP or your user agent.
Answer 3:
You should use PhantomJS and call it from PHP:
echo shell_exec('phantomjs example.js');
example.js:
var page = require('webpage').create();
var url = 'http://www.google.com/';
page.open(url, function (status) {
    // Print the rendered HTML; PHP captures it through shell_exec().
    console.log(page.content);
    phantom.exit();
});
Source: https://stackoverflow.com/questions/31182100/php-crawl-a-website-which-is-using-cloudflare