Edit: Why the minus one?
What I am trying to do is the following:
When you're building a scraper, you can create your own classes that fit what you need to do in your domain. A good start is your own pair of request and response classes.
Your own request class lets you implement the cURL request the way you need it. Your own response class can help you access and parse the returned HTML.
This is a simple usage example of some classes I've created for a demo:
# simple get request
$request = new MyRequest('http://hakre.wordpress.com/');
$response = new MyResponse($request);
foreach($response->xpath('//div[@id="container"]//div[contains(concat(" ", normalize-space(@class), " "), " post ")]') as $node)
{
if (!$node->h2->a) continue;
echo $node->h2->a, "\n<", $node->h2->a['href'] ,">\n\n";
}
It returns my blog posts:
Will Automattic join Dec 29 move away from GoDaddy day?
<http://hakre.wordpress.com/2011/12/23/will-automattic-join-dec-29-move-away-from-godaddy-day/>
PHP UTF-8 string Length
<http://hakre.wordpress.com/2011/12/13/php-utf-8-string-length/>
Title belongs into Head
<http://hakre.wordpress.com/2011/11/02/title-belongs-into-head/>
...
Sending a GET request is then easy as pie, and the response can be accessed with an XPath expression (here via SimpleXML). XPath is useful for selecting the token from the form field because it lets you query the document's data more easily than a regular expression does.
Sending a POST request was the next thing to build. I tried writing a login script for my blog and it turned out to work quite well. I needed to parse response headers as well, so I added some more routines to my request and response classes.
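To illustrate selecting the token with XPath rather than a regular expression, here is a minimal sketch using PHP's bundled DOM extension. The markup, the `extractToken` function name and the token value are assumptions made up for the example; the real login form will look different.

```php
<?php
// Hypothetical login page markup; the real form will differ.
$html = <<<'HTML'
<html><body>
<form action="/login" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="user">
</form>
</body></html>
HTML;

/**
 * Return the value of the <input name="token"> field,
 * or null if the field is missing or its value is empty.
 */
function extractToken(string $html): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parse warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // string(...) evaluates to '' when no node matches
    $value = $xpath->evaluate('string(//input[@name="token"]/@value)');
    return $value === '' ? null : $value;
}

echo extractToken($html), "\n"; // prints "abc123"
```

Unlike a regular expression, the XPath query does not depend on attribute order or on the exact whitespace and quoting in the markup.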
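Parsing response headers could look roughly like the sketch below. It assumes the raw response contains a single header block followed by the body, which is what cURL returns when `CURLOPT_HEADER` is enabled and no redirects are followed. The function name `parseRawResponse` and the sample response are made up for illustration and are not taken from the gist.

```php
<?php
/**
 * Split a raw HTTP response (headers + body) into status line,
 * headers and body. Assumes exactly one header block.
 */
function parseRawResponse(string $raw): array
{
    // Headers and body are separated by the first empty line
    $parts = explode("\r\n\r\n", $raw, 2);
    $headerBlock = $parts[0];
    $body = $parts[1] ?? '';

    $lines = explode("\r\n", $headerBlock);
    $status = array_shift($lines);

    $headers = [];
    foreach ($lines as $line) {
        if (strpos($line, ':') === false) {
            continue; // skip malformed lines
        }
        list($name, $value) = explode(':', $line, 2);
        // Header names are case-insensitive, and a header such as
        // Set-Cookie may appear more than once, so collect lists.
        $headers[strtolower(trim($name))][] = trim($value);
    }

    return ['status' => $status, 'headers' => $headers, 'body' => $body];
}

// Hypothetical response of a successful login
$raw = "HTTP/1.1 302 Found\r\n"
     . "Location: /wp-admin/\r\n"
     . "Set-Cookie: a=1; path=/\r\n"
     . "Set-Cookie: b=2; path=/\r\n"
     . "\r\n"
     . "redirecting";

$parsed = parseRawResponse($raw);
echo $parsed['status'], "\n";
echo $parsed['headers']['location'][0], "\n";
echo count($parsed['headers']['set-cookie']), " cookies\n";
```

If cURL is told to follow redirects, the raw response contains one header block per hop, so this sketch would need to be extended for that case.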
# simple post request
$request = new MyRequest('https://example.wordpress.com/wp-login.php');
$postFields = array(
'log' => 'username',
'pwd' => 'password',
);
$request->setPostFields($postFields);
$response = new MyResponse($request->returnHeaders(1)->execute());
echo (string) $response; # output to view headers
For your scenario you might want to edit your own request class to better suit what you need. Mine already uses cookies, as you're using them too. Code based on these classes for your scenario could look like this:
# input values
$url = '<schoolsite>';
$user = '<number>';
$password = '<secret>';
# execute the first get request to obtain token
$response = new MyResponse(new MyRequest($url));
$token = (string) $response->xpath('//input[@name="token"]/@value');
# execute the second login post request
$request = new MyRequest($url);
$postFields = array(
'user' => $user,
'password' => $password,
'token' => $token
);
$request->setPostFields($postFields)->execute();
Demo and code as gist.
To improve this further, the next step is to create a class for the "school service" that you use to fetch the schedule:
class MySchoolService
{
private $url, $user, $pass;
private $isLoggedIn;
public function __construct($url, $user, $pass)
{
$this->url = $url;
...
}
public function getSchedule()
{
$this->ensureLogin();
# your code to obtain the schedule, e.g. in form of an array.
$schedule = ...
return $schedule;
}
private function ensureLogin($reuse = TRUE)
{
if ($reuse && $this->isLoggedIn) return;
# execute the first get request to obtain token
$response = new MyResponse(new MyRequest($this->url));
$token = (string) $response->xpath('//input[@name="token"]/@value');
# execute the second login post request
$request = new MyRequest($this->url);
$postFields = array(
'user' => $this->user,
'password' => $this->pass,
'token' => $token
);
$request->setPostFields($postFields)->execute();
$this->isLoggedIn = TRUE;
}
}
After you've nicely wrapped the request/response logic into your MySchoolService class, you only need to instantiate it with the proper configuration and you can easily use it inside your website:
$school = new MySchoolService('<schoolsite>', '<number>', '<secret>');
$schedule = $school->getSchedule();
Your main script only uses the MySchoolService.
The MySchoolService takes care of making use of MyRequest and MyResponse objects.
MyRequest takes care of doing HTTP requests (here with cURL) with cookies and such.
MyResponse helps a bit with parsing HTTP responses.
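To make the role of the request class more concrete, here is a minimal sketch of what such a class might look like. It is not the code from the gist: the class name `SketchRequest`, the method `getCurlOptions()`, the default cookie file name and the URL are all assumptions made for this example.

```php
<?php
// Hypothetical stand-in for a request class; NOT the code from the gist.
class SketchRequest
{
    private $url;
    private $cookieFile;
    private $postFields = null;

    public function __construct(string $url, string $cookieFile = 'cookies.txt')
    {
        $this->url = $url;
        $this->cookieFile = $cookieFile;
    }

    public function setPostFields(array $fields): self
    {
        $this->postFields = $fields;
        return $this; // allow chaining
    }

    /** Build the cURL options without sending anything. */
    public function getCurlOptions(): array
    {
        $options = [
            CURLOPT_URL            => $this->url,
            CURLOPT_RETURNTRANSFER => true,              // return the body instead of printing it
            CURLOPT_COOKIEFILE     => $this->cookieFile, // read cookies saved by earlier requests
            CURLOPT_COOKIEJAR      => $this->cookieFile, // where cookies are written when the handle is closed
        ];
        if ($this->postFields !== null) {
            $options[CURLOPT_POST] = true;
            $options[CURLOPT_POSTFIELDS] = http_build_query($this->postFields);
        }
        return $options;
    }

    /** Send the request and return the response body. */
    public function execute(): string
    {
        $ch = curl_init();
        curl_setopt_array($ch, $this->getCurlOptions());
        $body = curl_exec($ch);
        if ($body === false) {
            $error = curl_error($ch);
            curl_close($ch);
            throw new RuntimeException('cURL error: ' . $error);
        }
        curl_close($ch);
        return $body;
    }
}

// Build (but do not send) a login request with placeholder credentials
$request = (new SketchRequest('https://school.example/login'))
    ->setPostFields(['user' => '<number>', 'password' => '<secret>']);
$options = $request->getCurlOptions();
```

Because every request reads from and writes to the same cookie file, the session cookie set by the first GET request is sent along with the later login POST.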
Compare this with a standard internet browser:
Browser: Handles cookies and sessions, does HTTP requests and parses responses.
MySchoolService: Handles cookies and sessions for your school, does HTTP requests and parses responses.
So you now have a school browser in your script that does what you want. If you need more options, you can easily extend it.
I hope this is helpful. The starting point was to avoid writing the same lines of cURL code over and over again, and to give you a better interface for parsing return values. The MySchoolService is some sugar on top that makes things easy to deal with in your own website or application code.
What's the error message you get? Independently of that, your school's website might check the Referer header to make sure that the request is coming from (an application pretending to be...) its login page.
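If the site does check that header, cURL can send it via `CURLOPT_REFERER`. The sketch below only builds the option array. The URLs, the function name `buildLoginOptions` and the POST body are placeholders, and whether the school site actually checks the header is a guess.

```php
<?php
// Hypothetical URLs; replace with the real login page and form target.
$loginPage  = 'https://school.example/login';
$formTarget = 'https://school.example/login';

/**
 * Options for a login POST that claims to come from the login page.
 */
function buildLoginOptions(string $formTarget, string $referer, string $postBody): array
{
    return [
        CURLOPT_URL            => $formTarget,
        CURLOPT_REFERER        => $referer,  // sent as the "Referer" request header
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $postBody,
        CURLOPT_RETURNTRANSFER => true,
    ];
}

$options = buildLoginOptions($formTarget, $loginPage, 'user=12345&token=abc123');

// To send the request:
// $ch = curl_init();
// curl_setopt_array($ch, $options);
// $html = curl_exec($ch);
```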
This is how I solved it. The problem was probably the 'not-using-cookies' part. Still this is probably 'ugly' code, so any improvements are welcome!
// This part is for retrieving the token from the hidden field.
// To be honest, I have no idea what the cookie lines actually do, but it works.
$getToken= curl_init();
curl_setopt($getToken, CURLOPT_URL, '<schoolsite>'); // Set the link
curl_setopt($getToken, CURLOPT_COOKIEJAR, 'cookies.txt'); // Magic
curl_setopt($getToken, CURLOPT_COOKIEFILE, 'cookies.txt'); // Magic
curl_setopt($getToken, CURLOPT_RETURNTRANSFER, 1); // Return only as a string
$data = curl_exec($getToken); // Perform action
// Close the connection if there are no errors
if(curl_errno($getToken)){print curl_error($getToken);}
else{curl_close($getToken);}
// Use a regular expression to fetch the token
$regex = '/name="token" value="(.*?)"/';
preg_match($regex,$data,$match);
// Put the login info and the token in a URL-encoded POST body string
$postfield = 'token=' . urlencode($match[1]) . '&user=<number>&paswoord=<mine>';
echo($postfield);
// This part is for logging in and getting the data.
$site = curl_init();
curl_setopt($site, CURLOPT_URL, '<schoolsite>');
curl_setopt($site, CURLOPT_COOKIEJAR, 'cookies.txt'); // Magic
curl_setopt($site, CURLOPT_COOKIEFILE, 'cookies.txt'); // Magic
curl_setopt($site, CURLOPT_POST, 1); // Use POST (not GET)
curl_setopt($site, CURLOPT_POSTFIELDS, $postfield); // Set the POST body
$forevil_uuh_no_GOOD_purposes = curl_exec($site); // Output the results
// Close connection if no errors
if(curl_errno($site)){print curl_error($site);}
else{curl_close($site);}
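One possible improvement is to build the POST body with `http_build_query()` instead of by hand, so that special characters in the token or password are URL-encoded. The values below are placeholders chosen to show the encoding; the field names follow the snippet above.

```php
<?php
// Placeholder values; a real token may contain characters that need encoding.
$token = 'a+b/c=';

$fields = [
    'token'    => $token,
    'user'     => '12345',
    'paswoord' => 's3cret&more',
];

// Encodes every value and joins the pairs with "&"
$postfield = http_build_query($fields);

echo $postfield, "\n";
// prints: token=a%2Bb%2Fc%3D&user=12345&paswoord=s3cret%26more
```

The resulting string can be passed to `CURLOPT_POSTFIELDS` as before. Passing the array itself also works, but cURL then sends the data as `multipart/form-data` rather than as a URL-encoded form.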