Is there a PHP equivalent of Perl's WWW::Mechanize?

Submitted by 痴心易碎 on 2019-11-26 12:22:24
troelskn

SimpleTest's ScriptableBrowser can be used independently of the testing framework. I've used it for numerous automation jobs.

I feel compelled to answer this, even though it's an old post. I've been working with PHP curl a lot, and it is nowhere near comparable to something like WWW::Mechanize, which I am switching to (I think I'm going to go with the Ruby implementation). Curl is outdated in that it requires too much "grunt work" to automate anything. The SimpleTest scriptable browser looked promising, but in testing it wouldn't work on most of the web forms I tried it on. Honestly, I think PHP is lacking in this category of scraping and web automation, so it's best to look at a different language. I just wanted to post this since I have spent countless hours on this topic, and maybe it will save someone else some time in the future.

It's 2016 now and there's Mink. It supports several engines, from a headless pure-PHP "browser" (without JavaScript), through Selenium (which needs a browser like Firefox or Chrome), to a headless "browser.js" in NPM, which DOES support JavaScript.

moo

Try looking in the PEAR library. If all else fails, create an object wrapper for curl.

You can do something simple like this:

class curl {
    private $resource;

    public function __construct($url) {
        $this->resource = curl_init($url);
    }

    public function __call($function, array $params) {
        array_unshift($params, $this->resource);
        return call_user_func_array("curl_$function", $params);
    }
}
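
A usage sketch for the wrapper above, repeated here under the hypothetical name CurlWrapper so the snippet runs on its own. The point is that any method call is forwarded to the matching curl_* function with the handle as the first argument. The URL is a placeholder, and the actual request is left commented out so nothing touches the network by default.

```php
<?php
// Same idea as the class above, renamed (hypothetically) to CurlWrapper
// so that this snippet is self-contained.
class CurlWrapper {
    private $resource;

    public function __construct($url) {
        $this->resource = curl_init($url);
    }

    // $w->setopt(...) becomes curl_setopt($handle, ...), and so on.
    public function __call($function, array $params) {
        array_unshift($params, $this->resource);
        return call_user_func_array("curl_$function", $params);
    }
}

$page = new CurlWrapper('http://example.com/');              // placeholder URL
$configured = $page->setopt(CURLOPT_RETURNTRANSFER, true);   // curl_setopt()

// Uncomment to actually perform the request:
// $html = $page->exec();    // curl_exec()
// $page->close();           // curl_close()
```

This only saves typing the handle each time; cookies, form parsing and link following would still have to be built on top.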

Try one of the following:

(Yes, it's ZendFramework code, but it doesn't make your class slower using it since it just loads the required libs.)

Curl is the way to go for simple requests. It runs cross-platform, has a PHP extension, and is widely adopted and tested.

I created a nice class that can GET a URL, or POST an array of data (INCLUDING FILES!) to a URL, by just calling CURLHandler::Get($url) or CURLHandler::Post($url, $data). There's an optional HTTP user authentication option too :)

/**
 * CURLHandler handles simple HTTP GETs and POSTs via Curl 
 * 
 * @package Pork
 * @author SchizoDuckie
 * @copyright SchizoDuckie 2008
 * @version 1.0
 * @access public
 */
class CURLHandler
{

    /**
     * CURLHandler::Get()
     * 
     * Executes a standard GET request via Curl.
     * Static function, so that you can use: CurlHandler::Get('http://www.google.com');
     * 
     * @param string $url url to get
     * @return string HTML output
     */
    public static function Get($url)
    {
       return self::doRequest('GET', $url);
    }

    /**
     * CURLHandler::Post()
     * 
     * Executes a standard POST request via Curl.
     * Static function, so you can use CURLHandler::Post('http://www.google.com', array('q'=>'StackOverFlow'));
     * If you want to send a file via POST (to e.g. PHP's $_FILES), pass a CURLFile instance as the value.
     * (PHP 5.5+; the old "@" prefix on the value no longer works as of PHP 7.)
     * @param string $url url to post data to
     * @param array $vars Array with key=>value pairs to post.
     * @param array|false $auth Optional array with 'username' and 'password' keys for HTTP authentication
     * @return string HTML output
     */
    public static function Post($url, $vars, $auth = false)
    {
        return self::doRequest('POST', $url, $vars, $auth);
    }

    /**
     * CURLHandler::doRequest()
     * This is what actually does the request
     * <pre>
     * - Create Curl handle with curl_init
     * - Set options like CURLOPT_URL, CURLOPT_RETURNTRANSFER and CURLOPT_HEADER
     * - Set eventual optional options (like CURLOPT_POST and CURLOPT_POSTFIELDS)
     * - Call curl_exec on the interface
     * - Close the connection
     * - Return the result or throw an exception.
     * </pre>
     * @param mixed $method Request Method (Get/ Post)
     * @param mixed $url URI to get or post to
     * @param mixed $vars Array of variables (only mandatory in POST requests)
     * @param array|false $auth Optional array with 'username' and 'password' keys
     * @return string HTML output
     */
    public static function doRequest($method, $url, $vars = array(), $auth = false)
    {
        $curlInterface = curl_init();

        curl_setopt_array($curlInterface, array(
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_FOLLOWLOCATION => 1,
            CURLOPT_HEADER => 0));
        if (strtoupper($method) == 'POST')
        {
            // A CURLFile value needs a multipart body, so hand curl the raw array;
            // otherwise send a regular urlencoded form.
            $hasFile = false;
            foreach ($vars as $value)
            {
                if ($value instanceof CURLFile)
                {
                    $hasFile = true;
                    break;
                }
            }
            curl_setopt_array($curlInterface, array(
                CURLOPT_POST => 1,
                CURLOPT_POSTFIELDS => $hasFile ? $vars : http_build_query($vars))
            );
        }
        if ($auth !== false)
        {
            curl_setopt($curlInterface, CURLOPT_USERPWD, $auth['username'] . ":" . $auth['password']);
        }
        $result = curl_exec($curlInterface);

        // curl_exec returns false on failure; read the error before closing the handle.
        if ($result === false)
        {
            $error = curl_errno($curlInterface) . " - " . curl_error($curlInterface);
            curl_close($curlInterface);
            throw new Exception('Curl Request Error: ' . $error);
        }

        curl_close($curlInterface);
        return $result;
    }

}

?>

[edit] Read the clarification only now... You probably want to go with one of the tools mentioned above that automates stuff. You could also decide to use a client-side Firefox extension like Chickenfoot for more flexibility. I'll leave the example class above here for future searches.

If you're using CakePHP in your project, or if you're inclined to extract the relevant library, you can use their HTTP client class, HttpSocket. It has the simple page-fetching syntax you describe, e.g.,

# This is the sugar for importing the library within CakePHP       
App::import('Core', 'HttpSocket');
$HttpSocket = new HttpSocket();

$result = $HttpSocket->post($login_url, array(
  "username" => "username",
  "password" => "password"
));

...although it doesn't have a way to parse the response page. For that I'm going to use simplehtmldom, which describes itself as having a jQuery-like syntax: http://net.tutsplus.com/tutorials/php/html-parsing-and-screen-scraping-with-the-simple-html-dom-library/
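
If adding a third-party parser is not an option, PHP's bundled DOM extension can cover the parsing step too. This is a different technique from simplehtmldom (XPath instead of jQuery-like selectors), shown only as a minimal sketch: the helper name extract_input_names and the HTML string are made up for illustration, and the string stands in for whatever response body the HTTP client returned.

```php
<?php
// Hypothetical helper: collect the names of all named <input> fields
// that sit inside a <form> in the given HTML.
function extract_input_names($html) {
    // Real-world HTML is rarely valid; collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $names = array();
    foreach ($xpath->query('//form//input[@name]') as $input) {
        $names[] = $input->getAttribute('name');
    }
    return $names;
}

// Made-up response body for illustration.
$html = '<html><body><form action="/login" method="post">'
      . '<input type="text" name="username">'
      . '<input type="password" name="password">'
      . '<input type="submit" value="Log in">'
      . '</form></body></html>';

$fields = extract_input_names($html);
```

Here $fields ends up holding the two named fields, username and password; the submit button is skipped because it has no name attribute.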

I tend to agree that the bottom line is that PHP doesn't have the awesome scraping/automation libraries that Perl/Ruby have.

Lucas Oman

If you're on a *nix system you could use shell_exec() with wget, which has a lot of nice options.
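
A minimal sketch of that approach, assuming wget is installed and on the PATH. The helper name build_wget_command, the timeout value and the URL are made up for illustration. The part worth copying is escapeshellarg(), since a URL containing & or ; would otherwise be interpreted by the shell.

```php
<?php
// Hypothetical helper: build a wget command that prints the page to stdout.
function build_wget_command($url, $timeoutSeconds = 10) {
    // escapeshellarg() quotes the URL so the shell treats it as one argument.
    return sprintf(
        'wget --quiet --timeout=%d --output-document=- %s',
        (int) $timeoutSeconds,
        escapeshellarg($url)
    );
}

$command = build_wget_command('http://example.com/search?q=php&page=2');

// Uncomment on a system with wget installed:
// $html = shell_exec($command);
```

Note that shell_exec() returns null when the command produces no output or fails, so the result should be checked before parsing it.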
