How can I make LWP::UserAgent look like another browser?

Posted by 一笑奈何 on 2019-12-04 03:30:14

Getting a different webpage with scraping

We have to make one assumption: the web server will return the same output if given the same input. With this assumption we inescapably come to the conclusion that we're not giving it the same input. There are two browsers, or HTTP clients, in this scenario: the one that is giving you the result you want (e.g., Firefox, IE, Chrome, or Safari), and the one that is not giving you the result you want (e.g., LWP, wget, or cURL).

Kill off the easy possibilities first

Before continuing, first make sure the User-Agent strings are the same. You can do this by browsing to whatsmyuseragent.com and setting the User-Agent header of the other client to whatever that website returns. You can also use Firefox's Web Developer Toolbar to disable CSS, JavaScript, Java, and meta-redirects: this will help you track down the problem by killing off the really simple stuff.
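As a minimal sketch of that first step in LWP::UserAgent: the agent string below is only a placeholder, so substitute whatever your working browser actually reports.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Placeholder value: paste the exact string your working browser reports.
my $browser_ua = 'Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0';

my $ua = LWP::UserAgent->new;
$ua->agent($browser_ua);

# Confirm what LWP will now send in the User-Agent header.
print $ua->agent, "\n";
```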

Now attempt to duplicate the working browser

With Firefox you can use Firebug to analyze the request that is sent; you can do this under the Net tab in Firebug. Other browsers should have tools that can do what Firebug does for Firefox; however, if you don't know the tool in question you can still use tshark or wireshark as described below. It is important to note that tshark and wireshark will always be more accurate because they work at a lower level, which at least in my experience leaves less room for error. For example, you'll see things like meta-redirects the browser is doing, which Firebug can sometimes lose track of.

After you understand the first web request (the one that works), do your best to make the second request match it. By this I mean setting the request headers and the other request elements properly. If this still doesn't work, you have to find out what the second client is actually sending to see what is wrong.

Troubleshooting

In order to troubleshoot this, we must fully understand the requests from both clients. The second client is usually trickier: these are often libraries and non-interactive command-line clients that lack the ability to inspect the request. Even if they can dump the request, you might still opt to verify it independently. To do this I suggest the wireshark and tshark suite. Be warned that these tools operate below the browser: by default, you'll see the actual network (IP) packets and data-link frames. You can filter out what you need with a command like this.

sudo tshark -i <interface> -f tcp -R "http.request" -V |
perl -ne'print if /^Hypertext/../^Frame/'

This will capture all of the TCP packets, apply the display filter http.request, and then use perl to print only the HTTP (application-layer) part of each packet. You might also want to extend the display filter to grab only a single web server: -R "http.request and http.host == ''". On tshark 1.10 and later, the display-filter option is -Y rather than -R.

You're going to want to check everything to see whether the two requests line up: cookies, the GET URL, the User-Agent, and so on. Make sure the site doesn't do something goofy.
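Once you have both requests as text, the comparison itself can be scripted. The following is only a sketch: the two captured requests are made-up examples, and you would paste in the real ones from tshark or from your browser's tools.

```perl
use strict;
use warnings;
use HTTP::Request;

# Hypothetical captures: replace with the raw request text from each client.
my $browser = HTTP::Request->parse(<<'REQ');
GET /page HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0
Accept: text/html
Cookie: session=abc

REQ

my $script = HTTP::Request->parse(<<'REQ');
GET /page HTTP/1.1
Host: example.com
User-Agent: libwww-perl/6.05

REQ

# Collect every header name seen in either request (case-insensitively).
my %names = map { lc($_) => 1 }
    $browser->header_field_names, $script->header_field_names;

# Record each header whose value differs or is missing on one side.
my @diff;
for my $name (sort keys %names) {
    my $from_browser = $browser->header($name) // '(missing)';
    my $from_script  = $script->header($name)  // '(missing)';
    push @diff, "$name: browser=[$from_browser] script=[$from_script]"
        if $from_browser ne $from_script;
}
print "$_\n" for @diff;
```

With these sample requests the script reports the accept, cookie, and user-agent headers; Host is identical and is not listed.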

Updated Jan 23 2010: Based on the new information I would suggest setting Accept, Accept-Language, Accept-Charset, and Accept-Encoding. You can do that through $ua->default_headers(). If you need a lot more functionality out of your user agent, you can always subclass it. I took this approach for my GData API; you can find my example of a UserAgent subclass on GitHub.
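A sketch of setting those four headers, using the per-header default_header() setter. The header values here are illustrative guesses at what a browser might send, not values taken from the original question; copy the real ones from your working browser.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Message;

my $ua = LWP::UserAgent->new;

# Illustrative values only: copy the real ones from the working browser.
$ua->default_header('Accept'          => 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8');
$ua->default_header('Accept-Language' => 'en-US,en;q=0.5');
$ua->default_header('Accept-Charset'  => 'utf-8');

# Advertise only the encodings that decoded_content() can actually undo.
$ua->default_header('Accept-Encoding' => scalar HTTP::Message::decodable());

# Show the headers that will be added to every request from this agent.
print $ua->default_headers->as_string;
```

If you advertise compressed encodings this way, read the body with $response->decoded_content rather than $response->content.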

You should probably look at WWW::Mechanize, which is a subclass of LWP::UserAgent that is oriented towards that sort of website automation. In particular, see the agent_alias method.
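For illustration, a short sketch of agent_alias. It assumes WWW::Mechanize is installed from CPAN; the exact User-Agent string behind each alias depends on the installed version, so the script prints it rather than assuming it.

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;

# Pick one of the built-in browser aliases.
$mech->agent_alias('Windows Mozilla');
print "Now sending: ", $mech->agent, "\n";

# List every alias this version of WWW::Mechanize knows about.
print "Available alias: $_\n" for $mech->known_agent_aliases;
```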

Some websites do block connections based on the User-Agent, but you can set that to whatever you want using Perl. It's possible that a website might also look for other request headers normally generated by a particular browser (like the Accept header) and refuse connections that don't include them, but you can add those headers too, if you figure out what it's looking for.

In general, it's impossible for a website to prevent a different client from impersonating a supported browser. No matter what it's looking for, you can eventually duplicate it.

It's also possible that it's looking for JavaScript support. In that case, you might look at WWW::Scripter, which is a subclass of WWW::Mechanize that adds JavaScript support. It's fairly new and I haven't tried it yet.

This thread is almost certainly not about merely changing User Agent.

I see two paths. Either we experiment with turning off JavaScript and CSS in the browser and learn more about the HTTP::Request and HTTP::Response objects while relying on LWP::UserAgent, or we go to WWW::Scripter and use JavaScript.

Even in crude Craigslist text ads, there are three pages of densely packed, almost space-free JavaScript and CSS, and then they load more specialized code: if I come in via Comcast, I find that special JavaScript targeting Comcast users has been loaded into the final page. In their attempt to break robots, they put code in the HEAD which lawyers the difference between HTML 1.0 and 1.1 to say that something is a little bit wrong and you need an HTTP refresh, and then hit you with extra code to snoop out your ISP, cookie information for sure, and who knows what else. (You can print out cookies at every turn once you learn how to slow LWP down and insert callback code that snoops like *shark but inside Perl. You can also see how the server keeps trying to change "your" headers and re-negotiate "your" request: oh, you don't want to buy a cheap car, you want to buy a Maserati and mortgage your house to do it, i.e. snoop your ISP, and why not your contacts and all your Google history? Who knows?)

CL puts a random ID number into Alice's HEAD, then whispers that you need an HTTP request to swallow the red pill and stop hiding it under your tongue. That way most robots choke and accept a fake sanitized page, i.e. a truncated "home page". Also, if I scrape URLs from the page, I can't "click" on them using LWP because I never learned my ID, nor did I learn the JavaScript that parrots the ID back before a $ua->get("$url&ID=9dd887f8f89d9"); or maybe the simple get would work with &ID. It's way more than the User-Agent, but you can do it and you're getting all the help you need from

As you can see, the first path is to turn all that off and see if you can learn your re-negotiated request's URI (not the original URL, but the URI). Then get it, with no JavaScript and no WWW::Scripter. It sounds like LWP will work for you. I would like to hear more about changing the Accept headers in default_header initially, and whether the server answers by re-negotiating the Request object with its own list of what to accept. You can snoop on that by inserting callbacks into the request and response conversation.
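One way to insert such callbacks is LWP::UserAgent's add_handler. The sketch below logs each outgoing request and each finished response. So that it runs without touching the network, the request_send handler also returns a canned response while the $offline flag is set; that flag and the stub response are assumptions added for this example, not part of LWP.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

my $ua      = LWP::UserAgent->new;
my $offline = 1;    # illustration only: set to 0 to reach the real server
my @log;

# Called just before the request would go out on the wire.
$ua->add_handler(request_send => sub {
    my ($request, $agent, $handler) = @_;
    push @log, "--- request ---\n" . $request->as_string;
    # Returning a response short-circuits the network; undef lets LWP proceed.
    return $offline
        ? HTTP::Response->new(200, 'OK', ['Content-Type' => 'text/plain'], "stub\n")
        : undef;
});

# Called once a response has been fully received.
$ua->add_handler(response_done => sub {
    my ($response, $agent, $handler) = @_;
    push @log, "--- response ---\n" . $response->status_line . "\n"
             . $response->headers->as_string;
    return;
});

my $r = $ua->get('http://example.com/');
print @log;
```

With $offline set to 0, the same two handlers show the headers and cookies exchanged at every hop, including redirects.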

The second path, WWW::Scripter, is only for when we decide to swallow the red pill and go down Alice's rabbit hole, a.k.a. the Matrix. Perl philosophy dictates exhausting other possibilities before working harder. Otherwise we wouldn't have learned our HTTP 101 prerequisites, so escalating to a bigger hammer would be just that, or dropping acid for aspirin, or not?

I tried a number of different values for

$ua->agent("");

but nothing seems to work.

Well, would you like to tell us what those things you tried were?

What I normally do is type

javascript:prompt('your agent string is',navigator.userAgent)

into my regular browser's URL bar, hit enter, and cut and paste what it tells me. Surely using wireshark and monitoring actual packets is overkill? The website you're trying to get to has no way of knowing you're using Perl. Just tell it whatever it expects to hear.

Tools: Firefox with TamperData and LiveHTTPHeaders, Devel::REPL, LWP.

Analysis: In the browser, turn off Javascript and Java, delete any cookies from the target web site, start TamperData logging, log in to web site. Stop TamperData logging and look back through the many requests you likely placed during the login process. Find the first request (the one you made on purpose) and look at its details.

Experimentation: Start re.pl, and start recreating the browser's interaction.

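use LWP::UserAgent;
use HTTP::Cookies;    # not loaded by "use LWP::UserAgent" on its own
use HTTP::Headers;

# $the_UA_of_the_browser, %the_headers_sent_by_the_browser and $the_URL
# are placeholders for the values captured from the browser.
my $ua = LWP::UserAgent->new(
  agent      => $the_UA_of_the_browser,
  cookie_jar => HTTP::Cookies->new(hide_cookie2 => 1),
);
# Include User-Agent in this hash too: replacing the default headers
# wholesale may drop the agent string set above.
$ua->default_headers(HTTP::Headers->new(
  %the_headers_sent_by_the_browser,
));

my $r = $ua->get($the_URL);
# Swap in the decoded body so the dump is readable.
$r->content($r->decoded_content); print $r->as_string;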

So that's step one. If you get mismatched responses at any point, you did something wrong. You can usually[1] find out what by looking at $r->request and comparing with the request Firefox sent. The important thing is to remember that there is no magic and that you know everything the server knows. If you can't get the same response to what appears to be the same request, you missed something.

Getting to the first page is usually not enough. You'll likely need to parse forms (with HTML::Form), follow redirects (as configured above, UA does that automatically, but sometimes it pays to turn that off and do it by hand), and try to reverse engineer a weirdly-hacked-together login sequence from the barest of hints. Good luck.
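A rough sketch of the form-parsing step, assuming the HTML::Form distribution is installed. The HTML, field names, and URL are invented for the example; in practice you would parse the response you just fetched. The user agent is created with max_redirect => 0 so that redirects can be followed by hand, and no request is actually sent here.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::Form;

# Turn off automatic redirects so each hop can be inspected manually.
my $ua = LWP::UserAgent->new(max_redirect => 0);

# Made-up login page; normally this is $response->decoded_content.
my $html = <<'HTML';
<form action="/login" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="user">
  <input type="password" name="pass">
  <input type="submit" name="go" value="Log in">
</form>
HTML

# The second argument is the base URI used to resolve the form action.
my ($form) = HTML::Form->parse($html, 'http://example.com/start');
$form->value(user => 'alice');
$form->value(pass => 'secret');

# click() builds the HTTP::Request a browser would send, hidden fields included.
my $req = $form->click;
print $req->as_string;

# To send it and follow a redirect by hand:
#   my $res  = $ua->request($req);
#   my $next = $res->header('Location') if $res->is_redirect;
```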

[1]: Except in the case of certain bugs in LWP's cookies implementation that I won't detail here. And even then you can spot it if you know what you're looking for.

Is your Perl script running on the same machine as the Firefox browser you reference? The site could be filtering based on subnet or incoming IP address. Your URL is https, so there could also be some PSK (pre-shared key) or certificate loaded in your browser that the server is expecting. That is extremely unlikely outside of a company's internal intranet site.

HoldOffHunger

I just noticed something. This line:

my $res = $ua->request(GET $url);

It doesn't work on my machine at all. But I got it to work by changing it to:

my $res = $ua->get($url);
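One possible explanation, offered as an assumption since the failing script isn't shown: the bare GET function is exported by HTTP::Request::Common, so the first form only works when that module has been loaded. A small sketch that builds such a request without sending it:

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(GET);

# Hypothetical URL for illustration.
my $url = 'http://example.com/';

# GET returns an HTTP::Request; extra pairs become request headers.
my $req = GET $url, Referer => 'http://google.com';
print $req->as_string;

# It can then be sent with: my $res = $ua->request($req);
```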

Adding the referrer portion made it work for me:

my $req = HTTP::Request->new(GET => $url);
$req->header(Accept => "text/html, */*;q=0.1", referer => 'http://google.com');
my $res = $ua->request($req);
print $res->status_line;