Question
I'm writing a Python 3 script that accesses a page like:
example.com/daora/zz.asp?x=qqrzzt
using urllib.request.urlopen("example.com/daora/zz.asp?x=qqrzzt"), but this code just gives me the same page (example.com/daora/zz.asp?x=qqrzzt), whereas in the browser I get redirected to a page like:
example.com/egg.aspx
What can I do to retrieve
example.com/egg.aspx
and not
example.com/daora/zz.asp?x=qqrzzt
I think this is the relevant code. It is the source of "example.com/daora/zz.asp?x=qqrzzt":
<head>
<script language="JavaScript">
<!--
function Submit()
{
document.formzz.submit();
}
-->
</script>
</head>
<body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" onLoad="javascript:Submit();">
<form name="formZZ" method="post" action="http://example.com/egg.aspx">
<input type="hidden" name="token" value="UFASGFJKASGDJFGAJS">
</form>
Answer 1:
urllib.request follows redirects automatically; you don't need to do anything.
The problem here is that there is no redirect to follow. The web page uses JavaScript to fake a form submission as soon as it's loaded. urllib just fetches the page; it doesn't implement a browser DOM and run JavaScript code.
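To make the distinction concrete, here is a small sketch that runs a throwaway local server. The handler, the /old and /new paths, and the helper name fetch_via_redirect are all made up for illustration. The server answers /old with a real HTTP 302, which urllib follows, and then serves a page containing a script, which urllib returns untouched.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical server: /old redirects to /new, /new serves a script."""

    def do_GET(self):
        if self.path == "/old":
            # A real HTTP redirect: status 302 plus a Location header.
            self.send_response(302)
            self.send_header("Location", "/new")
            self.end_headers()
        else:
            # A "redirect" done in JavaScript: just text as far as urllib cares.
            body = b"<script>location.href = '/elsewhere'</script>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_via_redirect():
    """Request /old and return (final path, response body)."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # Same default handlers urlopen() uses, with proxies disabled so the
    # request is sure to reach the local server.
    opener = build_opener(ProxyHandler({}))
    try:
        base = "http://127.0.0.1:%d" % server.server_port
        with opener.open(base + "/old") as resp:
            return resp.geturl().replace(base, ""), resp.read()
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    path, body = fetch_via_redirect()
    print(path)  # the 302 was followed
    print(body)  # the script was returned as-is, not executed
```

The final URL reported by geturl() reflects the HTTP redirect, while the script in the body is never run, which is the situation the question describes.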
Depending on how general you need your script to be, the simplest solution may be something hacky. For example, if you're just trying to spider 500 pages that all have a similar structure but different details, just find the action of the first form and navigate to that.
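A minimal sketch of that hacky approach, using only the standard library. The names FirstFormParser, build_form_request and fetch_form_target are hypothetical, and the sketch assumes the target accepts a plain URL-encoded submission of the form's named inputs with no cookies or extra headers. It reads the first form's action, method and inputs (such as the hidden token above) and builds the request the browser would have sent.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin, urlsplit
from urllib.request import Request, urlopen


class FirstFormParser(HTMLParser):
    """Collects the action, method and named inputs of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        attrs = dict(attrs)
        if tag == "form" and not self._in_form:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True  # ignore any later forms


def build_form_request(page_url, html):
    """Build the request that submitting the page's first form would send."""
    parser = FirstFormParser()
    parser.feed(html)
    if parser.action is None:
        raise ValueError("no <form> found in page")
    target = urljoin(page_url, parser.action)  # resolves relative actions
    encoded = urlencode(parser.fields)
    if parser.method == "post":
        return Request(target, data=encoded.encode("ascii"))
    # GET forms put the fields in the query string instead of the body.
    return Request(urlsplit(target)._replace(query=encoded).geturl())


def fetch_form_target(page_url):
    """Fetch page_url, submit its first form, and return the result body."""
    with urlopen(page_url) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        html = resp.read().decode(charset, errors="replace")
    with urlopen(build_form_request(page_url, html)) as resp:
        return resp.read()
```

Calling fetch_form_target("http://example.com/daora/zz.asp?x=qqrzzt") would then POST the token to http://example.com/egg.aspx. Note that urlopen needs a full URL including the scheme.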
Also, if fetching the pages and processing them are two distinct steps, you may want to write a fetcher with super-simple Javascript/Greasemonkey (running in the browser, so it's already got a working DOM implementation, etc.) and a separate fancy processing script in Python (which just operates on the finally-fetched/generated HTML pages).
If you need to be fully general, the simplest solution is probably to use the Selenium browser automation framework. (Or, maybe, PyWin32 or PyObjC to automate IE or WebKit directly.)
If you want the best possible solution, and have infinite resources… write your own implementation of the DOM and hook up your favorite JavaScript interpreter (probably SpiderMonkey or V8). That's only about two-thirds as much work as writing a new browser. (And you may be able to find pieces that get you 80% of the way there. For example, if you're willing to use Jython instead of CPython as your Python interpreter, HtmlUnit is pretty slick.)
Source: https://stackoverflow.com/questions/16157719/how-to-follow-a-redirect-with-urllib