Question
urllib2.urlopen("http://www.someURL.com/pageTracker.html").read();
The code above returns the raw source HTML of the page at that URL, just as it would for http://www.google.com.
What do I need to do to get the page rendered, as it is when you visit google.com in a browser? I am essentially trying to 'execute' a URL to trigger a view, not retrieve the HTML.
To clarify a few things:
- I'm not actually concerned about the visual output of the page
- I'm concerned about the page rendering as it would inside of a proper browser so that I can track a Google Analytics goal via the JavaScript on that page.
Answer 1:
Because the Google home page relies partly on JavaScript, you cannot get the rendered HTML with a simple HTTP request or HTML parsing library, as these do not run the JavaScript on the page. Only web browsers render HTML, so you need a browser to get the rendered HTML.
Instead of a simple HTTP request library, you need to use a full-blown headless web browser library.
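To illustrate why a plain HTTP or parsing library is not enough, here is a minimal sketch in Python 3 using only the standard library. The page source, the `out` element id, and the names `DivTextCollector` and `text_seen_by_parser` are made up for this illustration. The parser sees the script as inert text and never runs it, so the `<div>` the script would fill stays empty.

```python
from html.parser import HTMLParser

# Hypothetical page source: the <div> stays empty until a browser runs the script.
PAGE_SOURCE = """<html><body>
<div id="out"></div>
<script>document.getElementById("out").textContent = "tracked";</script>
</body></html>"""


class DivTextCollector(HTMLParser):
    """Collects the text inside <div id="out"> without executing any script."""

    def __init__(self):
        super().__init__()
        self.inside_target = False
        self.text = ""

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "out":
            self.inside_target = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside_target = False

    def handle_data(self, data):
        # Script bodies also arrive here, but only as plain text outside the div.
        if self.inside_target:
            self.text += data


def text_seen_by_parser(source):
    """Return the text a non-browser parser finds inside the target div."""
    collector = DivTextCollector()
    collector.feed(source)
    collector.close()
    return collector.text
```

Calling `text_seen_by_parser(PAGE_SOURCE)` yields an empty string, whereas a browser would show the text written by the script. The same holds for any tracking script on the page: it is downloaded as text but never executed.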
One available option is Selenium and its WebDriver.
https://pypi.python.org/pypi/selenium
Open a page in Selenium. See PyPI for an example.
Wait some time with
time.sleep()
to make sure all resources are loaded and JavaScript-based DOM modifications have settled. The delay depends on the web page; I suggest you experiment with different values.
You can then issue a JavaScript command to the Selenium driver to return the DOM tree of the currently loaded page:
driver.execute_script("return document.documentElement.outerHTML")
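Put together, the steps above might look like the following sketch. The function names, the `settle_seconds` parameter, and its default of 5 seconds are assumptions made for illustration, as is the choice of Firefox. The script reads `document.documentElement.outerHTML`, since `document` itself has no `innerHTML` property.

```python
import time


def fetch_rendered_html(driver, url, settle_seconds=5.0):
    """Load url in a Selenium driver and return the DOM after scripts have run.

    settle_seconds is a hypothetical tuning knob; the right delay depends on the page.
    """
    driver.get(url)
    # Crude wait so that resources load and JavaScript DOM changes settle.
    time.sleep(settle_seconds)
    return driver.execute_script("return document.documentElement.outerHTML")


def rendered_html_with_firefox(url, settle_seconds=5.0):
    """Convenience wrapper; assumes selenium and a Firefox driver are installed."""
    from selenium import webdriver  # third-party: pip install selenium

    driver = webdriver.Firefox()
    try:
        return fetch_rendered_html(driver, url, settle_seconds)
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

For the original goal of triggering a tracking script, the returned HTML can simply be ignored: loading the page in a real browser and waiting is what lets the page's JavaScript run.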
Answer 2:
You might want to try https://code.google.com/p/pywebkitgtk/. Using PyWebkit you can create a rendered view of the HTML page.
Rendering a web page is not a trivial task, as web technology changes constantly. Several rendering engines exist. The most prominent are WebKit (Safari), Blink (Chrome/Chromium, Opera), and Gecko (Firefox). There is also Trident (Internet Explorer).
Google.com also contains JavaScript, which needs to be interpreted. The page should render fine without JavaScript, but it will probably look different.
Source: https://stackoverflow.com/questions/20622870/using-urllib2-to-execute-url-and-return-rendered-html-output-not-the-html-itsel