using urllib2 to execute URL and return rendered HTML output, not the HTML itself [duplicate]

Submitted by 和自甴很熟 on 2020-01-16 01:01:21

Question


import urllib2
urllib2.urlopen("http://www.someURL.com/pageTracker.html").read()

The code above will return the source HTML at http://www.google.com.

What do I need to do to actually return the rendered HTML that you see when you visit google.com? I am essentially trying to 'execute' a URL to trigger a view, not retrieve the HTML.

To clarify a few things:

  • I'm not actually concerned about the visual output of the page
  • I'm concerned about the page rendering as it would inside of a proper browser so that I can track a Google Analytics goal via the JavaScript on that page.

Answer 1:


Because the Google home page relies in part on JavaScript, you cannot get the rendered HTML with a simple HTTP request or HTML parsing library: these fetch the markup but do not run the JavaScript on the page. Only web browsers render HTML and execute its scripts, so you need a browser to get the rendered HTML.
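To make that concrete, here is a minimal sketch (Python 3, standard library only) that lists the scripts present in raw page source. The sample markup, the `trackGoal` call, and the `find_scripts` helper are all hypothetical stand-ins for whatever a plain HTTP fetch returns; the point is only that the scripts arrive as text and nothing has executed them.

```python
from html.parser import HTMLParser


class ScriptFinder(HTMLParser):
    """Collect the src of external scripts and count inline ones."""

    def __init__(self):
        super().__init__()
        self.external = []
        self.inline = 0

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            self.external.append(src)
        else:
            self.inline += 1


def find_scripts(raw_html):
    """Return (external script URLs, inline script count) for raw HTML text."""
    finder = ScriptFinder()
    finder.feed(raw_html)
    return finder.external, finder.inline


# Hypothetical page source, standing in for what a plain HTTP fetch returns.
sample = """<html><head>
<script src="https://example.com/analytics.js"></script>
<script>trackGoal("signup");</script>
</head><body><p>Hello</p></body></html>"""

external, inline = find_scripts(sample)
print(external, inline)  # ['https://example.com/analytics.js'] 1
```

Both scripts are visible in the source, but an HTTP library never runs them, so any tracking call they contain is never made.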

Instead of a simple HTTP request library, you need to use a full-blown headless web browser library.

One available option is Selenium and its WebDriver.

https://pypi.python.org/pypi/selenium

  1. Open a page in Selenium. See PyPI for the example.

  2. Wait some time with time.sleep() to make sure all resources are loaded and JavaScript-based DOM modifications have settled. The delay depends on the web page, so I suggest you experiment with different values.

  3. You can issue a JavaScript command to the Selenium driver to return the DOM tree of the currently loaded page:

    driver.execute_script("return document.documentElement.outerHTML")
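
The three steps can be put together roughly as follows. This is a sketch, not a definitive implementation: it assumes a recent Selenium release with Chrome and a matching driver installed, and the names `make_headless_chrome`, `fetch_rendered_html`, and `settle_seconds` are illustrative, not part of Selenium. The two-second default delay is an arbitrary starting point to tune per page.

```python
import time


def make_headless_chrome():
    """Build a headless Chrome driver (assumes selenium and Chrome are installed)."""
    # Imported lazily so fetch_rendered_html has no hard dependency on selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def fetch_rendered_html(url, driver_factory=make_headless_chrome, settle_seconds=2.0):
    """Load url in a browser, wait for scripts to run, and return the serialized DOM."""
    driver = driver_factory()
    try:
        driver.get(url)             # step 1: open the page
        time.sleep(settle_seconds)  # step 2: let scripts and resources settle
        # step 3: serialize the DOM as it stands after JavaScript has run
        return driver.execute_script("return document.documentElement.outerHTML")
    finally:
        driver.quit()               # always close the browser, even on errors
```

Calling `fetch_rendered_html("http://www.someURL.com/pageTracker.html")` would load the page in a real browser, so any tracking JavaScript on it runs as it would for a visitor. Passing the driver in through a factory also makes it easy to swap in Firefox or a stub for testing.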
    



Answer 2:


You might want to try https://code.google.com/p/pywebkitgtk/. Using PyWebkit you can create a rendered view of the HTML page.

Rendering a web page is not a trivial task, as web technology is changing constantly. Several rendering engines exist. The most prominent are WebKit (Safari), its fork Blink (Chrome/Chromium, Opera), and Gecko (Firefox). There is also Trident (Internet Explorer).

Google.com also contains JavaScript, which needs to be interpreted. The page should render fine without JavaScript, but it will probably look different.



Source: https://stackoverflow.com/questions/20622870/using-urllib2-to-execute-url-and-return-rendered-html-output-not-the-html-itsel
