Print code from web page with python and urllib

浪子不回头ぞ submitted on 2019-12-06 09:17:15

Question


I'm trying to use Python and urllib to look at the source code of a certain web page. This has worked for me on other web pages, using the following code:

from urllib import *
url = 
code = urlopen(url).read()
print code

But for this page it returns nothing at all. My guess is that this is because the page uses a lot of JavaScript. What should I do?


Answer 1:


Dynamic client side generated pages (JavaScript)

You cannot use urllib alone to see content that is rendered dynamically on the client side (JavaScript). urllib only fetches the server's response, that is, the headers and the body (the raw source). It does not execute any client-side code.
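To make this concrete, here is a minimal sketch in Python 3 (the question uses Python 2, but the point is the same). The page source `RAW_HTML` and the helper `visible_text` are made up for illustration. The script's source is part of the body the server sends, but the text it would write into the page only exists after a browser runs it.

```python
from html.parser import HTMLParser

# Hypothetical page body, as urllib would receive it from the server
RAW_HTML = """<html><body>
<div id="content"></div>
<script>document.getElementById("content").textContent = "Hello";</script>
</body></html>"""


class VisibleText(HTMLParser):
    """Collect text that sits outside <script> tags."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Skip script source and whitespace-only runs
        if not self.in_script and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# The script source is in the response, but nothing has executed it,
# so the <div> is still empty.
print(repr(visible_text(RAW_HTML)))
```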

You can, however, use something like Selenium to remote-control a web browser (Chrome or Firefox). That makes it possible to scrape the page even though it is rendered with JavaScript.

Here is a sample of scraping with selenium: Using python with selenium to scrape dynamic web pages
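As a rough sketch of that approach (not taken from the linked sample), the following assumes the third-party selenium package is installed and that Firefox and its driver are available on the machine. The function name `fetch_rendered_html` is hypothetical.

```python
def fetch_rendered_html(url):
    """Load url in a real browser and return the page source after rendering."""
    # Imported here so this module still loads when Selenium is not installed
    from selenium import webdriver

    driver = webdriver.Firefox()  # webdriver.Chrome() works the same way
    try:
        driver.get(url)            # the browser runs the page's JavaScript
        return driver.page_source  # DOM serialized at the time of the call
    finally:
        driver.quit()              # always close the browser
```

Pages that fill in their content some time after loading may need an explicit wait before `page_source` is read.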

But that is not your problem here

The problem with this site, however, seems to be that they don't want to be scraped: they block clients that send certain HTTP User-Agent headers.
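For context, urllib identifies itself with its own User-Agent value by default, which makes it easy for a server to recognize. The sketch below (Python 3; the helper name `default_user_agent` is made up) prints that default. Whether this particular site filters on exactly this value is an assumption.

```python
import urllib.request


def default_user_agent():
    """Return the User-Agent value a default urllib opener sends."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) tuples added to every request
    return dict(opener.addheaders).get("User-agent")


# Prints something starting with "Python-urllib/", followed by a version
print(default_user_agent())
```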

You can still get the code if you fake the HTTP headers. In Python 2, use urllib2 instead of urllib, like this:

import urllib2
req = urllib2.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox')  # Add fake client
response = urllib2.urlopen(req)
print response.read()
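urllib2 exists only in Python 2. In Python 3 the same classes live in urllib.request, so a rough equivalent might look like the sketch below. The helper names `build_request` and `fetch` are hypothetical; the header value is the one from the snippet above.

```python
import urllib.request

# Same header value as in the Python 2 snippet above
FAKE_UA = ('Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) '
           'Gecko/20091102 Firefox')


def build_request(url, user_agent=FAKE_UA):
    """Create a request that presents itself as a browser."""
    req = urllib.request.Request(url)
    req.add_header('User-Agent', user_agent)  # Add fake client
    return req


def fetch(url):
    """Return the response body as bytes."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()
```

Calling `fetch(url)` with the page's address returns bytes, so decode before printing, for example `print(fetch(url).decode('utf-8', errors='replace'))`.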

But they clearly don't want you to scrape their site, so you should consider whether this is a good idea.



Source: https://stackoverflow.com/questions/17137330/print-code-from-web-page-with-python-and-urllib
