urllib

Urlretrieve and User-Agent? - Python

女生的网名这么多〃 Submitted on 2019-12-03 10:36:17

I'm using urlretrieve from the urllib module. I cannot seem to find how to add a User-Agent description to my requests. Is it possible with urlretrieve, or do I need to use another method?

First, set the version:

    urllib.URLopener.version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 SE 2.X MetaSr 1.0'

Then:

    filename, headers = urllib.urlretrieve(url)

d.rey: You can use the URLopener or FancyURLopener classes. The 'version' attribute specifies the user agent of the opener object:

    opener = FancyURLopener({})
    opener.version = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36 SE 2.X MetaSr 1.0'
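For Python 3, where urlretrieve lives in urllib.request, a minimal sketch of the same idea using install_opener; the URL, filename, and User-Agent value here are illustrative:

    import urllib.request

    # Install a global opener whose User-Agent replaces the default
    # "Python-urllib/3.x" for every later urlopen/urlretrieve call.
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1)')]
    urllib.request.install_opener(opener)

    # urlretrieve now sends the custom User-Agent with its request.
    filename, headers = urllib.request.urlretrieve(
        'http://example.com/file.txt', 'file.txt')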

ImportError: cannot import name unwrap

Anonymous (unverified) Submitted on 2019-12-03 09:14:57

Question: I have installed scrapy with pip install scrapy. But in the Python shell I am getting an ImportError:

    >>> from scrapy.spider import Spider
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/scrapy/__init__.py", line 56, in <module>
        from scrapy.spider import Spider
      File "/usr/local/lib/python2.7/dist-packages/scrapy/spider.py", line 7, in <module>
        from scrapy.http import Request
      File "/usr/local/lib/python2.7/dist-packages/scrapy/http/__init__.py", line 10, in <module>
        from scrapy

Requests, Mechanize, urllib fails but cURL works

醉酒当歌 Submitted on 2019-12-03 09:12:31

While attempting to access this site through requests, I receive:

    ('Connection aborted.', error(54, 'Connection reset by peer'))

I have also tried to access the site through mechanize and urllib; both failed. However, cURL works fine (see end for code). I have tried requests.get() with combinations of the parameters verify=True and stream=True, and I have also tried a request with the cURL headers. I tried moving to urllib/mechanize as alternatives, but both gave the same error. My code for requests is as follows:

    import requests
    import cookielib

    url = "https://datamuster.marketdatasuite.com/Account
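No answer survives in this truncated excerpt, but one common cause of "connection reset by peer" when cURL succeeds is a TLS version mismatch during the handshake. A hedged sketch of pinning requests to a specific TLS version via a custom transport adapter; whether this particular server needs TLSv1 is an assumption:

    import ssl

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.poolmanager import PoolManager

    class TLSv1Adapter(HTTPAdapter):
        """Transport adapter that pins the TLS handshake to TLSv1."""
        def init_poolmanager(self, connections, maxsize, block=False):
            self.poolmanager = PoolManager(
                num_pools=connections, maxsize=maxsize, block=block,
                ssl_version=ssl.PROTOCOL_TLSv1)

    session = requests.Session()
    session.mount('https://', TLSv1Adapter())
    response = session.get('https://datamuster.marketdatasuite.com/Account')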

page scraping to get prices from google finance

瘦欲@ Submitted on 2019-12-03 09:12:11

I am trying to get stock prices by scraping Google Finance pages. I am doing this in Python, using the urllib package and then a regex to extract the price data. When I leave my Python script running, it works initially for some time (a few minutes) and then starts throwing the exception [HTTP Error 503: Service Unavailable]. I guess this is happening because the web server detects the frequent page requests as a robot and throws this exception after a while. Is there a way around this, i.e. deleting or creating some cookie? Or even better, does Google provide some API? I want to do this in
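The usual remedy for 503 throttling is to slow down and back off rather than to juggle cookies. A minimal sketch (Python 3; the URL, retry counts, and polling interval are illustrative):

    import time
    import urllib.request
    from urllib.error import HTTPError

    def fetch_with_backoff(url, retries=5, base_delay=2.0):
        """Fetch a URL, sleeping exponentially longer after each 503."""
        for attempt in range(retries):
            try:
                with urllib.request.urlopen(url) as resp:
                    return resp.read()
            except HTTPError as e:
                if e.code != 503:
                    raise
                time.sleep(base_delay * (2 ** attempt))
        raise RuntimeError("still throttled after %d retries" % retries)

    while True:
        page = fetch_with_backoff("https://example.com/quote")
        # ...extract the price from `page` here...
        time.sleep(60)  # poll at a polite interval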

Multithreading for faster downloading

与世无争的帅哥 Submitted on 2019-12-03 09:09:54

How can I download multiple links simultaneously? My script below works, but it only downloads one at a time and is extremely slow. I can't figure out how to incorporate multithreading in my script. The Python script:

    from BeautifulSoup import BeautifulSoup
    import lxml.html as html
    import urlparse
    import os, sys
    import urllib2
    import re

    print ("downloading and parsing Bibles...")
    root = html.parse(open('links.html'))
    for link in root.findall('//a'):
        url = link.get('href')
        name = urlparse.urlparse(url).path.split('/')[-1]
        dirname = urlparse.urlparse(url).path.split('.')[-1]
        f = urllib2.urlopen
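A minimal sketch of the threading side of the question: fetch many URLs concurrently with a worker pool. This uses Python 3's concurrent.futures rather than the Python 2 urllib2 of the excerpt; the fetch helper and URL list are illustrative:

    import urllib.request
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def fetch(url):
        """Download one URL and return (url, body)."""
        with urllib.request.urlopen(url) as resp:
            return url, resp.read()

    urls = ["http://example.com/a.html", "http://example.com/b.html"]

    # Up to 8 downloads run at once; results arrive as they finish.
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(fetch, u) for u in urls]
        for fut in as_completed(futures):
            url, data = fut.result()
            print(url, len(data), "bytes")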

python urllib error - AttributeError: 'module' object has no attribute 'request'

Anonymous (unverified) Submitted on 2019-12-03 08:59:04

Question: I am trying out tutorial code which fetches the HTML of a website and prints it. I'm using Python 3.4.0 on Ubuntu. The code:

    import urllib.request
    page = urllib.request.urlopen("http://www.brainjar.com/java/host/test.html")
    text = page.read().decode("utf8")
    print(text)

I saw previous solutions and tried them; I also tried importing only urllib, but it still doesn't work. The error message displayed is as shown:

    Traceback (most recent call last):
      File "string.py", line 1, in <module>
        import urllib.request
      File "/usr/lib/python3.4

Python3: urllib.error.HTTPError: HTTP Error 403: Forbidden

Anonymous (unverified) Submitted on 2019-12-03 08:57:35

Question: Please help me! I am using Python 3.3 and this code:

    import urllib.request
    import sys

    Open_Page = urllib.request.urlopen(
        "http://wowcircle.com"
    ).read().decode().encode('utf-8')

And I get this:

    Traceback (most recent call last):
      File "C:\Users\1\Desktop\WCLauncer\reg.py", line 5, in <module>
        "http://forum.wowcircle.com"
      File "C:\Python33\lib\urllib\request.py", line 156, in urlopen
        return opener.open(url, data, timeout)
      File "C:\Python33\lib\urllib\request.py", line 475, in open
        response = meth(req, response)
      File "C:\Python33\lib\urllib
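The excerpt cuts off before any answer, but a 403 from urlopen very often means the server rejects the default "Python-urllib/3.x" User-Agent. A minimal sketch of sending a browser-like header instead; the header value is illustrative:

    import urllib.request

    # Carry a browser-like User-Agent instead of the default
    # "Python-urllib/3.x", which some servers answer with 403.
    req = urllib.request.Request(
        "http://wowcircle.com",
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:45.0)"},
    )
    page = urllib.request.urlopen(req).read().decode('utf-8')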

Batch downloading text and images from URL with Python / urllib / beautifulsoup?

独自空忆成欢 Submitted on 2019-12-03 08:39:27

I have been browsing through several posts here, but I just cannot get my head around batch-downloading images and text from a given URL with Python.

    import urllib, urllib2
    import urlparse
    from BeautifulSoup import BeautifulSoup
    import os, sys

    def getAllImages(url):
        query = urllib2.Request(url)
        user_agent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 1.0.3705)"
        query.add_header("User-Agent", user_agent)
        page = BeautifulSoup(urllib2.urlopen(query))
        for div in page.findAll("div", {"class": "thumbnail"}):
            print "found thumbnail"
            for img in div.findAll("img"):
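A hedged sketch (Python 2, to match the excerpt) of how the truncated loop could finish: resolve each img src against the page URL and save it with urlretrieve. The download_images name, dest_dir layout, and shortened header value are assumptions:

    import os
    import urllib
    import urllib2
    import urlparse
    from BeautifulSoup import BeautifulSoup

    def download_images(url, dest_dir="images"):
        """Save every <img> found under div.thumbnail on the page."""
        if not os.path.isdir(dest_dir):
            os.makedirs(dest_dir)
        query = urllib2.Request(url, headers={"User-Agent": "Mozilla/4.0"})
        page = BeautifulSoup(urllib2.urlopen(query))
        for div in page.findAll("div", {"class": "thumbnail"}):
            for img in div.findAll("img"):
                src = img.get("src")
                if not src:
                    continue
                # Resolve relative src values against the page URL.
                img_url = urlparse.urljoin(url, src)
                filename = os.path.basename(urlparse.urlparse(img_url).path)
                urllib.urlretrieve(img_url, os.path.join(dest_dir, filename))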

Using the urllib library

我们两清 Submitted on 2019-12-03 05:28:17

urllib library, urllib3 library, and the general crawler workflow.

urllib is a Python standard library for handling network requests; it contains 4 modules. urllib.request is the request module, used to initiate network requests. It is mainly responsible for constructing and sending requests and for adding Headers, proxies, and so on; with it you can simulate a browser's request process: sending requests, adding headers, manipulating cookies, and using proxies.

1. The urlopen method: a simple way to send a network request.
- url: the URL as a string.
- data: by default a GET request is sent; when the data parameter is passed, a POST request is made instead. data must be a bytes object, a file-like object, or an iterable.
- timeout: a timeout in seconds; if the request exceeds this time, an exception is raised. If timeout is not specified, the system default is used. timeout only applies to HTTP, HTTPS, and FTP connections.

2. The Request object: urlopen can issue a basic request, but its few simple parameters are not enough to build a complete request. The more powerful Request object can be used to construct fuller requests.

2.1 Adding request headers: there are two ways, one that adds several headers at once (a dict) and one that adds a single header (a key/value pair). Requests sent through urllib carry a default header: "User-Agent": "Python-urllib/3.*"
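A minimal sketch of the calls described above: a GET through urlopen, a POST via the data parameter, and a Request carrying custom headers both ways; the httpbin URLs are illustrative:

    import urllib.parse
    import urllib.request

    # GET with a timeout in seconds; exceeding it raises an exception.
    resp = urllib.request.urlopen("http://httpbin.org/get", timeout=10)
    print(resp.read().decode("utf-8"))

    # Passing data switches urlopen to POST; data must be bytes.
    payload = urllib.parse.urlencode({"name": "value"}).encode("utf-8")
    resp = urllib.request.urlopen("http://httpbin.org/post", data=payload)

    # A Request object carries headers: a dict sets several at once,
    # add_header() sets a single key/value pair. Either replaces the
    # default "User-Agent: Python-urllib/3.*".
    req = urllib.request.Request("http://httpbin.org/get",
                                 headers={"User-Agent": "Mozilla/5.0"})
    req.add_header("Accept-Language", "en-US")
    resp = urllib.request.urlopen(req)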

ip

喜你入骨 Submitted on 2019-12-03 04:31:44

    # Building a proxy-IP opener
    import urllib.request

    ip = "110.85.155.236:35127"
    proxy = urllib.request.ProxyHandler({"http": ip})
    # print(proxy)
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    url = "https://www.baidu.com"
    data = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
    print(len(data))
    fh = open("C:\\Users\\何\\Desktop\\cold\\ip_baidu.html", "w")
    fh.write(data)
    fh.close()  # close the file
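One caveat, as a hedged sketch: the ProxyHandler above maps only the "http" scheme, so the https request to baidu actually bypasses the proxy. Adding an "https" key routes both schemes through it; the IP is the sample address from the excerpt and is unlikely to still be live:

    import urllib.request

    ip = "110.85.155.236:35127"  # sample proxy from the excerpt
    # Map both schemes so https traffic also goes through the proxy.
    proxy = urllib.request.ProxyHandler({"http": ip, "https": ip})
    opener = urllib.request.build_opener(proxy)
    urllib.request.install_opener(opener)
    html = urllib.request.urlopen("https://www.baidu.com").read().decode("utf-8", "ignore")
    print(len(html))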